[SageMaker] Computer Vision & Large Scale Training ***
SageMaker Fridays Season 2, Episode 6 - Computer vision & large scale training (November 2020)
Images, and training from scratch: this is exactly where distributed ML training shows its value.
This project explains how to train computer vision models on large-scale datasets. Starting from the ImageNet dataset, we use SageMaker to train a model with the built-in image classification algorithm and 64 GPUs! We also discuss SageMaker features that help you scale, such as RecordIO files, pipe mode, distributed training and GPU instances.
Dataset
Dataset: http://image-net.org/explore
Experiment 1
21:00 / 49:15, data_prep.txt
25:51 / 49:15, notebook.
8 GPUs per instance * 8 instances = 64 GPUs
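The episode's notebook isn't reproduced here, but the training setup looks roughly like the sketch below (SageMaker Python SDK v1 style, matching the train_instance_count usage later in these notes; bucket paths and most hyperparameter values are placeholders rather than the episode's actual settings):
import sagemaker
from sagemaker import get_execution_role
from sagemaker.amazon.amazon_estimator import get_image_uri
from sagemaker.session import s3_input

sess = sagemaker.Session()
role = get_execution_role()
training_image = get_image_uri(sess.boto_region_name, 'image-classification')

ic = sagemaker.estimator.Estimator(training_image,
                                   role,
                                   train_instance_count=8,               # 8 instances...
                                   train_instance_type='ml.p3.16xlarge', # ...with 8 GPUs each = 64 GPUs
                                   input_mode='Pipe',                    # stream RecordIO files instead of copying them first
                                   output_path='s3://<BUCKET>/output',   # placeholder
                                   sagemaker_session=sess)

ic.set_hyperparameters(num_layers=50,                # placeholder network depth
                       num_classes=1000,
                       num_training_samples=1281167, # ImageNet-1K training set size
                       mini_batch_size=1024,         # placeholder
                       epochs=30)                    # placeholder

train_data = s3_input('s3://<BUCKET>/imagenet/train', content_type='application/x-recordio')
val_data = s3_input('s3://<BUCKET>/imagenet/validation', content_type='application/x-recordio')
ic.fit({'train': train_data, 'validation': val_data})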
I. A few MXNet key points
Part 1: Training MXNet — part 1: MNIST
Part 2: Training on CIFAR-10
Part 3: CIFAR-10 redux
Part 4: Distributed training [excellent]
Part 5: Distributed training, EFS edition [excellent]
Part 1: on a machine with multiple GPUs, simply use more of them.
Ref: https://github.com/juliensimon/aws/tree/master/mxnet/mnist
mod = mx.mod.Module(lenet)                                                 # default: CPU
# mod = mx.mod.Module(lenet, context=mx.gpu(0))                           # one GPU
# mod = mx.mod.Module(lenet, context=(mx.gpu(0), mx.gpu(1), mx.gpu(2)))   # three GPUs
Part 2: to tackle the more challenging CIFAR-10 dataset, switch to ResNeXt-101.
Fine-tuning means that we’re going to keep all layers and pre-trained weights unchanged, except for the last layer
sym, arg_params, aux_params = mx.model.load_checkpoint("resnext-101",0)
mod = mx.mod.Module(symbol=sym, context=(mx.gpu(0), mx.gpu(1), mx.gpu(2), mx.gpu(3)))
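The "replace the last layer" step is the usual MXNet fine-tuning pattern: cut the pre-trained symbol just before its old classifier and graft on a new output layer, then build the Module from the modified symbol. A minimal sketch (function and layer names are illustrative, not necessarily the blog post's exact code):
def get_fine_tune_model(sym, arg_params, num_classes, layer_name='flatten0'):
    # Cut the pre-trained network at the layer feeding the old classifier...
    all_layers = sym.get_internals()
    net = all_layers[layer_name + '_output']
    # ...and graft on a fresh fully-connected layer + softmax for the new classes.
    net = mx.symbol.FullyConnected(data=net, num_hidden=num_classes, name='fc_new')
    net = mx.symbol.SoftmaxOutput(data=net, name='softmax')
    # Keep every pre-trained weight except those of the old output layer.
    new_args = {k: v for k, v in arg_params.items() if not k.startswith('fc')}
    return net, new_args

new_sym, new_args = get_fine_tune_model(sym, arg_params, num_classes=10)  # CIFAR-10 has 10 classes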
Part 3: learning rates for full training from scratch.
The focus is on how to improve validation accuracy.
Gradually reducing the learning rate is a key technique in improving validation accuracy.
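In MXNet this is typically done with a learning-rate scheduler passed to the optimizer. A minimal sketch, reusing mod / train_iter / valid_iter / epochs from the snippet below and assuming a batch_size variable (the step boundaries and decay factor are illustrative, not the post's actual values):
# Divide the learning rate by 10 after 100 and 200 epochs (expressed in steps).
steps_per_epoch = 50000 // batch_size   # CIFAR-10 has 50,000 training images
schedule = mx.lr_scheduler.MultiFactorScheduler(step=[100 * steps_per_epoch, 200 * steps_per_epoch], factor=0.1)

mod.fit(train_iter, eval_data=valid_iter,
        optimizer='sgd',
        optimizer_params={'learning_rate': 0.1, 'momentum': 0.9, 'lr_scheduler': schedule},
        num_epoch=epochs)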
But we could also do without all these parameters, thanks to the AdaDelta optimizer.
# Use ResNext-110
sym = resnext.get_symbol(10, 110, "3,32,32")
mod = mx.mod.Module(symbol=sym, context=(mx.gpu(0), mx.gpu(1), mx.gpu(2), mx.gpu(3)))
mod.bind(data_shapes=train_iter.provide_data, label_shapes=train_iter.provide_label)
mod.init_params(initializer=mx.init.Xavier())
mod.fit(train_iter, eval_data=valid_iter, optimizer='adadelta', num_epoch=epochs)
Part 4 (skimmed for now).
More precisely, it took 12+ hours using all 4 GPUs of a g2.8xlarge instance.
Could we go faster? Sure, I could use a p2.16xlarge instance. That’s as large as GPU servers get.
Even faster? We need distributed training.
As I mentioned before, it took about 12 hours to run 300 epochs on the 4 GPUs of a g2.8xlarge instance.
The combined 32 GPUs of the 4 p2.8xlarge instances did it in 91 minutes!
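Part 4's mechanics are only skimmed here, but the core MXNet change is the key-value store that synchronizes gradients across machines. A minimal sketch, reusing the names from the Part 3 snippet above and leaving out the cluster launch (MXNet's launch.py / SSH setup):
# Every worker runs the same script; the distributed kvstore synchronizes
# gradients across all machines after each batch.
kv = mx.kvstore.create('dist_device_sync')
mod = mx.mod.Module(symbol=sym, context=[mx.gpu(i) for i in range(8)])  # 8 GPUs per p2.8xlarge
mod.fit(train_iter, eval_data=valid_iter,
        kvstore=kv,
        optimizer='sgd',
        optimizer_params={'learning_rate': 0.1},
        num_epoch=epochs)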
Part 5 (skimmed for now).
It shows how to share the dataset across all instances with Amazon EFS, a managed service fully compatible with NFS v4.1.
Distributed Deep Learning Made Easy (a CloudFormation stack).
That was the original approach, back in 2017. Is there a ready-made API for this now?
II. train_instance_count
Ref: Scale up Training of Your ML Models with Distributed Training on Amazon SageMaker
-
Does this parameter actually make a difference?!
mnist_estimator = TensorFlow(entry_point='mnist.py',
                             role=role,
                             train_instance_count=1,
                             train_instance_type='ml.p3.2xlarge',
                             framework_version='1.15.2',
                             py_version='py3',
                             distributions={'parameter_server': {'enabled': True}})
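The job is then launched along these lines (the S3 URI and channel name are placeholders; the channel name has to match what mnist.py reads):
mnist_estimator.fit({'training': 's3://<BUCKET>/mnist'})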
Parameter Server - Built-in Algorithms, 7:10 / 15:18
Parameter Server - TensorFlow Script Mode, 8:14 / 15:18 [the setup above is of this type]
Some key excerpts from the training log:
"num_cpus": 8, "num_gpus": 1, INFO:tensorflow:loss = 0.09097028, step = 29500 (0.269 sec) INFO:tensorflow:loss = 0.09097028, step = 29500 (0.269 sec) INFO:tensorflow:global_step/sec: 369.804 INFO:tensorflow:global_step/sec: 369.804 INFO:tensorflow:loss = 0.15778117, step = 29600 (0.270 sec) INFO:tensorflow:loss = 0.15778117, step = 29600 (0.270 sec) INFO:tensorflow:global_step/sec: 366.008 INFO:tensorflow:global_step/sec: 366.008 INFO:tensorflow:loss = 0.13056618, step = 29700 (0.273 sec) INFO:tensorflow:loss = 0.13056618, step = 29700 (0.273 sec) INFO:tensorflow:global_step/sec: 370.761 INFO:tensorflow:global_step/sec: 370.761 INFO:tensorflow:loss = 0.026432596, step = 29800 (0.270 sec) INFO:tensorflow:loss = 0.026432596, step = 29800 (0.270 sec) INFO:tensorflow:global_step/sec: 369.304 INFO:tensorflow:global_step/sec: 369.304 INFO:tensorflow:loss = 0.059833053, step = 29900 (0.271 sec) INFO:tensorflow:loss = 0.059833053, step = 29900 (0.271 sec)
INFO:tensorflow:Saving dict for global step 30000: accuracy = 0.9753, global_step = 30000, loss = 0.07867345
INFO:tensorflow:Loss for final step: 0.079707995.
debug 0111: duration is 93.74298906326294
2021-01-11 08:59:35 Uploading - Uploading generated training model
2021-01-11 08:59:35 Completed - Training job completed
Training seconds: 189
Billable seconds: 189
wall time is 344.06185269355774
For comparison, with train_instance_count=2, why does there seem to be no speedup at all?
# The two workers appear to run independently of each other
"num_cpus": 8, "num_gpus": 1,
INFO:tensorflow:loss = 0.0903343, step = 29240 (5.858 sec)
INFO:tensorflow:global_step/sec: 57.0652
INFO:tensorflow:loss = 0.07312758, step = 29322 (2.485 sec)
INFO:tensorflow:global_step/sec: 57.8806
INFO:tensorflow:global_step/sec: 56.3061
INFO:tensorflow:loss = 0.045585766, step = 29465 (2.538 sec)
INFO:tensorflow:global_step/sec: 56.88
INFO:tensorflow:loss = 0.03344103, step = 29574 (5.872 sec)
INFO:tensorflow:loss = 0.09144282, step = 29608 (2.522 sec)
INFO:tensorflow:global_step/sec: 56.1107
INFO:tensorflow:global_step/sec: 57.3806
INFO:tensorflow:loss = 0.11538148, step = 29750 (2.487 sec)
INFO:tensorflow:global_step/sec: 56.2365
INFO:tensorflow:loss = 0.098466985, step = 29894 (2.564 sec)
INFO:tensorflow:loss = 0.1305417, step = 29906 (5.866 sec)
INFO:tensorflow:global_step/sec: 56.342
INFO:tensorflow:Saving dict for global step 30002: accuracy = 0.9779, global_step = 30002, loss = 0.073841065
INFO:tensorflow:Loss for final step: 0.15788993.
debug 0111: duration is 545.3711259365082
INFO:tensorflow:Loss for final step: 0.07843175.
debug 0111: duration is 551.2653501033783
2021-01-11 09:36:18 Completed - Training job completed
ProfilerReport-1610356840: IssuesFound
Training seconds: 1560
Billable seconds: 1560
wall time is 959.4712088108063
III. SageMaker Distributed Training
Official tutorial: Distributed Training
Three examples: Distributed Training Jupyter Notebook Examples
For vision (image) models, try MNIST or MaskRCNN. For language (text) models, try BERT.
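With the current (v2) SageMaker Python SDK used by those examples, the data-parallel library is switched on via the estimator's distribution argument rather than instance count alone. A minimal sketch with TensorFlow 2 (entry point, instance count and S3 path are placeholders):
from sagemaker.tensorflow import TensorFlow

estimator = TensorFlow(entry_point='train.py',            # placeholder training script
                       role=role,
                       instance_count=2,
                       instance_type='ml.p3.16xlarge',    # SMDataParallel needs p3.16xlarge, p3dn.24xlarge or p4d.24xlarge
                       framework_version='2.3.1',
                       py_version='py37',
                       distribution={'smdistributed': {'dataparallel': {'enabled': True}}})
estimator.fit('s3://<BUCKET>/dataset')
The training script itself also needs the Horovod-style smdistributed.dataparallel changes shown in those notebooks.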
-
NVIDIA
It turns out they are all collected here:
Ref: https://github.com/NVIDIA/DeepLearningExamples
Ref: https://github.com/HerringForks/DeepLearningExamples
Models | Framework | A100 | AMP | Multi-GPU | Multi-Node | TRT | ONNX | Triton | DLC | NB |
---|---|---|---|---|---|---|---|---|---|---|
ResNet-50 | PyTorch | Yes | Yes | Yes | - | Yes | - | Yes | Yes | - |
ResNeXt-101 | PyTorch | Yes | Yes | Yes | - | Yes | - | Yes | Yes | - |
SE-ResNeXt-101 | PyTorch | Yes | Yes | Yes | - | Yes | - | Yes | Yes | - |
Mask R-CNN | PyTorch | Yes | Yes | Yes | - | - | - | - | - | Yes |
SSD | PyTorch | Yes | Yes | Yes | - | - | - | - | - | Yes |
ResNet-50 | TensorFlow | Yes | Yes | Yes | - | - | - | - | Yes | - |
ResNeXt101 | TensorFlow | Yes | Yes | Yes | - | - | - | - | Yes | - |
SE-ResNeXt-101 | TensorFlow | Yes | Yes | Yes | - | - | - | - | Yes | - |
Mask R-CNN | TensorFlow | Yes | Yes | Yes | - | - | - | - | Yes | - |
SSD | TensorFlow | Yes | Yes | Yes | - | - | - | - | Yes | Yes |
U-Net Ind | TensorFlow | Yes | Yes | Yes | - | - | - | - | Yes | Yes |
U-Net Med | TensorFlow | Yes | Yes | Yes | - | - | - | - | Yes | - |
U-Net 3D | TensorFlow | Yes | Yes | Yes | - | - | - | - | Yes | - |
V-Net Med | TensorFlow | Yes | Yes | Yes | - | - | - | - | Yes | - |
U-Net Med | TensorFlow2 | Yes | Yes | Yes | - | - | - | - | Yes | - |
Mask R-CNN | TensorFlow2 | Yes | Yes | Yes | - | - | - | - | Yes | - |
ResNet-50 | MXNet | - | Yes | Yes | - | - | - | - | - | - |
-
MaskRCNN
The metric for judging the reduction in training time is total_steps: the total number of training steps, equal to epochs * sample_count / batch_size (where sample_count is the total number of samples and epochs is the number of passes over the dataset).
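As a quick sanity check of that formula, with purely illustrative numbers (not the values from the Mask R-CNN example):
sample_count = 118287   # e.g. the COCO 2017 train split (illustrative)
epochs = 12             # illustrative
batch_size = 64         # global batch size across all GPUs (illustrative)

total_steps = epochs * sample_count // batch_size
print(total_steps)      # 22178 -- more GPUs => larger global batch => fewer total steps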
subnets = ['<SUBNET_ID>']                      # Should be the same subnet used for FSx. Example: subnet-0f9XXXX
security_group_ids = ['<SECURITY_GROUP_ID>']   # Should be the same security group used for FSx. Example: sg-03ZZZZZZ
job_name = 'tf2-smdataparallel-mrcnn-fsx'      # Used as a prefix for the SageMaker training job name, which makes the job easy to find in the SageMaker Training console.
file_system_id = '<FSX_ID>'                    # FSx file system ID with your training dataset. Example: 'fs-0bYYYYYY'
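Those VPC settings exist because the training instances mount the dataset directly from FSx for Lustre; the input channel is then defined roughly as follows (the directory path is a placeholder; FileSystemInput comes from the SageMaker Python SDK):
from sagemaker.inputs import FileSystemInput

train_fs = FileSystemInput(file_system_id=file_system_id,
                           file_system_type='FSxLustre',
                           directory_path='/<MOUNT_NAME>/mask-rcnn/train',  # placeholder path inside the FSx file system
                           file_system_access_mode='ro')
# estimator.fit({'train': train_fs})  -- the estimator must also be given the subnets and security_group_ids above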
What is an Amazon FSx for Lustre file system?
/* implement */