Training a Locally Written Network Model with SageMaker Managed Containers

1. Experiment overview

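This experiment uses SageMaker's managed containers to train code that was written locally. The code is uploaded to a Jupyter notebook, which calls the SageMaker API to launch a managed container and train the model.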

1.1 Project directory structure

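The experiment code lives in the TensorFlow folder: models holds the neural network model code, and tools holds the data loader. sagemaker.ipynb contains the code that calls the SageMaker container API (I use it to request a SageMaker container to train the model).

|—TensorFlow
    |—models
    |—tools
|—sagemaker.ipynb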

1.2 Script mode

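For convenience, this experiment uses the script mode of the SageMaker managed containers. Script mode is straightforward: write the training code as usual, and have the script read the training environment variables that the managed container provides.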

1.3 Deep learning framework

SageMaker supports many deep learning frameworks, including MXNet, TensorFlow, and PyTorch (the most widely used in China).

I use TensorFlow 2.x. Compared with TensorFlow 1.x, TensorFlow 2.x offers higher-level APIs that are more convenient and more powerful, and it runs in eager execution mode by default.

2. Writing the classic deep learning model VGG16

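The model uses VGG16 as its baseline. VGG16 has strong fitting capacity, so to prevent overfitting I use Dropout, BatchNormalization, and L2 kernel regularizers, and I also apply data augmentation to the dataset.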

import tensorflow.keras as keras
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, Activation, Flatten
from tensorflow.keras.layers import Conv2D, MaxPooling2D, BatchNormalization
from tensorflow.keras import regularizers

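def vgg_16_optim_Sequential():
    """VGG16-style network with L2 weight decay, BatchNormalization, and Dropout; ends in a 3-way softmax."""
    weight_decay = 0.0005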
    model = Sequential()
    
    model.add(Conv2D(64, (3, 3), padding='same',kernel_regularizer=regularizers.l2(weight_decay)))
    model.add(Activation('relu'))
    model.add(BatchNormalization())
    model.add(Dropout(0.3))
  
    model.add(Conv2D(64, (3, 3), padding='same',kernel_regularizer=regularizers.l2(weight_decay)))
    model.add(Activation('relu'))
    model.add(BatchNormalization())
    model.add(MaxPooling2D(pool_size=(2, 2),strides=(2,2),padding='same')  )
    
    model.add(Conv2D(128, (3, 3), padding='same',kernel_regularizer=regularizers.l2(weight_decay)))
    model.add(Activation('relu'))
    model.add(BatchNormalization())
    model.add(Dropout(0.4))

    model.add(Conv2D(128, (3, 3), padding='same',kernel_regularizer=regularizers.l2(weight_decay)))
    model.add(Activation('relu'))
    model.add(BatchNormalization())
    model.add(MaxPooling2D(pool_size=(2, 2)))

    model.add(Conv2D(256, (3, 3), padding='same',kernel_regularizer=regularizers.l2(weight_decay)))
    model.add(Activation('relu'))
    model.add(BatchNormalization())
    model.add(Dropout(0.4))

    model.add(Conv2D(256, (3, 3), padding='same',kernel_regularizer=regularizers.l2(weight_decay)))
    model.add(Activation('relu'))
    model.add(BatchNormalization())
    model.add(Dropout(0.4))

    model.add(Conv2D(256, (3, 3), padding='same',kernel_regularizer=regularizers.l2(weight_decay)))
    model.add(Activation('relu'))
    model.add(BatchNormalization())
    model.add(MaxPooling2D(pool_size=(2, 2)))

    model.add(Conv2D(512, (3, 3), padding='same',kernel_regularizer=regularizers.l2(weight_decay)))
    model.add(Activation('relu'))
    model.add(BatchNormalization())
    model.add(Dropout(0.4))

    model.add(Conv2D(512, (3, 3), padding='same',kernel_regularizer=regularizers.l2(weight_decay)))
    model.add(Activation('relu'))
    model.add(BatchNormalization())
    model.add(Dropout(0.4))

    model.add(Conv2D(512, (3, 3), padding='same',kernel_regularizer=regularizers.l2(weight_decay)))
    model.add(Activation('relu'))
    model.add(BatchNormalization())
    model.add(MaxPooling2D(pool_size=(2, 2)))

    model.add(Conv2D(512, (3, 3), padding='same',kernel_regularizer=regularizers.l2(weight_decay)))
    model.add(Activation('relu'))
    model.add(BatchNormalization())
    model.add(Dropout(0.4))

    model.add(Conv2D(512, (3, 3), padding='same',kernel_regularizer=regularizers.l2(weight_decay)))
    model.add(Activation('relu'))
    model.add(BatchNormalization())
    model.add(Dropout(0.4))

    model.add(Conv2D(512, (3, 3), padding='same',kernel_regularizer=regularizers.l2(weight_decay)))
    model.add(Activation('relu'))
    model.add(BatchNormalization())
    model.add(MaxPooling2D(pool_size=(2, 2)))
    model.add(Dropout(0.5))

    model.add(Flatten())
    model.add(Dense(512,kernel_regularizer=regularizers.l2(weight_decay)))
    model.add(Activation('relu'))
    model.add(BatchNormalization())
    model.add(Dropout(0.5))
    
    model.add(Dense(512,kernel_regularizer=regularizers.l2(weight_decay)))
    model.add(Activation('relu'))
    model.add(BatchNormalization())
    model.add(Dropout(0.5))
    
    model.add(Dense(3))
    model.add(Activation('softmax'))
    return model
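The model above never declares an input shape, so the size of the Flatten output depends on the images fed to it. The sketch below is plain Python (no TensorFlow needed) and follows the five pooling layers to compute that size. The input sizes 120x120 and 224x224 are assumptions chosen for illustration; the article does not state the actual input size.

```python
import math

def pooled_size(size: int, padding: str) -> int:
    """Output length of a 2x2, stride-2 max pool along one spatial axis."""
    if padding == "same":
        return math.ceil(size / 2)
    # 'valid' padding: floor((size - pool) / stride) + 1 with pool = stride = 2
    return (size - 2) // 2 + 1

def flatten_units(height: int, width: int, channels: int = 512) -> int:
    """Number of values the Flatten layer receives for a given input size."""
    # The first pooling layer in the model uses padding='same';
    # the other four use the Keras default, 'valid'.
    # The 3x3 convolutions use padding='same' with stride 1, so they keep the size.
    paddings = ["same", "valid", "valid", "valid", "valid"]
    for padding in paddings:
        height = pooled_size(height, padding)
        width = pooled_size(width, padding)
    return height * width * channels

print(flatten_units(120, 120))  # 4608
print(flatten_units(224, 224))  # 25088
```

Each pooling stage halves the spatial size, so the first Dense(512) layer sees 3 x 3 x 512 values for a 120x120 input and 7 x 7 x 512 values for a 224x224 input.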

3. Writing the data loader

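The loader uses TensorFlow's TFRecord data format; the files use the .tfrecords extension. TFRecord provides an efficient input stream, so the dataset can be processed in batches without reading all of it into memory. The loader provides online augmentation based on color changes to the images. Augmentations such as rotation and random scaling and cropping were done offline and are not part of the code shown.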
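To make the idea of online color augmentation concrete, here is a small plain-Python sketch of what random brightness and random contrast do to one channel of pixel values in [0, 1]. It is an illustration only, not the TensorFlow implementation: the helper names are hypothetical, and the final clipping step is an assumption added here to keep values in range.

```python
import random

def random_brightness(pixels, max_delta, rng):
    """Add one random offset drawn from [-max_delta, max_delta] to every value."""
    delta = rng.uniform(-max_delta, max_delta)
    return [p + delta for p in pixels]

def random_contrast(pixels, lower, upper, rng):
    """Scale each value's distance from the channel mean by one random factor."""
    factor = rng.uniform(lower, upper)
    mean = sum(pixels) / len(pixels)
    return [(p - mean) * factor + mean for p in pixels]

def clip01(pixels):
    """Assumption: clamp to [0, 1] after the color changes."""
    return [min(1.0, max(0.0, p)) for p in pixels]

def augment(pixels, rng):
    pixels = random_brightness(pixels, 0.1, rng)
    pixels = random_contrast(pixels, 0.9, 1.1, rng)
    return clip01(pixels)

rng = random.Random(0)
channel = [0.0, 0.25, 0.5, 0.75, 1.0]
# "Online" means a fresh random variant is produced every time the data is read.
for epoch in range(3):
    print(epoch, [round(p, 3) for p in augment(channel, rng)])
```

Because the variants are generated on the fly, nothing extra is stored on disk, which is the main difference from offline augmentation.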

import tensorflow as tf
import os
import numpy as np
import random

class tfrec_pre:
    def __init__(self,rootpath,train_val_percentage=None,train_val_abs=None,resize=None):
        self.dataroot = rootpath
        self.tfrecord_train = os.path.join(rootpath , 'train.tfrecords')
        self.tfrecord_val = os.path.join(rootpath , 'val.tfrecords')
        self.train_val_percentage = train_val_percentage
        self.train_val_abs = train_val_abs
        self.resize = resize
        
        self.classlist = []
        # Define the Feature structure so the decoder knows the type of each Feature
        self.feature_description = { 
            'image': tf.io.FixedLenFeature([], tf.string),
            'label': tf.io.FixedLenFeature([], tf.int64),
        }

        
    def _parse_example_jpeg(self,example_string):
        # Decode each serialized tf.train.Example in the TFRecord file
        feature_dict = tf.io.parse_single_example(example_string, self.feature_description)
        feature_dict['image'] = tf.io.decode_jpeg(feature_dict['image'],channels=3)    # decode the JPEG image
        return feature_dict['image'], feature_dict['label']
    
    def _parse_example_png(self,example_string):
        feature_dict = tf.io.parse_single_example(example_string, self.feature_description)
        feature_dict['image'] = tf.io.decode_png(feature_dict['image'],channels=3)    # decode the PNG image
        return feature_dict['image'], feature_dict['label']
    
    def _resize_img(self,img,label):
        # Resize the image and scale pixel values to [0, 1]
        image_resized = tf.image.resize(img, self.resize) / 255.0
        return image_resized,label
    
    def _Data_Augmentation(self,image,label):
        # Random horizontal flip (disabled)
        #image=tf.image.random_flip_left_right(image)
        # Randomly change the brightness
        image=tf.image.random_brightness(image,0.1)
        # Randomly change the contrast
        image=tf.image.random_contrast(image,0.9,1.1)
        # Randomly change the saturation
        image = tf.image.random_saturation(image,0.9,1.1)
        # Random crop (disabled)
        #image = tf.image.random_crop(image,[120,120,3])
        # Randomly change the hue
        image = tf.image.random_hue(image,0.1)
        return image,label
    
    
    # Train set only
    def _tfrec_writer(self,train_filenames,train_labels):
        with tf.io.TFRecordWriter(self.tfrecord_train) as writer:
            for filename, label in zip(train_filenames, train_labels):
                image = open(filename, 'rb').read()     # read the image file into memory as a bytes string
                feature = {                             # build the tf.train.Feature dictionary
                    'image': tf.train.Feature(bytes_list=tf.train.BytesList(value=[image])),  # the image is a bytes object
                    'label': tf.train.Feature(int64_list=tf.train.Int64List(value=[label]))   # the label is an int
                }
                example = tf.train.Example(features=tf.train.Features(feature=feature)) # build the Example from the dictionary
                writer.write(example.SerializeToString())   # serialize the Example and write it to the TFRecord file
        
    # Split into (train, val) by an exact ratio
    def _tfrec_writer_abs(self,train_filenames,train_labels,absper=None):
        dataset=list(zip(train_filenames,train_labels))
        random.shuffle(dataset) 
        data_num = len(dataset)       
        train_num = int(data_num * absper)
        
        with tf.io.TFRecordWriter(self.tfrecord_train) as train:
            print('Writing the train set')
            # Slice up to train_num so that no sample is dropped between the two sets
            for (filename,label) in dataset[:train_num]:
                
                image = open(filename, 'rb').read()     # read the image file into memory as a bytes string
                feature = {                             # build the tf.train.Feature dictionary
                    'image': tf.train.Feature(bytes_list=tf.train.BytesList(value=[image])),  # the image is a bytes object
                    'label': tf.train.Feature(int64_list=tf.train.Int64List(value=[label]))   # the label is an int
                }
                example = tf.train.Example(features=tf.train.Features(feature=feature)) # build the Example from the dictionary
                train.write(example.SerializeToString())   # serialize the Example and write it to the TFRecord file
            
        with tf.io.TFRecordWriter(self.tfrecord_val) as val:
            print('Writing the val set')
            for (filename,label) in dataset[train_num:]:
                
                image = open(filename, 'rb').read()     # read the image file into memory as a bytes string
                feature = {                             # build the tf.train.Feature dictionary
                    'image': tf.train.Feature(bytes_list=tf.train.BytesList(value=[image])),  # the image is a bytes object
                    'label': tf.train.Feature(int64_list=tf.train.Int64List(value=[label]))   # the label is an int
                }
                example = tf.train.Example(features=tf.train.Features(feature=feature)) # build the Example from the dictionary
                val.write(example.SerializeToString())   # serialize the Example and write it to the TFRecord file
                            
    # Split into (train, val) by probability
    def _tfrec_writer_percentage(self,train_filenames,train_labels,percentage=None):
        with tf.io.TFRecordWriter(self.tfrecord_train) as train:
            with tf.io.TFRecordWriter(self.tfrecord_val) as val:
                #choices = np.random.choice([0, 1], size=1000, p=[percentage, 1-percentage])
                for (filename,label) in zip(train_filenames, train_labels):
                    image = open(filename, 'rb').read()     # read the image file into memory as a bytes string
                    feature = {                             # build the tf.train.Feature dictionary
                        'image': tf.train.Feature(bytes_list=tf.train.BytesList(value=[image])),  # the image is a bytes object
                        'label': tf.train.Feature(int64_list=tf.train.Int64List(value=[label]))   # the label is an int
                    }
                    example = tf.train.Example(features=tf.train.Features(feature=feature)) # build the Example from the dictionary
                    # 0 (train) is drawn with probability percentage, 1 (val) otherwise
                    choice=np.random.choice([0, 1],p=[percentage, 1-percentage])
                    if choice==0:
                        train.write(example.SerializeToString())   # serialize the Example and write it to the train file
                    else:
                        val.write(example.SerializeToString())   # serialize the Example and write it to the val file
        
    def generate(self):
        train_filenames = []
        train_labels = []
        
        for root,dirs,files in os.walk(self.dataroot):
            
            for dirname in dirs:
                # Use the directory name as a class
                self.classlist.append(dirname)
                
            if os.path.split(root)[-1] in self.classlist:
                # Get the directory name
                classname=os.path.split(root)[-1]
                new_filenames = [os.path.join(root,filename) for filename in files]
                train_filenames = train_filenames+new_filenames
                # The index of the directory name in classlist is the label of this class
                train_labels = train_labels +[self.classlist.index(classname)] * len(new_filenames)

                
        if self.train_val_percentage is None and self.train_val_abs is None:
            self._tfrec_writer(train_filenames,train_labels)
        if self.train_val_percentage is None and self.train_val_abs is not None:
            self._tfrec_writer_abs(train_filenames,train_labels,self.train_val_abs)
        if self.train_val_percentage is not None and self.train_val_abs is None:
            self._tfrec_writer_percentage(train_filenames,train_labels,self.train_val_percentage)
        if self.train_val_percentage is not None and self.train_val_abs is not None:
            raise RuntimeError('train_val_abs and train_val_percentage cannot be used together')
        
    def load_tfrec_jpeg(self,filename):
        raw_dataset = tf.data.TFRecordDataset(os.path.join(self.dataroot,filename))    # read the TFRecord file
        dataset = raw_dataset.map(self._parse_example_jpeg,num_parallel_calls=tf.data.experimental.AUTOTUNE)
        dataset = dataset.map(self._resize_img,num_parallel_calls=tf.data.experimental.AUTOTUNE)
        return dataset
    
    def load_tfrec_png(self,filename):
        raw_dataset = tf.data.TFRecordDataset(os.path.join(self.dataroot,filename))    # read the TFRecord file
        dataset = raw_dataset.map(self._parse_example_png)
        dataset = dataset.map(self._resize_img)
        return dataset

    
    def load_tfrec_augdata(self,filename):
        raw_dataset = tf.data.TFRecordDataset(os.path.join(self.dataroot,filename))    # read the TFRecord file
        dataset = raw_dataset.map(self._parse_example_png,num_parallel_calls=tf.data.experimental.AUTOTUNE)
        dataset = dataset.map(self._resize_img,num_parallel_calls=tf.data.experimental.AUTOTUNE)
        dataset = dataset.map(self._Data_Augmentation,num_parallel_calls=tf.data.experimental.AUTOTUNE)
        return dataset
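The class offers two ways to split the data into train and val sets: by an exact ratio and by probability. The plain-Python sketch below shows how the two strategies differ, without any TFRecord writing. The function names, the seed parameter, and the sample list are hypothetical and exist only for this illustration.

```python
import random

def split_exact(samples, train_ratio, seed=None):
    """Shuffle, then cut at int(len * train_ratio): the set sizes are fixed."""
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    # [:cut] and [cut:] together cover every sample exactly once.
    return shuffled[:cut], shuffled[cut:]

def split_by_probability(samples, train_prob, seed=None):
    """Send each sample to train with probability train_prob: sizes vary per run."""
    rng = random.Random(seed)
    train, val = [], []
    for sample in samples:
        (train if rng.random() < train_prob else val).append(sample)
    return train, val

# Hypothetical (filename, label) pairs standing in for the real image list.
samples = [("img_%03d.png" % i, i % 3) for i in range(100)]

train, val = split_exact(samples, 0.8, seed=0)
print(len(train), len(val))  # 80 20

train, val = split_by_probability(samples, 0.8, seed=0)
print(len(train) + len(val))  # 100
```

The exact split needs the whole file list in memory before writing, while the probability split can decide per sample as it streams, at the cost of set sizes that are only approximately 80/20.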

4. Script code

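Here I show only the core part of the script, the part that uses the container's training environment variables. The rest of the script, which builds the model and loads the data, is not described here.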

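import argparse
import os

def parse_args():
    """Parse hyperparameters and the SageMaker data and model directories."""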
    parser = argparse.ArgumentParser()
    # hyperparameters sent by the client are passed as command-line arguments to the script
    parser.add_argument('--epochs', type=int, default=5)
    parser.add_argument('--batch_size', type=int, default=32)
    parser.add_argument('--learning_rate', type=float, default=0.001)
    
    # data directories
    parser.add_argument('--train', type=str, default=os.environ.get('SM_CHANNEL_TRAIN'))
    parser.add_argument('--test', type=str, default=os.environ.get('SM_CHANNEL_TEST'))
    
    # model directory: we will use the default set by SageMaker, /opt/ml/model
    parser.add_argument('--model_dir', type=str, default=os.environ.get('SM_MODEL_DIR'))  
    return parser.parse_known_args()

4.1 Why training environment variables are used

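Script mode runs the code as a command (python xxx.py --epochs 5 --batch_size 32 ...), so the model and the code have to be controlled through command-line arguments. Inside the SageMaker managed container, a set of training environment variables describes where the dataset is located and where the model should be written.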

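For example: before training starts, the managed container downloads the dataset from S3 into a designated directory inside the container, and the training code reads its data from that directory. The trained model must be saved to another designated directory. When the job ends and SageMaker destroys the container, the contents of that directory are preserved and uploaded to S3 for managed storage.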
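The flow described here can be tried locally by simulating what the container does before it launches the script. The sketch below is a trimmed copy of the argument parser with an added argv parameter (an assumption for testability), and it sets the two environment variables by hand to the paths SageMaker normally uses. The flag --unknown_flag is hypothetical and only shows why parse_known_args is convenient.

```python
import argparse
import os

def parse_args(argv=None):
    parser = argparse.ArgumentParser()
    # Hyperparameters arrive as command-line arguments.
    parser.add_argument('--epochs', type=int, default=5)
    parser.add_argument('--batch_size', type=int, default=32)
    # Directories default to the values of the container's environment variables.
    parser.add_argument('--train', type=str, default=os.environ.get('SM_CHANNEL_TRAIN'))
    parser.add_argument('--model_dir', type=str, default=os.environ.get('SM_MODEL_DIR'))
    # parse_known_args tolerates extra arguments instead of exiting with an error.
    return parser.parse_known_args(argv)

# Simulate the environment that the managed container sets up.
os.environ['SM_CHANNEL_TRAIN'] = '/opt/ml/input/data/train'
os.environ['SM_MODEL_DIR'] = '/opt/ml/model'

# Simulate: python xxx.py --epochs 1 --batch_size 64 --unknown_flag 1
args, unknown = parse_args(['--epochs', '1', '--batch_size', '64', '--unknown_flag', '1'])
print(args.epochs, args.batch_size)  # 1 64
print(args.train)                    # /opt/ml/input/data/train
print(args.model_dir)                # /opt/ml/model
print(unknown)                       # ['--unknown_flag', '1']
```

Outside SageMaker the environment variables are unset, so the directory arguments default to None and have to be passed explicitly on the command line.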

4.2 Relevant environment variables

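The important environment variables are:

SM_CHANNEL_TRAIN: path of the training set inside the container
SM_CHANNEL_TEST: path of the test set inside the container
SM_MODEL_DIR: path where the model is saved

...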

5. Notes

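Because training runs in a managed container, the metrics produced during training are not printed to the Jupyter notebook console. To see the results in Jupyter, add TensorBoard logging to the script to capture them.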

6. Code for calling the SageMaker API

To use a SageMaker managed container, call the corresponding API to create the container and start training.

6.1 Importing the SageMaker SDK

import sagemaker
from sagemaker.tensorflow import TensorFlow
import os

6.2 Setting the data input paths

bucket = 'your-bucketname' 
prefix = 'Tensorflow_demo'

# S3 paths of the training and test data
train_path = 's3://{}/{}/data/train/train_AUG_ALL.tfrecords'.format(bucket,prefix)
test_path = 's3://{}/{}/data/test/test_AUG_ALL.tfrecords'.format(bucket,prefix)
# Inputs are passed as key-value pairs (channel name -> S3 path)
inputs = {'train':train_path,
          'test': test_path}
print(inputs)
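Each key of the inputs dictionary is a channel name, and the channel name determines both the environment variable and the local directory that the training script sees. The helper below is a hypothetical illustration of that naming convention, assuming the standard SageMaker layout under /opt/ml/input/data; it does not call the SageMaker SDK.

```python
def channel_env(inputs):
    """Map each input channel to its environment variable and container directory."""
    mapping = {}
    for name in inputs:
        mapping[name] = {
            'env_var': 'SM_CHANNEL_' + name.upper(),
            'local_dir': '/opt/ml/input/data/' + name,
        }
    return mapping

# Same shape as the inputs dictionary above; the bucket name is a placeholder.
inputs = {
    'train': 's3://your-bucketname/Tensorflow_demo/data/train/train_AUG_ALL.tfrecords',
    'test': 's3://your-bucketname/Tensorflow_demo/data/test/test_AUG_ALL.tfrecords',
}
for name, info in channel_env(inputs).items():
    print(name, info['env_var'], info['local_dir'])
```

This is why the script reads SM_CHANNEL_TRAIN and SM_CHANNEL_TEST: the names come from the 'train' and 'test' keys, so renaming a key also renames the variable.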

6.3 Calling the API to launch a container for managed training

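Some important parameters:

entry_point: the script to run (required)
source_dir: the project folder (required)
train_instance_type: compute instance type (required)
train_instance_count: number of instances (required); TensorFlow supports distributed training, but your training code must support it as well
role: the IAM role (required)
base_job_name: name prefix of the training job
framework_version: framework version (required)
py_version: Python version (required)
script_mode: whether to enable script mode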

train_instance_type = 'ml.c4.2xlarge'
hyperparameters = {'epochs': 1, 'batch_size': 32, 'learning_rate': 0.0001}
model_dir = '/opt/ml/model'

# Use the managed container
estimator = TensorFlow(entry_point='SageMaker_run.py',
                       source_dir='TensorFlow',
                       train_instance_type=train_instance_type,
                       train_instance_count=1,
                       hyperparameters=hyperparameters,
                       model_dir=model_dir,
                       role=sagemaker.get_execution_role(),
                       base_job_name='tf-DC-liangwenyao',
                       framework_version='2.1.0',
                       py_version='py3',
                       script_mode=True)
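The argument names above match version 1 of the SageMaker Python SDK, which was current when this post was written. To my understanding, version 2 of the SDK renamed train_instance_type and train_instance_count and dropped script_mode for the TensorFlow estimator, so the call may need adjusting on newer installations. The helper below is a hypothetical sketch of that renaming in plain Python; please check it against the SDK's migration notes before relying on it.

```python
# Renames between SageMaker Python SDK v1 and v2 for the arguments used above
# (assumption based on the v2 migration notes; verify for your SDK version).
V1_TO_V2 = {
    'train_instance_type': 'instance_type',
    'train_instance_count': 'instance_count',
}
# Script mode is the only mode in v2, so the flag is no longer accepted.
REMOVED_IN_V2 = {'script_mode'}

def to_v2_kwargs(v1_kwargs):
    """Return a copy of the v1 keyword arguments using v2 names."""
    v2_kwargs = {}
    for key, value in v1_kwargs.items():
        if key in REMOVED_IN_V2:
            continue
        v2_kwargs[V1_TO_V2.get(key, key)] = value
    return v2_kwargs

v1_kwargs = {
    'entry_point': 'SageMaker_run.py',
    'source_dir': 'TensorFlow',
    'train_instance_type': 'ml.c4.2xlarge',
    'train_instance_count': 1,
    'framework_version': '2.1.0',
    'py_version': 'py3',
    'script_mode': True,
}
print(to_v2_kwargs(v1_kwargs))
```

The resulting dictionary could then be passed as TensorFlow(**kwargs, role=..., hyperparameters=...) on an SDK v2 installation.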

6.4 Starting the training container

estimator.fit(inputs)

7. Retrieving the trained model

Go to S3. By default, SageMaker creates a new bucket named sagemaker-<region>-<account ID> and saves the model there, in a folder whose name starts with base_job_name.
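As a rough guide to where the artifact ends up, the sketch below builds the expected S3 location from its parts. It assumes the default bucket naming and the usual <job name>/output/model.tar.gz layout. The region, account ID, and timestamp suffix are made-up placeholders, since the real job name is base_job_name plus a timestamp generated at launch.

```python
def default_bucket_name(region, account_id):
    """Name of the bucket SageMaker uses when no output path is given."""
    return 'sagemaker-{}-{}'.format(region, account_id)

def model_artifact_uri(bucket, job_name):
    """Expected S3 URI of the packaged model for a finished training job."""
    return 's3://{}/{}/output/model.tar.gz'.format(bucket, job_name)

# Placeholder values for illustration only.
bucket = default_bucket_name('us-east-1', '123456789012')
job_name = 'tf-DC-liangwenyao-2020-10-23-08-43-00-000'
print(model_artifact_uri(bucket, job_name))
```

After estimator.fit(inputs) returns, the estimator's model_data attribute should hold the actual URI, which is more reliable than reconstructing it by hand.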

posted @ 2020-10-23 16:43 鸭梨的药丸哥
