[SageMaker] Custom Docker Containers with SageMaker Debugger

This starts a series. The following page is worth reading closely and working through: https://docs.aws.amazon.com/sagemaker/latest/dg/docker-containers.html

[2] The official training-focused examples: Amazon SageMaker Debugger Examples [they run as-is, recommended!]

 

 

1. Start training

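from sagemaker.estimator import Estimator
from sagemaker import get_execution_role

role = get_execution_role()

# byoc_image_uri, rules and hook_config must be defined before this point.
# Parameter names follow SageMaker Python SDK v1; v2 renamed them to
# image_uri, instance_count and instance_type.
estimator = Estimator(
    image_name=byoc_image_uri,
    role=role,
    train_instance_count=1,
    train_instance_type="ml.p3.16xlarge",

    # Debugger-specific parameters
    rules=rules,
    debugger_hook_config=hook_config,
)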


estimator.fit(wait=False)

 

 

2. Monitor training

job_name = estimator.latest_training_job.name
print('Training job name: {}'.format(job_name))

client = estimator.sagemaker_session.sagemaker_client

description = client.describe_training_job(TrainingJobName=job_name)

Poll to check the progress of the training job running in the background.

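import time

# Primary statuses after which the job will not change state again.
TERMINAL_STATUSES = {'Completed', 'Failed', 'Stopped'}

start = time.time()

# Poll until the job starts training or reaches a terminal status.
# Without the terminal check, a failed or stopped job would loop forever.
while (description['TrainingJobStatus'] not in TERMINAL_STATUSES
       and description['SecondaryStatus'] != 'Training'):
    description      = client.describe_training_job(TrainingJobName=job_name)

    primary_status   = description['TrainingJobStatus']
    secondary_status = description['SecondaryStatus']

    print('Current job status: [PrimaryStatus: {}, SecondaryStatus: {}]'.format(primary_status, secondary_status))
    time.sleep(15)

end = time.time()
print("Duration: {}".format(end-start))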

Log: 

...
Current job status: [PrimaryStatus: InProgress, SecondaryStatus: Starting]
Current job status: [PrimaryStatus: InProgress, SecondaryStatus: Starting]
Current job status: [PrimaryStatus: InProgress, SecondaryStatus: Starting]
Current job status: [PrimaryStatus: InProgress, SecondaryStatus: Downloading]
Current job status: [PrimaryStatus: InProgress, SecondaryStatus: Training]
Duration: 226.07888889312744
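The polling loop above can also be wrapped in a small reusable helper. What follows is a minimal sketch, not part of the SageMaker SDK: the name `wait_until_training`, its parameters, and the timeout are all assumptions made for illustration. It takes any zero-argument callable that returns a dict shaped like the `describe_training_job` response, so it can be exercised without an AWS account.

```python
import time

# Primary statuses after which the job will not change state again.
TERMINAL_STATUSES = {"Completed", "Failed", "Stopped"}


def wait_until_training(describe, poll_seconds=15.0, timeout_seconds=3600.0,
                        sleep=time.sleep):
    """Poll describe() until the job is training, has ended, or time runs out.

    describe: hypothetical zero-argument callable returning a dict with the
    keys 'TrainingJobStatus' and 'SecondaryStatus'.
    """
    deadline = time.monotonic() + timeout_seconds
    while True:
        description = describe()
        primary = description["TrainingJobStatus"]
        secondary = description["SecondaryStatus"]
        print("Current job status: [PrimaryStatus: {}, SecondaryStatus: {}]"
              .format(primary, secondary))

        # Stop once training has started or the job has ended for any reason.
        if secondary == "Training" or primary in TERMINAL_STATUSES:
            return description
        if time.monotonic() >= deadline:
            raise TimeoutError("Training job did not start within the timeout")
        sleep(poll_seconds)
```

With the `client` and `job_name` from the section above, a call would look like `wait_until_training(lambda: client.describe_training_job(TrainingJobName=job_name))`. Injecting `sleep` keeps the helper easy to test with a fake that returns immediately.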

 

 

3. Passing parameters

Method: set_hyperparameters(...), a remarkably handy function.

How do you pass parameters to the container? The code in [1] covers it:

(1) Pass the parameters in.

import sagemaker

est = sagemaker.estimator.Estimator(container_image_uri,  # <---- custom Docker image
                                    role, 
                                    train_instance_count=1, 
                                    train_instance_type='local', # use local mode
                                    #train_instance_type='ml.m5.xlarge',
                                    base_job_name=prefix)

est.set_hyperparameters(hp1='value1',
                        hp2=300,
                        hp3=0.001)

train_config = sagemaker.session.s3_input('s3://{0}/{1}/train/'.format(bucket, prefix), content_type='text/csv')
val_config   = sagemaker.session.s3_input('s3://{0}/{1}/val/'.format(bucket, prefix), content_type='text/csv')

est.fit({'train': train_config, 'validation': val_config })

(2) With the training library installed in the container, receive the parameters in the training script.

import argparse
import os

# train(...) is your own training function, defined elsewhere in this script.

if __name__ == "__main__":

    parser = argparse.ArgumentParser()
    
    # sagemaker-containers passes hyperparameters as arguments
    parser.add_argument('--hp1', type=str)
    parser.add_argument('--hp2', type=int, default=50)
    parser.add_argument('--hp3', type=float, default=0.1)
    
    # This is a way to pass additional arguments when running as a script
    # and use sagemaker-containers defaults to set their values when not specified.
    parser.add_argument('--train', type=str, default=os.environ['SM_CHANNEL_TRAIN'])
    parser.add_argument('--validation', type=str, default=os.environ['SM_CHANNEL_VALIDATION'])

    args = parser.parse_args()
    
    train(args.hp1, args.hp2, args.hp3, args.train, args.validation)
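To see why the `type=` arguments in the parser matter: inside the container, hyperparameters arrive as strings and are handed to the script as `--name value` pairs. The sketch below imitates that hand-off with the standard library only. `hyperparameters_to_argv` is a hypothetical helper written for this illustration, not the toolkit's actual code, and the JSON string is made-up sample input that mirrors the values passed to `set_hyperparameters` above.

```python
import argparse
import json


def hyperparameters_to_argv(hyperparameters):
    """Flatten a {name: value} mapping into ['--name', 'value', ...]."""
    argv = []
    for name, value in sorted(hyperparameters.items()):
        argv.extend(["--{}".format(name), str(value)])
    return argv


def build_parser():
    """Same hyperparameter arguments as the training script above."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--hp1", type=str)
    parser.add_argument("--hp2", type=int, default=50)
    parser.add_argument("--hp3", type=float, default=0.1)
    return parser


# Hypothetical sample input: every value is a string, as in the container.
raw = '{"hp1": "value1", "hp2": "300", "hp3": "0.001"}'

argv = hyperparameters_to_argv(json.loads(raw))
args = build_parser().parse_args(argv)

# argparse converts the strings back to the declared types.
print(argv)
print(type(args.hp1).__name__, type(args.hp2).__name__, type(args.hp3).__name__)
```

Because everything crosses the container boundary as text, any hyperparameter declared without `type=` stays a string inside `train(...)`.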

 

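It is best to start from the Dockerfile at the link below and integrate your own code into it, so that the parameters are passed through reliably.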

https://github.com/aws/amazon-sagemaker-examples/blob/master/advanced_functionality/custom-training-containers/script-mode-container/docker/Dockerfile

If you also need to handle the TensorFlow environment and GPU Docker setup, refer to the following SageMaker-related Dockerfile:

https://github.com/aws/deep-learning-containers/blob/master/tensorflow/training/docker/2.2/py3/cu102/Dockerfile.gpu

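Note on the default code directory: there is a gotcha in the COPY line below. With the wildcard, the contents of each matched subdirectory are copied straight into the destination, which flattens the tree. To keep the directory structure, remember to drop the asterisk (COPY code/ /opt/ml/code/).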

# Copies code under /opt/ml/code where sagemaker-containers expects to find the script to run
COPY code/* /opt/ml/code/
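The flattening is easier to see on a toy directory. The sketch below is a rough Python imitation of the relevant COPY rule (a matched directory contributes its contents, not itself), not Docker's actual implementation. `copy_like_docker`, `list_files`, and the file names `train.py` and `utils/helpers.py` are all hypothetical, and `dirs_exist_ok` requires Python 3.8 or later.

```python
import glob
import os
import shutil
import tempfile


def copy_like_docker(src_pattern, dest):
    """Imitate COPY: each matched directory contributes its contents only."""
    os.makedirs(dest, exist_ok=True)
    for match in sorted(glob.glob(src_pattern)):
        if os.path.isdir(match):
            # The directory itself is not recreated under dest.
            shutil.copytree(match, dest, dirs_exist_ok=True)
        else:
            shutil.copy(match, dest)


def list_files(root):
    """Return all file paths under root, relative to root, sorted."""
    found = []
    for folder, _, names in os.walk(root):
        for name in names:
            found.append(os.path.relpath(os.path.join(folder, name), root))
    return sorted(found)


with tempfile.TemporaryDirectory() as ctx:
    # Hypothetical build context: code/train.py and code/utils/helpers.py
    os.makedirs(os.path.join(ctx, "code", "utils"))
    for rel in ("train.py", os.path.join("utils", "helpers.py")):
        with open(os.path.join(ctx, "code", rel), "w") as handle:
            handle.write("# placeholder\n")

    flat = os.path.join(ctx, "flat")
    kept = os.path.join(ctx, "kept")
    copy_like_docker(os.path.join(ctx, "code", "*"), flat)  # like COPY code/* ...
    copy_like_docker(os.path.join(ctx, "code"), kept)       # like COPY code/ ...
    flat_files = list_files(flat)
    kept_files = list_files(kept)

print("with asterisk:   ", flat_files)
print("without asterisk:", kept_files)
```

In the wildcard case `helpers.py` ends up next to `train.py`, so an import such as `from utils import helpers` would fail inside the container. Without the asterisk the `utils` subdirectory survives.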

 

 

  

 

 

Get started with SageMaker Debugger


Ref: NEW! Amazon SageMaker Studio - Debug Models with Amazon SageMaker Debugger

 

Key points ...

 

 

 

 

posted @ 2020-12-17 11:16  郝壹贰叁