[SageMaker] Custom Docker Containers with SageMaker Debugger
This starts a series. It is worth reading and working through: https://docs.aws.amazon.com/sagemaker/latest/dg/docker-containers.html
【1】Ref: amazon-sagemaker-examples/advanced_functionality/custom-training-containers/basic-training-container/notebook/basic_training_container.ipynb 【an example that passes parameters】
【2】The official examples dedicated to training: Amazon SageMaker Debugger Examples 【runs as-is, recommended!】
1. Start training
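from sagemaker.estimator import Estimator
from sagemaker import get_execution_role

role = get_execution_role()

# Parameter names follow SageMaker Python SDK v1; in SDK v2 they are
# image_uri, instance_count, and instance_type.
estimator = Estimator(
    image_name=byoc_image_uri,
    role=role,
    train_instance_count=1,
    train_instance_type="ml.p3.16xlarge",

    # Debugger-specific parameters
    rules=rules,
    debugger_hook_config=hook_config
)

# wait=False returns immediately; the job keeps running in the background.
estimator.fit(wait=False)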
2. Monitor training
job_name = estimator.latest_training_job.name
print('Training job name: {}'.format(job_name))

client = estimator.sagemaker_session.sagemaker_client
description = client.describe_training_job(TrainingJobName=job_name)
Poll to check the progress of the training job running in the background.
import time

start = time.time()
if description['TrainingJobStatus'] != 'Completed':
    # 'Failed' and 'Stopped' are included so the loop also ends if the job never reaches 'Training'.
    while description['SecondaryStatus'] not in {'Training', 'Completed', 'Failed', 'Stopped'}:
        description = client.describe_training_job(TrainingJobName=job_name)
        primary_status = description['TrainingJobStatus']
        secondary_status = description['SecondaryStatus']
        print('Current job status: [PrimaryStatus: {}, SecondaryStatus: {}]'.format(primary_status, secondary_status))
        time.sleep(15)
end = time.time()
print("Duration: {}".format(end - start))
Log:
...
Current job status: [PrimaryStatus: InProgress, SecondaryStatus: Starting]
Current job status: [PrimaryStatus: InProgress, SecondaryStatus: Starting]
Current job status: [PrimaryStatus: InProgress, SecondaryStatus: Starting]
Current job status: [PrimaryStatus: InProgress, SecondaryStatus: Downloading]
Current job status: [PrimaryStatus: InProgress, SecondaryStatus: Training]
Duration: 226.07888889312744
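The polling loop above can also be wrapped in a small helper that stops on a timeout as well as on a terminal state. This is only a sketch and not the notebook's code: wait_for_status, describe, and fake_describe are hypothetical names. In a notebook, describe would be something like lambda: client.describe_training_job(TrainingJobName=job_name); here a stand-in replays a fixed sequence of statuses so the example runs without AWS access.

```python
import time


def wait_for_status(describe, targets=("Training", "Completed"),
                    terminal=("Failed", "Stopped"),
                    poll_seconds=15, timeout_seconds=3600, sleep=time.sleep):
    """Poll describe() until SecondaryStatus reaches a target or terminal state.

    describe is any zero-argument callable returning a dict with the
    'TrainingJobStatus' and 'SecondaryStatus' keys (hypothetical interface).
    """
    deadline = time.monotonic() + timeout_seconds
    while True:
        description = describe()
        secondary = description["SecondaryStatus"]
        print("Current job status: [PrimaryStatus: {}, SecondaryStatus: {}]".format(
            description["TrainingJobStatus"], secondary))
        if secondary in targets:
            return description
        if secondary in terminal:
            raise RuntimeError("Training job ended in state {}".format(secondary))
        if time.monotonic() >= deadline:
            raise TimeoutError("Timed out waiting for the training job")
        sleep(poll_seconds)


# Stand-in for the real API call: replays a fixed sequence of statuses.
_fake_statuses = iter(["Starting", "Downloading", "Training"])


def fake_describe():
    return {"TrainingJobStatus": "InProgress",
            "SecondaryStatus": next(_fake_statuses)}


# The sleep argument is replaced with a no-op so the demo finishes instantly.
result = wait_for_status(fake_describe, sleep=lambda seconds: None)
```

Passing the describe call in as an argument keeps the loop independent of the SageMaker client, so it can be exercised offline before being pointed at a real job.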
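3. How to pass parameters
Method: set_hyperparameters(...), a remarkably handy function.
How do you pass parameters to the container? The code in 【1】 shows how:
(1) Pass in the parameters.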
import sagemaker

est = sagemaker.estimator.Estimator(container_image_uri,  # <---- custom Docker image
                                    role,
                                    train_instance_count=1,
                                    train_instance_type='local',  # use local mode
                                    #train_instance_type='ml.m5.xlarge',
                                    base_job_name=prefix)

est.set_hyperparameters(hp1='value1',
                        hp2=300,
                        hp3=0.001)

train_config = sagemaker.session.s3_input('s3://{0}/{1}/train/'.format(bucket, prefix), content_type='text/csv')
val_config = sagemaker.session.s3_input('s3://{0}/{1}/val/'.format(bucket, prefix), content_type='text/csv')

# The dict keys are the channel names; they become SM_CHANNEL_TRAIN and SM_CHANNEL_VALIDATION in the container.
est.fit({'train': train_config, 'validation': val_config})
(2) With the training library downloaded into the container, the script receives the parameters.
import argparse
import os

# train() is assumed to be defined earlier in this script.
if __name__ == "__main__":
    parser = argparse.ArgumentParser()

    # sagemaker-containers passes hyperparameters as arguments
    parser.add_argument('--hp1', type=str)
    parser.add_argument('--hp2', type=int, default=50)
    parser.add_argument('--hp3', type=float, default=0.1)

    # This is a way to pass additional arguments when running as a script
    # and use sagemaker-containers defaults to set their values when not specified.
    parser.add_argument('--train', type=str, default=os.environ['SM_CHANNEL_TRAIN'])
    parser.add_argument('--validation', type=str, default=os.environ['SM_CHANNEL_VALIDATION'])

    args = parser.parse_args()
    train(args.hp1, args.hp2, args.hp3, args.train, args.validation)
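To make the hand-off between steps (1) and (2) concrete: inside the container the hyperparameters arrive with every value serialized as a string (SageMaker writes them to /opt/ml/input/config/hyperparameters.json), and the training toolkit turns them into --name value arguments for the script. The sketch below imitates that conversion under those assumptions. to_cli_args and build_parser are hypothetical helpers, not the toolkit's own functions, and a plain dict stands in for os.environ.

```python
import argparse
import json


def to_cli_args(hyperparameters):
    """Flatten {'hp2': '300'} into ['--hp2', '300'] (illustrative only)."""
    args = []
    for name, value in sorted(hyperparameters.items()):
        args.extend(["--{}".format(name), str(value)])
    return args


def build_parser(env):
    """Same arguments as the training script; env stands in for os.environ."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--hp1", type=str)
    parser.add_argument("--hp2", type=int, default=50)
    parser.add_argument("--hp3", type=float, default=0.1)
    parser.add_argument("--train", type=str, default=env["SM_CHANNEL_TRAIN"])
    parser.add_argument("--validation", type=str,
                        default=env["SM_CHANNEL_VALIDATION"])
    return parser


# What set_hyperparameters(hp1='value1', hp2=300, hp3=0.001) looks like once
# serialized: every value is a string.
raw = json.loads('{"hp1": "value1", "hp2": "300", "hp3": "0.001"}')

# Channels named 'train' and 'validation' map to directories under
# /opt/ml/input/data, exposed through SM_CHANNEL_<NAME> variables.
env = {
    "SM_CHANNEL_TRAIN": "/opt/ml/input/data/train",
    "SM_CHANNEL_VALIDATION": "/opt/ml/input/data/validation",
}

# argparse restores the types declared with type=..., so hp2 is an int again.
args = build_parser(env).parse_args(to_cli_args(raw))
print(args.hp1, args.hp2, args.hp3, args.train, args.validation)
```

Because the values travel as strings, the type= declarations in the parser are what restore hp2 to an int and hp3 to a float; a hyperparameter without a matching add_argument call would make parse_args exit with an error.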
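It is best to start from the Docker setup at that link and integrate your own code into it, so that the parameters are passed through reliably.
If you need to deal with the TensorFlow environment or with GPU Docker images, refer to the following SageMaker-related Docker images:
Note on the default startup directory: COPY with a wildcard has a pitfall. With code/*, Docker copies the contents of each matched subdirectory straight into the destination, so the tree is flattened. To preserve the directory structure, remember to remove the asterisk.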
# Copies code under /opt/ml/code where sagemaker-containers expects to find the script to run
COPY code/* /opt/ml/code/
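The effect of the asterisk can be imitated outside Docker. The following is a simulation in Python, not Docker itself: copy_like_wildcard and copy_like_directory are hypothetical helpers that mimic what COPY code/* /opt/ml/code/ and COPY code/ /opt/ml/code/ do to a small tree, and the file names are made up for the demonstration.

```python
import shutil
import tempfile
from pathlib import Path


def copy_like_wildcard(src, dest):
    """Imitate `COPY src/* dest/`: matched directories are emptied into dest."""
    dest.mkdir(parents=True, exist_ok=True)
    for match in sorted(src.iterdir()):
        if match.is_dir():
            # Only the directory's contents are copied, not the directory.
            shutil.copytree(match, dest, dirs_exist_ok=True)
        else:
            shutil.copy2(match, dest / match.name)


def copy_like_directory(src, dest):
    """Imitate `COPY src/ dest/`: the tree under src is reproduced as-is."""
    shutil.copytree(src, dest, dirs_exist_ok=True)


def relative_files(root):
    """List every file under root as a POSIX path relative to root."""
    return sorted(p.relative_to(root).as_posix()
                  for p in root.rglob("*") if p.is_file())


with tempfile.TemporaryDirectory() as tmp:
    code = Path(tmp) / "code"
    (code / "utils").mkdir(parents=True)
    (code / "train.py").write_text("# entry point\n")
    (code / "utils" / "helpers.py").write_text("# helper\n")

    copy_like_wildcard(code, Path(tmp) / "flat")
    copy_like_directory(code, Path(tmp) / "kept")

    flattened = relative_files(Path(tmp) / "flat")
    preserved = relative_files(Path(tmp) / "kept")

print(flattened)  # helpers.py has lost its utils/ parent
print(preserved)  # utils/helpers.py keeps its place
```

In the flattened result, an import such as from utils import helpers would no longer resolve, which is why dropping the asterisk matters for any script that relies on subpackages.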
Get started with SageMaker Debugger
Ref: NEW! Amazon SageMaker Studio - Debug Models with Amazon SageMaker Debugger
Key points ...