[SageMaker] Custom Docker Containers with SageMaker Debugger
This starts a series. It is worth reading and working through: https://docs.aws.amazon.com/sagemaker/latest/dg/docker-containers.html
[1] Ref: amazon-sagemaker-examples/advanced_functionality/custom-training-containers/basic-training-container/notebook/basic_training_container.ipynb (an example that passes parameters)
[2] The official examples dedicated to training: Amazon SageMaker Debugger Examples (they run as-is; recommended)
1. Start training
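from sagemaker.estimator import Estimator
from sagemaker import get_execution_role

role = get_execution_role()

# byoc_image_uri, rules and hook_config are assumed to be defined earlier in the notebook.
# The argument names below are from SageMaker Python SDK v1; v2 renames them to
# image_uri, instance_count and instance_type.
estimator = Estimator(
    image_name=byoc_image_uri,
    role=role,
    train_instance_count=1,
    train_instance_type="ml.p3.16xlarge",

    # Debugger-specific parameters
    rules=rules,
    debugger_hook_config=hook_config
)

# wait=False returns immediately; the job keeps running in the background.
estimator.fit(wait=False)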
2. Monitor training
job_name = estimator.latest_training_job.name
print('Training job name: {}'.format(job_name))

client = estimator.sagemaker_session.sagemaker_client
description = client.describe_training_job(TrainingJobName=job_name)
Poll to check the progress of the training job running in the background.
import time

start = time.time()

if description['TrainingJobStatus'] != 'Completed':
    # 'Failed' and 'Stopped' are included so the loop also ends if the job never reaches 'Training'.
    while description['SecondaryStatus'] not in {'Training', 'Completed', 'Failed', 'Stopped'}:
        description = client.describe_training_job(TrainingJobName=job_name)
        primary_status = description['TrainingJobStatus']
        secondary_status = description['SecondaryStatus']
        print('Current job status: [PrimaryStatus: {}, SecondaryStatus: {}]'.format(primary_status, secondary_status))
        time.sleep(15)

end = time.time()
print("Duration: {}".format(end - start))
Log:
...
Current job status: [PrimaryStatus: InProgress, SecondaryStatus: Starting]
Current job status: [PrimaryStatus: InProgress, SecondaryStatus: Starting]
Current job status: [PrimaryStatus: InProgress, SecondaryStatus: Starting]
Current job status: [PrimaryStatus: InProgress, SecondaryStatus: Downloading]
Current job status: [PrimaryStatus: InProgress, SecondaryStatus: Training]
Duration: 226.07888889312744
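3. How to pass parameters
Method: set_hyperparameters(...), a remarkably handy function.
How are parameters passed to the container? The code in [1] shows how:
(1) Pass the parameters in.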
import sagemaker

# container_image_uri, role, prefix and bucket are assumed to be defined earlier in the notebook.
est = sagemaker.estimator.Estimator(container_image_uri,  # <---- custom Docker image
                                    role,
                                    train_instance_count=1,
                                    train_instance_type='local',  # use local mode
                                    #train_instance_type='ml.m5.xlarge',
                                    base_job_name=prefix)

est.set_hyperparameters(hp1='value1', hp2=300, hp3=0.001)

train_config = sagemaker.session.s3_input('s3://{0}/{1}/train/'.format(bucket, prefix), content_type='text/csv')
val_config = sagemaker.session.s3_input('s3://{0}/{1}/val/'.format(bucket, prefix), content_type='text/csv')

est.fit({'train': train_config, 'validation': val_config})
(2) Install the training library in the image, then receive the parameters in the training script.
import argparse
import os

if __name__ == "__main__":
    parser = argparse.ArgumentParser()

    # sagemaker-containers passes hyperparameters as arguments
    parser.add_argument('--hp1', type=str)
    parser.add_argument('--hp2', type=int, default=50)
    parser.add_argument('--hp3', type=float, default=0.1)

    # This is a way to pass additional arguments when running as a script
    # and use sagemaker-containers defaults to set their values when not specified.
    parser.add_argument('--train', type=str, default=os.environ['SM_CHANNEL_TRAIN'])
    parser.add_argument('--validation', type=str, default=os.environ['SM_CHANNEL_VALIDATION'])

    args = parser.parse_args()

    # train() is the training function defined elsewhere in the script.
    train(args.hp1, args.hp2, args.hp3, args.train, args.validation)
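To make the round trip concrete: SageMaker hands hyperparameters to the container as strings (in /opt/ml/input/config/hyperparameters.json), and the script's argparse types turn them back into int and float. The sketch below imitates that flow; `hyperparameters_to_argv` is a hypothetical helper written for illustration and is not the actual sagemaker-containers code.

```python
import argparse
import json


def hyperparameters_to_argv(hyperparameters):
    """Flatten a {name: value} mapping into ['--name', 'value', ...] (illustrative only)."""
    argv = []
    for name, value in sorted(hyperparameters.items()):
        argv.extend(['--{}'.format(name), str(value)])
    return argv


# What the container sees: every value arrives as a string.
raw = json.loads('{"hp1": "value1", "hp2": "300", "hp3": "0.001"}')
argv = hyperparameters_to_argv(raw)

# Same argument definitions as the training script above.
parser = argparse.ArgumentParser()
parser.add_argument('--hp1', type=str)
parser.add_argument('--hp2', type=int, default=50)
parser.add_argument('--hp3', type=float, default=0.1)

# argparse converts the strings back to the declared types.
args = parser.parse_args(argv)
print(args.hp1, args.hp2, args.hp3)
```

A hyperparameter that is not set with set_hyperparameters simply falls back to the `default` declared in `add_argument`.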
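It is best to start from the Docker image at that link and integrate your own code into it, which ensures the parameters are passed through correctly.
If you also need to deal with the TensorFlow environment and GPU Docker images, the SageMaker-related Docker images to consult are as follows:
Note on the default startup directory: the COPY line below has a pitfall. The wildcard flattens subdirectories, so to preserve the directory structure remember to remove the asterisk.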
# Copies code under /opt/ml/code where sagemaker-containers expects to find the script to run
COPY code/* /opt/ml/code/
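The flattening happens because COPY copies the contents of a directory source rather than the directory itself, so each subdirectory matched by the wildcard is emptied into the destination. The sketch below imitates that rule in Python (3.8+) on a temporary tree; it is a simulation for illustration, not Docker itself, and `copy_like_docker`, `list_files` and the file names are all made up.

```python
import glob
import os
import shutil
import tempfile


def copy_like_docker(sources, dest):
    """Mimic one COPY rule: a directory source contributes its contents, not itself."""
    os.makedirs(dest, exist_ok=True)
    for src in sources:
        if os.path.isdir(src):
            shutil.copytree(src, dest, dirs_exist_ok=True)
        else:
            shutil.copy(src, dest)


def list_files(root):
    """Return sorted file paths relative to root, using '/' separators."""
    found = []
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            rel = os.path.relpath(os.path.join(dirpath, name), root)
            found.append(rel.replace(os.sep, '/'))
    return sorted(found)


with tempfile.TemporaryDirectory() as work:
    # Build code/train.py and code/utils/helpers.py as placeholder files.
    code = os.path.join(work, 'code')
    os.makedirs(os.path.join(code, 'utils'))
    for rel in ('train.py', os.path.join('utils', 'helpers.py')):
        with open(os.path.join(code, rel), 'w') as handle:
            handle.write('# placeholder\n')

    # COPY code/* /opt/ml/code/  -> the wildcard expands to train.py and utils/
    with_star = os.path.join(work, 'with_star')
    copy_like_docker(sorted(glob.glob(os.path.join(code, '*'))), with_star)
    flattened = list_files(with_star)

    # COPY code/ /opt/ml/code/   -> a single directory source
    without_star = os.path.join(work, 'without_star')
    copy_like_docker([code], without_star)
    preserved = list_files(without_star)

print(flattened)   # helpers.py lands next to train.py
print(preserved)   # utils/helpers.py keeps its subdirectory
```

In the Dockerfile itself, the corresponding change is to write `COPY code/ /opt/ml/code/` instead of `COPY code/* /opt/ml/code/`.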
Get started with SageMaker Debugger
Ref: NEW! Amazon SageMaker Studio - Debug Models with Amazon SageMaker Debugger
Key points ...