[SageMaker] Invoked by AWS Lambda

Resources

Deploying to TensorFlow Serving Endpoints - of limited reference value

 

Module 13 Video Material - fairly shallow; the first two parts are somewhat useful.

  • Part 13.1: Flask and Deep Learning Web Services [Video] [Notebook]    [train a model, then build an API service around the trained result]

  • Part 13.2: Deploying a Model to AWS [Video][Notebook]    [deploy an existing model to an endpoint and test it]

  • Part 13.3: Using a Keras Deep Neural Network with a Web Application [Video][Notebook]

  • Part 13.4: When to Retrain Your Neural Network [Video][Notebook]

  • Part 13.5: AI at the Edge: Using Keras on a Mobile Device [Video][Notebook]

 

 

 

 

触发"训练"


I. Triggering a training job from Lambda

Ref: Retraining SageMaker models with Chalice and Serverless [source of the sample code]

Code: https://github.com/juliensimon/aws/blob/master/lambda_frameworks/chalice/sagemakertrainer/app.py

Video: Scheduling the training of a SageMaker model with a Lambda function

Code: https://github.com/juliensimon/aws/tree/master/lambda_frameworks/serverless/sagemakerscheduler [the scheduler itself is not the focus here]

 

  Use the three APIs below to get familiar with the training parameters.

  Also of interest: how HyperParameters passed to CreateTrainingJob reach a custom Docker container.
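  For a job that runs your own container, SageMaker serializes the HyperParameters map as strings into a JSON file inside the container. A minimal sketch of reading them back (the /opt/ml/input/config/hyperparameters.json path is the standard SageMaker convention; the epoch/step keys are simply the ones used in the training_params example further below):

# Inside the custom training container: read hyperparameters passed via CreateTrainingJob.
# SageMaker writes every value as a string into this JSON file.
import json

HP_PATH = '/opt/ml/input/config/hyperparameters.json'

with open(HP_PATH) as f:
    hyperparameters = json.load(f)   # e.g. {"epoch": "3", "step": "30", ...}

epochs = int(hyperparameters.get('epoch', '1'))   # values arrive as strings, cast as needed
steps  = int(hyperparameters.get('step', '10'))
print('epochs =', epochs, 'steps =', steps)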

 

  • Listing jobs

@app.route('/list/{results}')
def list_jobs(results):
    jobs = sm.list_training_jobs(MaxResults=int(results), SortBy="CreationTime", SortOrder="Descending")
    job_names = map(lambda job: [job['TrainingJobName'], job['TrainingJobStatus']],  jobs['TrainingJobSummaries'])
    return {'jobs': list(job_names)}
$ http $URL/list/1

 

  • Describing a job

@app.route('/get/{name}')
def get_job_by_name(name):
    job = sm.describe_training_job(TrainingJobName=name)
    return {'job': str(job)}
$ http get $URL/get/imageclassification-2018-05-12

 

  • Retraining a job

There is actually no need to accept this many parameters; calling create_training_job directly would be enough.

@app.route('/train/{name}', methods=['POST'])
def train_job_by_name(name):
    job = sm.describe_training_job(TrainingJobName=name)

    body = app.current_request.json_body
    if 'TrainingJobName' not in body:
        raise BadRequestError('Missing new job name')
    else:
        job['TrainingJobName'] = body['TrainingJobName']

    if 'S3OutputPath' in body:  job['OutputDataConfig']['S3OutputPath'] = body['S3OutputPath']
    if 'InstanceType' in body:  job['ResourceConfig']['InstanceType'] = body['InstanceType']
    if 'InstanceCount' in body: job['ResourceConfig']['InstanceCount'] = int(body['InstanceCount'])

    if 'VpcConfig' in job:
        resp = sm.create_training_job(
            TrainingJobName=job['TrainingJobName'],
            AlgorithmSpecification=job['AlgorithmSpecification'],
            RoleArn=job['RoleArn'],
            InputDataConfig=job['InputDataConfig'],
            OutputDataConfig=job['OutputDataConfig'],
            ResourceConfig=job['ResourceConfig'],
            StoppingCondition=job['StoppingCondition'],
            HyperParameters=job['HyperParameters'] if 'HyperParameters' in job else {},
            VpcConfig=job['VpcConfig'],
            Tags=job['Tags'] if 'Tags' in job else [])
    else:
        # Because VpcConfig cannot be empty like HyperParameters or Tags :-/
        resp = sm.create_training_job(
            TrainingJobName=job['TrainingJobName'],
            AlgorithmSpecification=job['AlgorithmSpecification'],
            RoleArn=job['RoleArn'],
            InputDataConfig=job['InputDataConfig'],
            OutputDataConfig=job['OutputDataConfig'],
            ResourceConfig=job['ResourceConfig'],
            StoppingCondition=job['StoppingCondition'],
            HyperParameters=job['HyperParameters'] if 'HyperParameters' in job else {},
            Tags=job['Tags'] if 'Tags' in job else [])

    return {'ResponseMetadata': resp['ResponseMetadata']}
$ http post $URL/train/DEMO-imageclassification-2018-05-10-09-58-12 \
                       TrainingJobName=imageclassification-2018-05-12 \
                       InstanceType=ml.p3.2xlarge \
                       InstanceCount=2

 

 

II. Asynchronous training

Ref: End-to-End Multiclass Image Classification Example

With HyperParameters modified as follows, this works.

training_params = \
{
    # specify the training docker image
    "AlgorithmSpecification": {
        "TrainingImage": training_image,
        "TrainingInputMode": "File"
    },
    "RoleArn": role,
    "OutputDataConfig": {
        "S3OutputPath": 's3://{}/{}/output'.format(bucket, job_name_prefix)
    },
    "ResourceConfig": {
        "InstanceCount": 1,
        "InstanceType": "ml.c5.2xlarge",
        "VolumeSizeInGB": 50
    },
    "TrainingJobName": job_name,
    "HyperParameters": {
#         "image_shape": image_shape,
#         "num_layers": str(num_layers),
#         "num_training_samples": str(num_training_samples),
#         "num_classes": str(num_classes),
#         "mini_batch_size": str(mini_batch_size),
#         "epochs": str(epochs),
#         "learning_rate": str(learning_rate),
        "train": "dataset",
        "test": "dataset",
        "pretrain": "INPUT/my_model.h5",
        "output": "OUTPUT",
        "epoch": str(3),
        "step": str(30)
    },
    "StoppingCondition": {
        "MaxRuntimeInSeconds": 360000
    },
    # Training data should be inside a subdirectory called "train"
    # Validation data should be inside a subdirectory called "validation"
    # The algorithm currently only supports fullyreplicated model (where data is copied onto each machine)
#     "InputDataConfig": [
#         {
#             "ChannelName": "train",
#             "DataSource": {
#                 "S3DataSource": {
#                     "S3DataType": "S3Prefix",
#                     "S3Uri": s3_train,
#                     "S3DataDistributionType": "FullyReplicated"
#                 }
#             },
#             "ContentType": "application/x-recordio",
#             "CompressionType": "None"
#         },
#         {
#             "ChannelName": "validation",
#             "DataSource": {
#                 "S3DataSource": {
#                     "S3DataType": "S3Prefix",
#                     "S3Uri": s3_validation,
#                     "S3DataDistributionType": "FullyReplicated"
#                 }
#             },
#             "ContentType": "application/x-recordio",
#             "CompressionType": "None"
#         }
#     ]
}
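create_training_job itself is asynchronous: it returns as soon as the job is accepted, not when training finishes. A minimal sketch of submitting the job defined above and checking on it later (assumes training_params as defined above):

import boto3

# The CreateTrainingJob call returns immediately; the job then runs on its own instances.
sagemaker_client = boto3.client('sagemaker')
sagemaker_client.create_training_job(**training_params)

# Check the status later, or attach a waiter if the caller is allowed to block.
status = sagemaker_client.describe_training_job(
    TrainingJobName=training_params['TrainingJobName'])['TrainingJobStatus']
print('Training job status:', status)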

But a Lambda function cannot sit around waiting for the job to finish, and on the training side we also want to watch the training log in real time. What to do? The steps below hand the long-running work off instead of running it inside Lambda.

 

Step 1

Receive tasks from a queue: [AWS] How to Use SQS as an Event Source for AWS Lambda

More precisely, it is SQS that actively triggers the Lambda function; a producer-side sketch follows.
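For completeness, a hedged sketch of the producer side, pushing a job request onto the queue. The queue URL and the message fields (job_id, job_details.start_index) are assumptions chosen to match the consumer code in Step 2:

import json
import boto3

sqs = boto3.client('sqs')

# Hypothetical queue URL; use the queue wired to the Lambda event source mapping.
QUEUE_URL = 'https://sqs.us-east-1.amazonaws.com/123456789012/training-jobs'

message = {
    'job_id': 'job-0001',
    'job_details': {
        'start_index': 0,
    },
}

# Every message delivered to this queue shows up in the Lambda handler's event["Records"].
sqs.send_message(QueueUrl=QUEUE_URL, MessageBody=json.dumps(message))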

 

Step 2

Reference: [AWS] ECS - Fargate practice step by step

       for record in event["Records"]:
            reqBody = record['body']
            dictBody = ast.literal_eval(reqBody)

            job_id = dictBody["job_id"]
            print("Job Id: ", job_id)

            start_index = dictBody["job_details"]["start_index"]
            print("Start Index: ", start_index)

            ... ...
            # Triggering the ECS Fargate Job
            response = client.run_task(
                cluster='detection-demo', # name of the cluster
                launchType = 'FARGATE',
                taskDefinition='Image-Mask:66', # replace with your task definition name and revision
                count = 1,
                platformVersion='LATEST',
                networkConfiguration={
                    'awsvpcConfiguration': {
                        'subnets': [
                            'subnet-37b06788', # replace with your public subnet or a private with NAT
                            'subnet-37b06788' # Second is optional, but good idea to have two
                        ],
                        'assignPublicIp': 'ENABLED'
                    }
                },
                overrides={
                    'containerOverrides': [
                        {
                            'name': 'demo-fargate',
                            'environment': [
                                {
                                    'name': 'job_id',
                                    'value': job_id
                                },
                                {
                                    'name': 'start_index',
                                    'value': start_index
                                },
                                ... ...
                            ]
                        },
                    ]
                })

 

Step 3

The values are written into the container as environment variables, and the runtime code inside the container then reads them back.

if __name__ == '__main__':
    try:
        print(os.environ)

        job_id = os.environ["job_id"]
        print("Job Id: " + str(job_id))
        print("Image masking started for Job id: " + str(job_id) )
        current_time = str(int((time.time() + 0.5) * 1000))

        start_index = int(os.environ["start_index"])
        print("Starting Index: " + str(start_index))
    ... ...

 

 

III. Pay-per-run billing

The cheapest GPU: jobs run serially, for free.

A better GPU: jobs run in parallel, billed on demand; for example, one training run takes ten minutes and you pay per run.

 

  • Instance selection and pricing

Ref: https://aws.amazon.com/sagemaker/pricing/ 

Accelerated Computing   vCPU   Memory    Price per Hour
ml.p3.2xlarge             8     61 GiB   $3.825
ml.p3.8xlarge            32    244 GiB   $14.688
ml.p3.16xlarge           64    488 GiB   $28.152
ml.g4dn.xlarge            4     16 GiB   $0.7364   (1100 USD per month)
ml.g4dn.2xlarge           8     32 GiB   $1.0528
ml.g4dn.4xlarge          16     64 GiB   $1.6856
ml.g4dn.8xlarge          32    128 GiB   $3.0464
ml.g4dn.12xlarge         48    192 GiB   $5.4768
ml.g4dn.16xlarge         64    256 GiB   $6.0928

 

Ref: https://www.amazonaws.cn/en/ec2/instance-types/

Amazon EC2 P3 instances deliver high performance compute in the cloud with up to 8 NVIDIA® V100 Tensor Core GPUs and up to 100 Gbps of networking throughput for machine learning and HPC applications. These instances deliver up to one petaflop of mixed-precision performance per instance to significantly accelerate machine learning and high performance computing applications. Amazon EC2 P3 instances have been proven to reduce machine learning training times from days to minutes, as well as increase the number of simulations completed for high performance computing by 3-4x.  

    • p3.2xlarge:  1 GPU, 8 vCPUs,   61 GiB of memory, up to 10 Gbps network performance
    • p3.8xlarge,  4 GPU, 32 vCPUs, 244 GiB of memory, 10 Gbps network performance
    • p3.16xlarge, 8 GPU, 64 vCPUs, 488 GiB of memory, 25 Gbps network performance

Amazon EC2 G4 instances deliver a cost-effective GPU instance for deploying machine learning models in production and graphics-intensive applications. G4 instances provide the latest generation NVIDIA T4 GPUs, AWS custom Intel Cascade Lake CPUs, up to 100 Gbps of networking throughput, and up to 1.8 TB of local NVMe storage. These instances deliver up to 65 TFLOPs of FP16 performance to accelerate machine learning inference applications and ray-tracing cores to accelerate graphics workloads such as graphics workstations, video transcoding, and game streaming in the cloud.

  • g4dn.xlarge,   1 GPU,  4 vCPUs,   16 GiB of memory,   125 NVMe SSD, up to 25 Gbps network performance
  • g4dn.2xlarge,  1 GPU,  8 vCPUs,   32 GiB of memory,   225 NVMe SSD, up to 25 Gbps network performance
  • g4dn.4xlarge,  1 GPU,  16 vCPUs,  64 GiB of memory,   225 NVMe SSD, up to 25 Gbps network performance
  • g4dn.8xlarge,  1 GPU,  32 vCPUs, 128 GiB of memory, 1x900 NVMe SSD, 50 Gbps network performance
  • g4dn.16xlarge, 1 GPU,  64 vCPUs, 256 GiB of memory, 1x900 NVMe SSD, 50 Gbps network performance
  • g4dn.12xlarge, 4 GPUs, 48 vCPUs, 192 GiB of memory, 1x900 NVMe SSD, 50 Gbps network performance

 

  • Performance comparison

The NVIDIA Tesla P4, T4, P40, and V100 are the flagship products of the Tesla GPU line.

For comparison: in FP32 an RTX 2080 Ti runs at about 80% of a Tesla V100's speed, and in FP16 at about 82%; on price, the 2080 Ti costs roughly 1/5 of a Tesla V100.

                        Tesla T4               Tesla V100                  Tesla P4              Tesla P40
Positioning             World's leading        General-purpose             Ultra-efficient,      Inference-throughput
                        inference accelerator  data-center GPU             scale-out servers     servers
FP32 performance        8.1 TFLOPS             14 TFLOPS (PCIe)            5.5 TFLOPS            12 TFLOPS
                                               15.7 TFLOPS (SXM2)
FP16 performance        65 TFLOPS              112 TFLOPS (PCIe)           -                     -
                                               125 TFLOPS (SXM2)
INT8 performance        130 TOPS               -                           22 TOPS*              47 TOPS*
INT4 performance        260 TOPS               -                           -                     -
GPU memory              16 GB                  32/16 GB HBM2               8 GB                  24 GB
Memory bandwidth        320 GB/s               900 GB/s                    192 GB/s              346 GB/s
Interface / form factor PCIe, low profile      PCIe, dual-slot full        PCIe, low profile     PCIe, dual-slot
                                               height; SXM2/NVLink                               full height
Power                   70 W                   250 W (PCIe)                50 W / 75 W           250 W
                                               300 W (SXM2)
Hardware video engines  1 decode, 2 encode     -                           1 decode, 2 encode    1 decode, 2 encode
(Table source: 云服务器吧)

 

 

 

 

 

触发"推断"


Video: Invoking a SageMaker model from a Lambda function

Code: https://github.com/juliensimon/aws/tree/master/lambda_frameworks/chalice/predictor

 

I. Deploying for inference

  • High-level deployment

[Elastic Inference]

Deploy the best model on a GPU instance, and with Elastic Inference.

c5.large + eia1.medium gives you performance comparable to p2.xlarge at a 77% discount.

You'll save $754 per instance per month

Just adding an accelerator_type gives roughly the same performance at a very different price.

Ref: https://gitlab.com/juliensimon/amazon-studio-demos/blob/master/elastic-inference.ipynb

ic_endpoint = ic.deploy(initial_instance_count=1,
                        instance_type='ml.p2.xlarge', # $1.361/hour in eu-west-1
                        endpoint_name='ic-endpoint')  

ic_endpoint_ei = ic.deploy(initial_instance_count=1,
                         instance_type='ml.c5.large',        # $0.134/hour in eu-west-1
                         accelerator_type='ml.eia2.medium',  # $0.181/hour in eu-west-1
                         endpoint_name='ic-endpoint-ei')

 

  • Boto3 deployment

(1) Model

%%time
import time
import boto3
from time import gmtime, strftime

sage = boto3.Session().client(service_name='sagemaker')

#############################################################
model_name = "DEMO-full-image-classification-model-" + time.strftime('-%Y-%m-%d-%H-%M-%S', time.gmtime())
print(model_name)

#############################################################
info = sage.describe_training_job(TrainingJobName=job_name)
model_data = info['ModelArtifacts']['S3ModelArtifacts']
print(model_data)

hosting_image = get_image_uri(boto3.Session().region_name, 'image-classification')
#------------------------------------------------------------
primary_container = {
    'Image': hosting_image,
    'ModelDataUrl': model_data,
}

#############################################################
create_model_response = sage.create_model(
    ModelName=model_name,
    ExecutionRoleArn=role,
    PrimaryContainer=primary_container)
print(create_model_response['ModelArn'])

 

(2) Endpoint config

import time
from time import gmtime, strftime

timestamp = time.strftime('-%Y-%m-%d-%H-%M-%S', time.gmtime())
endpoint_config_name = job_name_prefix + '-epc-' + timestamp

endpoint_config_response = sage.create_endpoint_config(
    EndpointConfigName=endpoint_config_name,
    ProductionVariants=[{
        'InstanceType': 'ml.m4.xlarge',
        'InitialInstanceCount': 1,
        'ModelName': model_name,
        'AcceleratorType': 'ml.eia1.large',
        'VariantName': 'AllTraffic'}])

print('Endpoint configuration name: {}'.format(endpoint_config_name))
print('Endpoint configuration arn: {}'.format(endpoint_config_response['EndpointConfigArn']))

 

(3) Endpoint

%%time
import time

timestamp = time.strftime('-%Y-%m-%d-%H-%M-%S', time.gmtime())
endpoint_name = job_name_prefix + '-ep-' + timestamp
print('Endpoint name: {}'.format(endpoint_name))

endpoint_params = {
    'EndpointName': endpoint_name,
    'EndpointConfigName': endpoint_config_name,
}
endpoint_response = sage.create_endpoint(**endpoint_params)
print('EndpointArn = {}'.format(endpoint_response['EndpointArn']))

Wait until creation completes.

# get the status of the endpoint
response = sage.describe_endpoint(EndpointName=endpoint_name)
status = response['EndpointStatus']
print('EndpointStatus = {}'.format(status))

# wait until the status has changed
sage.get_waiter('endpoint_in_service').wait(EndpointName=endpoint_name)

# print the status of the endpoint
endpoint_response = sage.describe_endpoint(EndpointName=endpoint_name)
status = endpoint_response['EndpointStatus']
print('Endpoint creation ended with EndpointStatus = {}'.format(status))

if status != 'InService':
    raise Exception('Endpoint creation failed.')

Once the endpoint is up, hourly billing starts and the endpoint can be invoked.
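Before wiring up Lambda, a quick way to smoke-test the endpoint directly from a notebook with boto3 (the test image path is just a placeholder; the image-classification container accepts raw image bytes):

import boto3

runtime = boto3.Session().client(service_name='sagemaker-runtime')

# Hypothetical local test image.
with open('test.jpg', 'rb') as f:
    payload = f.read()

response = runtime.invoke_endpoint(EndpointName=endpoint_name,
                                   ContentType='application/x-image',
                                   Body=payload)
print(response['Body'].read().decode())   # list of class probabilities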

 

 

II. Lambda invoking the runtime service

juliensimon/aws/lambda_frameworks/chalice/predictor: 
 
Chalice is quite similar to Flask, and serves much the same purpose as Zappa.
from chalice import Chalice
from chalice import BadRequestError
import base64, os, boto3, ast
import numpy as np

app = Chalice(app_name='predictor')
app.debug=True

@app.route('/', methods=['POST'])
def index():
    body = app.current_request.json_body

    if 'data' not in body:
        raise BadRequestError('Missing image data')
    if 'ENDPOINT_NAME' not in os.environ:
        raise BadRequestError('Missing endpoint')

    image    = base64.b64decode(body['data']) # byte array
    endpoint = os.environ['ENDPOINT_NAME'] 
    
    if 'topk' not in body:
        topk = 257
    else:
        topk = body['topk']

    print("%s %d" % (endpoint, topk))
    
    # ------------------------------------------------------
    runtime = boto3.Session().client(service_name='sagemaker-runtime',
                                     region_name='us-east-1')   # security keys etc. can also be passed here; see the early-2020 commit of Part 13.2

    response = runtime.invoke_endpoint(EndpointName=endpoint,
                                       ContentType='application/x-image',
                                       Body=image)
    probs = response['Body'].read().decode()     # byte array
    # ------------------------------------------------------
    probs = ast.literal_eval(probs)               # array of floats
    probs = np.array(probs)                       # numpy array of floats
    topk_indexes = probs.argsort()                # indexes in ascending order of probabilities
    topk_indexes = topk_indexes[::-1][:topk]      # indexes for top k probabilities in descending order

    topk_categories = []
    for i in topk_indexes:
        topk_categories.append((i+1, probs[i]))

    return {'response': str(topk_categories)}
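A hedged sketch of calling this service after deploying it with chalice deploy; the API URL and test image are placeholders, and the SageMaker endpoint name must be set as the ENDPOINT_NAME environment variable of the deployed Lambda (for example via environment_variables in .chalice/config.json):

import base64
import json
import requests

# Placeholder: the URL printed by `chalice deploy`.
API_URL = 'https://abcd1234.execute-api.us-east-1.amazonaws.com/api/'

with open('test.jpg', 'rb') as f:                  # any local test image
    data = base64.b64encode(f.read()).decode()

resp = requests.post(API_URL, json={'data': data, 'topk': 5})
print(json.dumps(resp.json(), indent=2))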

 

 

III. Runtime pricing

Ref: Amazon SageMaker Pricing

Ref: https://www.ec2instances.info/ [rough estimate: about $200 per month]

Keeping every model in memory, as in the comparison below, guarantees real-time latency but is too expensive. Better to keep the models on disk.

                                        Single-model endpoints   Multi-model endpoint
Total endpoint price per month          171,360 USD              1,017 USD
Endpoint instance type                  ml.c5.large              ml.r5.2xlarge
Memory capacity (GiB)                   4                        64
Endpoint price per hour                 0.119 USD                0.706 USD
Instances per endpoint                  2                        2
Endpoints needed for 1,000 models       1,000                    1
Endpoint 90th-percentile latency (ms)   7                        7
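The cheap column corresponds to a SageMaker multi-model endpoint: one endpoint, with model artifacts kept in S3 and loaded into memory only when requested. A hedged sketch, reusing role and hosting_image from the deployment code above; the bucket, model and endpoint names are placeholders, and the container image must support multi-model hosting:

import boto3

sm = boto3.client('sagemaker')
runtime = boto3.client('sagemaker-runtime')

# Multi-model endpoint: ModelDataUrl points at an S3 prefix holding many model.tar.gz files,
# and Mode='MultiModel' tells SageMaker to load each one lazily on first request.
sm.create_model(
    ModelName='demo-multi-model',
    ExecutionRoleArn=role,
    PrimaryContainer={
        'Image': hosting_image,                    # must be a multi-model-capable image
        'ModelDataUrl': 's3://my-bucket/models/',
        'Mode': 'MultiModel',
    })

# Create the endpoint config and endpoint as in section I, then pick a model per call:
response = runtime.invoke_endpoint(
    EndpointName='demo-multi-model-endpoint',
    TargetModel='model-0042.tar.gz',               # which artifact under the S3 prefix to use
    ContentType='application/x-image',
    Body=payload)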

 

End. 

