[SageMaker] DNN with Amazon SageMaker

Evolution of the approach

  A course on training image classifiers: https://www.udemy.com/course/practical-aws-sagemaker-6-real-world-case-studies/

 

1. A traditional Keras example

  • Build and train

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.layers import Dense, Activation, Dropout, BatchNormalization
from tensorflow.keras.optimizers import Adam

# optimizer = Adam()
ANN_model = keras.Sequential()
ANN_model.add(Dense(50, input_dim = 8))
ANN_model.add(Activation('relu'))
ANN_model.add(Dense(150))
ANN_model.add(Activation('relu'))
ANN_model.add(Dropout(0.5))
ANN_model.add(Dense(150))
ANN_model.add(Activation('relu'))
ANN_model.add(Dropout(0.5))
ANN_model.add(Dense(50))
ANN_model.add(Activation('linear'))
ANN_model.add(Dense(1))

ANN_model.compile(loss = 'mse', optimizer = 'adam')
ANN_model.summary()


# Recompiling with equivalent settings; a single compile call would be enough
ANN_model.compile(optimizer='Adam', loss='mean_squared_error')

epochs_hist = ANN_model.fit(X_train, y_train, epochs = 100, batch_size = 20, validation_split = 0.2)

 

  • Metric 1

result = ANN_model.evaluate(X_test, y_test)   # evaluate() returns the test loss (MSE)
accuracy_ANN = 1 - result                     # rough proxy: 1 - MSE loss (targets are scaled)
print("Accuracy : {}".format(accuracy_ANN))

 

  • Metric 2

epochs_hist.history.keys()
# dict_keys(['loss', 'val_loss']) -- these two metrics are available for plotting

import matplotlib.pyplot as plt

plt.plot(epochs_hist.history['loss'])
plt.plot(epochs_hist.history['val_loss'])


plt.title('Model Loss Progress During Training')
plt.xlabel('Epoch')
plt.ylabel('Training and Validation Loss')
plt.legend(['Training Loss', 'Validation Loss'])

 

  • Metric 3

y_predict = ANN_model.predict(X_test)
plt.plot(y_test, y_predict, "^", color = 'r')
plt.xlabel('True Values')
plt.ylabel('Model Predictions')

# Invert the target scaling before visualizing the predictions in original units
y_predict_orig = scaler_y.inverse_transform(y_predict)
y_test_orig    = scaler_y.inverse_transform(y_test)

plt.plot(y_test_orig, y_predict_orig, "^", color = 'r')
plt.xlabel('True Values')
plt.ylabel('Model Predictions')

 

  • Metric 4

k = X_test.shape[1]   # number of features
n = len(X_test)       # number of test samples

import numpy as np
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error

RMSE = float(format(np.sqrt(mean_squared_error(y_test_orig, y_predict_orig)),'.3f'))
MSE  = mean_squared_error(y_test_orig, y_predict_orig)
MAE  = mean_absolute_error(y_test_orig, y_predict_orig)
r2   = r2_score(y_test_orig, y_predict_orig)
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)

print('RMSE =', RMSE, '\nMSE =', MSE, '\nMAE =', MAE, '\nR2 =', r2, '\nAdjusted R2 =', adj_r2)
RMSE = 6096.083 
MSE = 37162224.0 
MAE = 3919.0542 
R2 = 0.798287948291905 
Adjusted R2 = 0.7920574602082573

 

 

2. SageMaker: invoking a training script indirectly

[AWS's general-purpose recognition services]

Ref: AWS Innovate | Intro to Deep Learning: Building an Image Classifier on Amazon SageMaker

Ref: https://us-west-2.console.aws.amazon.com/rekognition/home?region=us-west-2#/video-analysis

 

  First, let's see what changes when the workflow moves to SageMaker.

 

  • Passing parameters

[Reminder] tf_estimator.fit is a SageMaker Python SDK call, not a boto3 API.

from sagemaker.tensorflow import TensorFlow

tf_estimator = TensorFlow(entry_point='train-cnn.py', 
                          role=role,
                          train_instance_count=1, 
                          train_instance_type='ml.c4.2xlarge',
                          framework_version='1.12', 
                          py_version='py3',
                          script_mode=True,
                          hyperparameters={
                              'epochs': 2 ,
                              'batch-size': 32,
                              'learning-rate': 0.001,
                              'gpu-count': 1})

# Question: could gpu-count and model-dir also be passed in here as hyperparameters?
tf_estimator.fit({'training': training_input_path, 'validation': validation_input_path})

 

  • Training

A standalone TF training script. Note how it receives the hyperparameters supplied above.

import argparse, os
import numpy as np
import tensorflow
from tensorflow.keras import backend as K
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, Activation, Flatten, BatchNormalization, Conv2D, MaxPooling2D, AveragePooling2D
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.utils import multi_gpu_model


# The training code is wrapped in a main guard (if __name__ == '__main__') so that SageMaker executes it as a script.
# argparse receives the hyperparameters passed in by the estimator:
if __name__ == '__main__':
    
    # Parser to get the arguments
    parser = argparse.ArgumentParser()

    # Model hyperparameters are being sent as command-line arguments.
    parser.add_argument('--epochs', type=int, default=1)
    parser.add_argument('--learning-rate', type=float, default=0.001)
    parser.add_argument('--batch-size', type=int, default=32)
    
        
    # The script receives environment variables inside the training container:
    # SM_NUM_GPUS: how many GPUs are available for training.
    # SM_MODEL_DIR: output path where model artifacts will be saved.
    # SM_CHANNEL_TRAINING: path of the training channel.
    # SM_CHANNEL_VALIDATION: path of the validation channel.

    parser.add_argument('--gpu-count',  type=int, default=os.environ['SM_NUM_GPUS'])
    parser.add_argument('--model-dir',  type=str, default=os.environ['SM_MODEL_DIR'])
    parser.add_argument('--training',   type=str, default=os.environ['SM_CHANNEL_TRAINING'])
    parser.add_argument('--validation', type=str, default=os.environ['SM_CHANNEL_VALIDATION'])

    args, _ = parser.parse_known_args()
    
    # Hyperparameters
    epochs     = args.epochs
    lr         = args.learning_rate
    batch_size = args.batch_size
    gpu_count= args.gpu_count
    model_dir  = args.model_dir
    training_dir   = args.training
    validation_dir = args.validation
    
    # Loading the training and validation data from s3 bucket
    train_images = np.load(os.path.join(training_dir,   'training.npz'  ))['image']
    train_labels = np.load(os.path.join(training_dir,   'training.npz'  ))['label']
    test_images  = np.load(os.path.join(validation_dir, 'validation.npz'))['image']
    test_labels  = np.load(os.path.join(validation_dir, 'validation.npz'))['label']

    K.set_image_data_format('channels_last')

    # Reshape the images to (num_samples, 32, 32, 3)
    train_images = train_images.reshape(train_images.shape[0], 32, 32, 3)
    test_images  = test_images.reshape(test_images.shape[0], 32, 32, 3)
    input_shape  = (32, 32, 3)
    
    # Normalizing the data
    train_images = train_images.astype('float32')
    test_images  = test_images.astype('float32')
    train_images /= 255
    test_images  /= 255

    train_labels = tensorflow.keras.utils.to_categorical(train_labels, 43)
    test_labels  = tensorflow.keras.utils.to_categorical(test_labels, 43)

    
    
    #LeNet Network Architecture
    
    model = Sequential()
    model.add(Conv2D(filters=6, kernel_size=(5, 5), activation='relu', input_shape= input_shape))
    model.add(AveragePooling2D())
    model.add(Conv2D(filters=16, kernel_size=(5, 5), activation='relu'))   
    model.add(AveragePooling2D())
    model.add(Flatten())
    model.add(Dense(units=120, activation='relu'))
    model.add(Dense(units=84, activation='relu'))
    model.add(Dense(units=43, activation = 'softmax'))
    print(model.summary())

    
    # If more than one GPU is available, convert the model to multi-gpu model
    if gpu_count > 1:
        model = multi_gpu_model(model, gpus=gpu_count)

    # Compile and train the model
    model.compile(loss=tensorflow.keras.losses.categorical_crossentropy,
                  optimizer=Adam(lr=lr),
                  metrics=['accuracy'])

    model.fit(train_images, train_labels, batch_size=batch_size,
                  validation_data=(test_images, test_labels),
                  epochs=epochs,
                  verbose=2)

    # Evaluating the model
    score = model.evaluate(test_images, test_labels, verbose=0)
    print('Validation loss    :', score[0])
    print('Validation accuracy:', score[1])

    # Save the trained Keras CNN model to model_dir (the path specified earlier) in SavedModel format
    sess = K.get_session()
    tensorflow.saved_model.simple_save(
        sess,
        os.path.join(model_dir, 'model/1'),
        inputs={'inputs': model.input},
        outputs={t.name: t for t in model.outputs})

 

An open question: why didn't the training duration shrink when more GPUs were used? (Judging from the logs below, each epoch takes only a few seconds, so the wall-clock duration is likely dominated by instance provisioning, data download, and model upload rather than by compute.)

Train on 34799 samples, validate on 12630 samples
Epoch 1/2

2021-01-08 11:12:20 Training - Training image download completed. Training in progress. - 18s - loss: 1.2888 - acc: 0.6346 - val_loss: 0.8079 - val_acc: 0.7871
Epoch 2/2
 - 4s - loss: 0.3251 - acc: 0.9054 - val_loss: 0.6253 - val_acc: 0.8528
Validation loss    : 0.6252862956253952
Validation accuracy: 0.8528107680221069
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow/python/saved_model/simple_save.py:85: calling SavedModelBuilder.add_meta_graph_and_variables (from tensorflow.python.saved_model.builder_impl) with legacy_init_op is deprecated and will be removed in a future version.
Instructions for updating:
Pass your op to the equivalent parameter main_op instead.
2021-01-08 11:12:28,434 sagemaker-containers INFO     Reporting training SUCCESS

2021-01-08 11:12:40 Uploading - Uploading generated training model
2021-01-08 11:12:40 Completed - Training job completed
Training seconds: 68
Billable seconds: 68
Duration is 252.96774005889893
'gpu-count': 1
Train on 34799 samples, validate on 12630 samples
Epoch 1/2

2021-01-08 11:12:20 Training - Training image download completed. Training in progress. - 18s - loss: 1.2888 - acc: 0.6346 - val_loss: 0.8079 - val_acc: 0.7871
Epoch 2/2
 - 4s - loss: 0.3251 - acc: 0.9054 - val_loss: 0.6253 - val_acc: 0.8528
Validation loss    : 0.6252862956253952
Validation accuracy: 0.8528107680221069
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow/python/saved_model/simple_save.py:85: calling SavedModelBuilder.add_meta_graph_and_variables (from tensorflow.python.saved_model.builder_impl) with legacy_init_op is deprecated and will be removed in a future version.
Instructions for updating:
Pass your op to the equivalent parameter main_op instead.
2021-01-08 11:12:28,434 sagemaker-containers INFO     Reporting training SUCCESS

2021-01-08 11:12:40 Uploading - Uploading generated training model
2021-01-08 11:12:40 Completed - Training job completed
Training seconds: 68
Billable seconds: 68
Duration is 252.96774005889893
'gpu-count': 2

 

  • Deployment

tf_predictor = tf_estimator.deploy(initial_instance_count = 1,
                         instance_type = 'ml.t2.medium',  
                         endpoint_name = tf_endpoint_name)

#########################################################
%matplotlib inline
import random
import matplotlib.pyplot as plt

# Pre-process a few sample images
num_samples = 5
indices = random.sample(range(X_test.shape[0] - 1), num_samples)

# The images are normalized here, but that does not affect imshow's display
images = X_test[indices] / 255
labels = y_test[indices]

for i in range(num_samples):
    plt.subplot(1, num_samples, i + 1)
    plt.imshow(images[i])
    plt.title(labels[i])
    plt.axis('off')

# Making predictions
prediction = tf_predictor.predict(images.reshape(num_samples, 32, 32, 3))['predictions']
prediction = np.array(prediction)
predicted_label = prediction.argmax(axis=1)
print('Predicted labels are: {}'.format(predicted_label))
#########################################################
# Deleting the endpoint
tf_predictor.delete_endpoint()

 

 

3. SageMaker: built-in algorithms

Goto: https://github.com/PacktPublishing/Learn-Amazon-SageMaker/tree/master/sdkv2/ch5

[Julien Simon is excellent; he is also the author of the book]

(1) Using Script Mode with Amazon SageMaker

(2) End to end demo with Keras and Amazon SageMaker 

(3) SageMaker Fridays

 

Here we point at the built-in image-classification container and just set its hyperparameters; compared with the previous approach, no training script is needed.

# Get the name of the image classification algorithm in our region

from sagemaker import image_uris

region = boto3.Session().region_name    
container = image_uris.retrieve('image-classification', region)
print(container)

# Configure the training job
role = sagemaker.get_execution_role()

ic = sagemaker.estimator.Estimator(container,
                                   role,
                                   instance_count=1,
                                   instance_type='ml.p2.xlarge',
                                   output_path=s3_output)

# Set algorithm parameters
ic.set_hyperparameters(num_layers=18,              # Train a ResNet-18 model
                       use_pretrained_model=0,     # Train from scratch
                       num_classes=2,              # Dogs and cats
                       num_training_samples=22500, # Number of training samples
                       mini_batch_size=128,
                       resize=224,
                       epochs=10)                  # Iterate over the training samples 10 times

# Set dataset parameters
train_data = sagemaker.TrainingInput(s3_train_path,
                                     distribution='FullyReplicated',
                                     content_type='application/x-image',
                                     s3_data_type='S3Prefix')
val_data = sagemaker.TrainingInput(s3_val_path,
                                   distribution='FullyReplicated',
                                   content_type='application/x-image',
                                   s3_data_type='S3Prefix')
train_lst_data = sagemaker.TrainingInput(s3_train_lst_path,
                                         distribution='FullyReplicated',
                                         content_type='application/x-image',
                                         s3_data_type='S3Prefix')
val_lst_data = sagemaker.TrainingInput(s3_val_lst_path,
                                       distribution='FullyReplicated',
                                       content_type='application/x-image',
                                       s3_data_type='S3Prefix')

s3_channels = {'train': train_data, 'validation': val_data,
               'train_lst': train_lst_data, 'validation_lst': val_lst_data}

# Train the model
%%time
ic.fit(inputs=s3_channels)

[Reminder] ic.fit is a SageMaker Python SDK call, not a boto3 API.
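
Once the training job finishes, the same estimator can be deployed straight to an endpoint, just like tf_estimator.deploy in section 2. A minimal sketch (the instance type below is an arbitrary choice for illustration):

ic_predictor = ic.deploy(initial_instance_count=1,
                         instance_type='ml.m4.xlarge')   # illustrative instance type

# ...run predictions with ic_predictor.predict(...), then clean up the endpoint:
ic_predictor.delete_endpoint()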

 

 

4. The Boto3 API approach

  • Key points

[Reminder] As seen above, both the built-in and the TF approaches use the SageMaker Python SDK. Below we run a training job through the boto3 API instead.

 

  In short, fit() goes away and is replaced by create_training_job().

  boto3 is the natural choice when the code runs in a third-party environment such as Lambda rather than in a notebook, as sketched below.
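
A minimal sketch of such a Lambda handler, assuming the training_params dictionary from the Hyperparameters section below is available (the job-name prefix is purely illustrative):

# Hypothetical Lambda handler that launches a SageMaker training job via boto3
import time
import boto3

sm = boto3.client('sagemaker')

def lambda_handler(event, context):
    # training_params is the dictionary built in the "Hyperparameters" section below,
    # assembled here or passed in via the event payload
    job_name = 'image-classification-' + time.strftime('%Y-%m-%d-%H-%M-%S', time.gmtime())
    training_params['TrainingJobName'] = job_name
    sm.create_training_job(**training_params)
    return {'TrainingJobName': job_name}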

 

Get the Docker image for the image-classification algorithm.

containers = {'us-west-2': '433757028032.dkr.ecr.us-west-2.amazonaws.com/image-classification:latest',
              'us-east-1': '811284229777.dkr.ecr.us-east-1.amazonaws.com/image-classification:latest',
              'us-east-2': '825641698319.dkr.ecr.us-east-2.amazonaws.com/image-classification:latest',
              'eu-west-1': '685385470294.dkr.ecr.eu-west-1.amazonaws.com/image-classification:latest'}
training_image = containers[boto3.Session().region_name]
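
Incidentally, for the built-in algorithms this hard-coded map of account IDs can be replaced by the SDK helper already used in section 3, which resolves the image URI per region (a small sketch; it assumes the sagemaker package is importable in this environment):

import boto3
from sagemaker import image_uris

# Same image URI as the dictionary lookup above, but resolved by the SDK
training_image = image_uris.retrieve('image-classification', boto3.Session().region_name)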

Can a custom image be used instead?

Ref: Unable to use this library in AWS Lambda due to package size exceeded max limit #1200

Companion video: Image classification with Amazon SageMaker [main reference for this section]

Apparently not directly, but using the SageMaker Python SDK inside Lambda is workable; the workaround described in the link above can be used.

 

  • Hyperparameters

(1) With the TF estimator and the default TF image, part of the training log looks like this:

Training Env:

{
    "additional_framework_parameters": {},
    "channel_input_dirs": {
        "training": "/opt/ml/input/data/training",
        "validation": "/opt/ml/input/data/validation"
    },
    "current_host": "algo-1",
    "framework_module": "sagemaker_tensorflow_container.training:main",
    "hosts": [
        "algo-1"
    ],
    "hyperparameters": {
        "batch-size": 32,
        "learning-rate": 0.001,
        "gpu-count": 4,
        "model_dir": "s3://sagemaker-us-east-1-006635943311/sagemaker-tensorflow-scriptmode-2021-01-08-10-57-25-977/model",
        "epochs": 2
    },
    "input_config_dir": "/opt/ml/input/config",
    "input_data_config": {
        "training": {
            "TrainingInputMode": "File",
            "S3DistributionType": "FullyReplicated",
            "RecordWrapperType": "None"
        },
        "validation": {
            "TrainingInputMode": "File",
            "S3DistributionType": "FullyReplicated",
            "RecordWrapperType": "None"
        }
    },
    "input_dir": "/opt/ml/input",
    "is_master": true,
    "job_name": "sagemaker-tensorflow-scriptmode-2021-01-08-11-03-39-505",
    "log_level": 20,
    "master_hostname": "algo-1",
    "model_dir": "/opt/ml/model",
    "module_dir": "s3://sagemaker-us-east-1-006635943311/sagemaker-tensorflow-scriptmode-2021-01-08-11-03-39-505/source/sourcedir.tar.gz",
    "module_name": "train-cnn",
    "network_interface_name": "eth0",
    "num_cpus": 32,
    "num_gpus": 4,
    "output_data_dir": "/opt/ml/output/data",
    "output_dir": "/opt/ml/output",
    "output_intermediate_dir": "/opt/ml/output/intermediate",
    "resource_config": {
        "current_host": "algo-1",
        "hosts": [
            "algo-1"
        ],
        "network_interface_name": "eth0"
    },
    "user_entry_point": "train-cnn.py"
}

 

(2) Hyperparameter settings for the boto3 API:

training_params = \
{
    # specify the training docker image
    "AlgorithmSpecification": {
        "TrainingImage": training_image,
        "TrainingInputMode": "File"
    },
    "RoleArn": role,
    "OutputDataConfig": {
        "S3OutputPath": 's3://{}/{}/output'.format(bucket, job_name_prefix)
    },
    "ResourceConfig": {
        "InstanceCount": 1,
        "InstanceType": "ml.p2.8xlarge",
        "VolumeSizeInGB": 50
    },
    "TrainingJobName": job_name,
    "HyperParameters": {
        "image_shape": image_shape,
        "num_layers": str(num_layers),
        "num_training_samples": str(num_training_samples),
        "num_classes": str(num_classes),
        "mini_batch_size": str(mini_batch_size),
        "epochs": str(epochs),
        "learning_rate": str(learning_rate),
        "use_pretrained_model": str(use_pretrained_model)
    },
    "StoppingCondition": {
        "MaxRuntimeInSeconds": 360000
    },
    # Training data should be inside a subdirectory called "train"
    # Validation data should be inside a subdirectory called "validation"
    # The algorithm currently only supports fullyreplicated model (where data is copied onto each machine)
    "InputDataConfig": [
        {
            "ChannelName": "train",
            "DataSource": {
                "S3DataSource": {
                    "S3DataType": "S3Prefix",
                    "S3Uri": 's3://{}/train/cifar10'.format(bucket),
                    "S3DataDistributionType": "FullyReplicated"
                }
            },
            "ContentType": "application/x-recordio",
            "CompressionType": "None"
        },
        {
            "ChannelName": "validation",
            "DataSource": {
                "S3DataSource": {
                    "S3DataType": "S3Prefix",
                    "S3Uri": 's3://{}/validation/cifar10'.format(bucket),
                    "S3DataDistributionType": "FullyReplicated"
                }
            },
            "ContentType": "application/x-recordio",
            "CompressionType": "None"
        }
    ]
}

 

  • Start training

# (1) create the Amazon SageMaker training job
sagemaker = boto3.client(service_name='sagemaker')   # note: this name shadows the sagemaker SDK module
sagemaker.create_training_job(**training_params)

Monitoring with get_waiter:

# (1.1) confirm that the training job has started
status = sagemaker.describe_training_job(TrainingJobName=job_name)['TrainingJobStatus']
print('Training job current status: {}'.format(status))

try:
    # wait for the job to finish and report the ending status
    sagemaker.get_waiter('training_job_completed_or_stopped').wait(TrainingJobName=job_name)
    training_info = sagemaker.describe_training_job(TrainingJobName=job_name)
    status = training_info['TrainingJobStatus']
    print("Training job ended with status: " + status)
except:
    print('Training failed to start')
    # If an exception is raised, the job has failed
    message = sagemaker.describe_training_job(TrainingJobName=job_name)['FailureReason']
    print('Training failed with the following error: {}'.format(message))
Training job current status: InProgress
Training job ended with status: Completed

 

[Comparison] The low-level client can also be obtained from the SageMaker SDK itself, which has its own idiom:

client = estimator.sagemaker_session.sagemaker_client

Monitoring with a while loop:

import time

# Poll the training job status in real time
description = client.describe_training_job(TrainingJobName=job_name)

if description['TrainingJobStatus'] != 'Completed':
    while description['SecondaryStatus'] not in {'Training', 'Completed'}:
        description = client.describe_training_job(TrainingJobName=job_name)
        primary_status   = description['TrainingJobStatus']
        secondary_status = description['SecondaryStatus']
        print('Current job status: [PrimaryStatus: {}, SecondaryStatus: {}]'.format(primary_status, secondary_status))
        time.sleep(15) 
Current job status: [PrimaryStatus: InProgress, SecondaryStatus: Starting]
Current job status: [PrimaryStatus: InProgress, SecondaryStatus: Starting]
Current job status: [PrimaryStatus: InProgress, SecondaryStatus: Starting]
Current job status: [PrimaryStatus: InProgress, SecondaryStatus: Starting]
Current job status: [PrimaryStatus: InProgress, SecondaryStatus: Starting]
Current job status: [PrimaryStatus: InProgress, SecondaryStatus: Starting]
Current job status: [PrimaryStatus: InProgress, SecondaryStatus: Starting]
Current job status: [PrimaryStatus: InProgress, SecondaryStatus: Starting]
Current job status: [PrimaryStatus: InProgress, SecondaryStatus: Starting]
Current job status: [PrimaryStatus: InProgress, SecondaryStatus: Starting]
Current job status: [PrimaryStatus: InProgress, SecondaryStatus: Downloading]
Current job status: [PrimaryStatus: InProgress, SecondaryStatus: Training]
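
Besides polling the job status, the actual training output of a job is written to CloudWatch Logs under the /aws/sagemaker/TrainingJobs log group. A hedged sketch for pulling it with boto3 (a real job may write several streams and require pagination):

import boto3

logs  = boto3.client('logs')
group = '/aws/sagemaker/TrainingJobs'

# Each training job writes log streams prefixed with its job name
streams = logs.describe_log_streams(logGroupName=group, logStreamNamePrefix=job_name)['logStreams']
for stream in streams:
    events = logs.get_log_events(logGroupName=group, logStreamName=stream['logStreamName'])
    for event in events['events']:
        print(event['message'])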

  

 

 

 

Deployment and Invocation


Create Endpoint 

1. In a SageMaker notebook

The deployment flow shown earlier involves three key methods:

tf_predictor = tf_estimator.deploy(initial_instance_count = 1,
                                   instance_type = 'ml.t2.medium',  
                                   endpoint_name = tf_endpoint_name)

prediction = tf_predictor.predict(images.reshape(num_samples, 32, 32, 3))['predictions']

# Deleting the end-point
tf_predictor.delete_endpoint()

 

2. Creating an endpoint with Boto3

  • Create Model

Connect to the SageMaker service and prepare the container environment, i.e., the software side of the model.

%%time
import boto3
from time import gmtime, strftime

sage= boto3.Session().client(service_name='sagemaker') 

###########################################################

model_name="image-classification-cifar-transfer"
###########################################################

# Get the trained model artifacts from the training job
info = sage.describe_training_job(TrainingJobName=job_name)
model_data = info['ModelArtifacts']['S3ModelArtifacts']
print(model_data)

# ---------------------------------------------------------

# Get the hosting image
containers = {'us-west-2': '433757028032.dkr.ecr.us-west-2.amazonaws.com/image-classification:latest',
              'us-east-1': '811284229777.dkr.ecr.us-east-1.amazonaws.com/image-classification:latest',
              'us-east-2': '825641698319.dkr.ecr.us-east-2.amazonaws.com/image-classification:latest',
              'eu-west-1': '685385470294.dkr.ecr.eu-west-1.amazonaws.com/image-classification:latest'}
hosting_image = containers[boto3.Session().region_name]

# ---------------------------------------------------------
# Put together the inference environment (the primary container)
primary_container = {
    'Image': hosting_image,
    'ModelDataUrl': model_data,
}

Now create the model; its runtime environment is the primary container defined above:

create_model_response =sage.create_model(
    ModelName = model_name,
    ExecutionRoleArn = role,
    PrimaryContainer = primary_container)

print(create_model_response['ModelArn']) 

 Now we have a ModelArn.

  

  • Create Endpoint Configuration

What is the configuration? It describes the hardware side: instance type, initial instance count, and so on.

import time

timestamp = time.strftime('-%Y-%m-%d-%H-%M-%S', time.gmtime())
endpoint_config_name = job_name_prefix + '-epc-' + timestamp

endpoint_config_response =sage.create_endpoint_config(
    EndpointConfigName = endpoint_config_name,
    ProductionVariants = [{
        'InstanceType':'ml.m4.xlarge',
        'InitialInstanceCount':1,
        'ModelName':model_name,
        'VariantName':'AllTraffic'}])

print('Endpoint configuration name: {}'.format(endpoint_config_name))
print('Endpoint configuration arn:  {}'.format(endpoint_config_response['EndpointConfigArn']))

 Now we have an EndpointConfigArn.

 

  • Create Endpoint

As you can see, the Endpoint references an EndpointConfig, and the EndpointConfig references the Model.

%%time
import time

timestamp = time.strftime('-%Y-%m-%d-%H-%M-%S', time.gmtime())
endpoint_name = job_name_prefix + '-ep-' + timestamp
endpoint_params = {
    'EndpointName': endpoint_name,
    'EndpointConfigName': endpoint_config_name,
}
endpoint_response = sage.create_endpoint(**endpoint_params)
print('EndpointArn = {}'.format(endpoint_response['EndpointArn'])) 

 Now we have an EndpointArn.
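
The endpoint takes a few minutes to reach the InService state, so it is worth waiting before invoking it. A minimal sketch using the standard boto3 waiter and the variables above:

# Block until the endpoint is InService (or fails)
sage.get_waiter('endpoint_in_service').wait(EndpointName=endpoint_name)

status = sage.describe_endpoint(EndpointName=endpoint_name)['EndpointStatus']
print('Endpoint status: {}'.format(status))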

 

3. Deploying an existing model

For example, when the model artifacts are already available on S3.

From t81_558_deep_learning/t81_558_class_13_02_cloud.ipynb (an older version).

The entry_point file "train.py" can be an empty Python file. AWS currently requires this step; however, AWS will likely remove this requirement at a later date.

from sagemaker.tensorflow.model import TensorFlowModel

sagemaker_model = TensorFlowModel(model_data = 's3://' + sagemaker_session.default_bucket() + '/model/model.tar.gz',
                                  role = role,
                                  framework_version = '1.12',
                                  entry_point = 'train.py')

%%time
predictor = sagemaker_model.deploy(initial_instance_count=1, instance_type='ml.m4.xlarge')

 

 

 

Endpoint for inference

Connect to the runtime service.

import boto3
runtime = boto3.Session().client(service_name='runtime.sagemaker') 

Locate the endpoint (i.e., the inference model) by its name.

import json
import numpy as np
with open(file_name, 'rb') as f:
    payload = f.read()
    payload = bytearray(payload)
    
#
# With the input payload prepared above, run the inference.
#
response = runtime.invoke_endpoint(EndpointName=endpoint_name,   # <---- 定位 endpoint
                                   ContentType='application/x-image', 
                                   Body=payload)
result = response['Body'].read()
# result is returned as JSON; convert it to an ndarray as shown below
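
A small sketch of that conversion, assuming the built-in image-classification endpoint, which returns a JSON list with one probability per class (json and numpy are already imported above):

probs = np.array(json.loads(result))        # one probability per class
predicted_class = int(np.argmax(probs))
print('Predicted class: {} (p = {:.3f})'.format(predicted_class, probs[predicted_class]))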

 

End.

posted @ 2020-11-16 15:16 by 郝壹贰叁