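Deploying a TensorRT model exported by mmdeploy on Triton: the failure edition
A record of the journey. The deployment did not succeed in the end, most likely because of the Ubuntu system version. I have no time to keep working on it now, so I am writing things down and will fill in the gaps when I need this again.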
Triton demo
git clone -b r22.06 https://github.com/triton-inference-server/server.git
cd server/docs/examples
./fetch_models.sh
# Container 1: start the Triton server
docker run --gpus=1 --rm --net=host -v /home/xbsj/gaoying/triton/triton_demo/server/docs/examples/model_repository:/models nvcr.io/nvidia/tritonserver:22.06-py3 tritonserver --model-repository=/models
# Container 2: enter the SDK container to send requests
docker run -it --rm --net=host nvcr.io/nvidia/tritonserver:22.06-py3-sdk
# Inside container 2: send a request
/workspace/install/bin/image_client -m densenet_onnx -c 3 -s INCEPTION /workspace/images/mug.jpg
Installing Triton and starting the server (docker)
Which CUDA and TensorRT versions each Triton container ships with: Release Notes :: NVIDIA Deep Learning Triton Inference Server Documentation
More detail is here: Frameworks Support Matrix :: NVIDIA Deep Learning Frameworks Documentation
1️⃣ Installing Triton
Pull the Docker image. 20.11 is the release version; pick one here: Triton Inference Server (Formerly TensorRT inference Server) | NVIDIA NGC
Create a file named Dockerfile.triton with the following content:
FROM nvcr.io/nvidia/tritonserver:20.11-py3
# RUN ...   (add extra build steps here if needed; a bare RUN with no command fails the build)
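Save and exit, then run the command below to build the Triton Docker image. Creating Dockerfile.triton first has two advantages: the image can be tagged with a short name such as triton:2011, which is easy to find later, and any further customization of the Triton image can simply be added to Dockerfile.triton.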
nvidia-docker build -f Dockerfile.triton -t triton:2011 .
2️⃣ Writing the model configuration
Create a local directory that will be mounted into the Docker container.
Layout of the mounted directory:
.
└── model_rep                 # root directory on the host that gets mounted
    ├── demo1                 # model 1
    │   ├── 1                 # model version
    │   │   └── model.pt      # model file
    │   ├── 2                 # model version
    │   │   └── model.pt      # model file
    │   └── config.pbtxt
    └── demo2                 # model 2
        ├── 1
        │   └── model.pt
        └── config.pbtxt
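The directory skeleton above can also be created with a short script. This is only a sketch: the helper name make_model_repo is made up for illustration, and it creates empty config.pbtxt files that still have to be filled in, while the model files themselves have to be copied in by hand.

```python
import tempfile
from pathlib import Path


def make_model_repo(root, models):
    """Create <root>/<model>/<version>/ directories plus an empty config.pbtxt.

    `models` maps a model name to a list of integer versions. The model files
    (model.pt, end2end.engine, ...) are NOT created here.
    """
    root = Path(root)
    for name, versions in models.items():
        for version in versions:
            (root / name / str(version)).mkdir(parents=True, exist_ok=True)
        (root / name / "config.pbtxt").touch()
    return root


if __name__ == "__main__":
    # Build the layout in a throwaway directory and list what was created.
    with tempfile.TemporaryDirectory() as tmp:
        repo = make_model_repo(Path(tmp) / "model_rep", {"demo1": [1, 2], "demo2": [1]})
        for path in sorted(repo.rglob("*")):
            print(path.relative_to(repo))
```

For real use, point the first argument at the host directory that is later mounted into the container.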
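Writing the model configuration file
Below is an ONNX model opened in Netron. It shows the names and types of the inputs and outputs, and we edit input and output in the configuration file to match. Shown are the ONNX model file for faster_rcnn_r50_trt and its configuration file.
This is the config.pbtxt for the model above: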
name: "faster_rcnn_r50_trt"      # model name, must match the directory name
platform: "tensorrt_plan"        # platform of the model, see the table further down
max_batch_size : 8               # largest batch size that can be sent to the model at once
input [
  {
    name: "input"
    data_type: TYPE_FP32
    dims: [ 3,-1,-1 ]            # the first dimension is the batch size and is implied,
                                 # so dims starts at the second dimension
                                 # use -1 for a variable-size dimension
  }
]
output [
  {
    name: "dets"
    data_type: TYPE_FP32
    dims: [-1,-1]
  },
  {
    name: "labels"
    data_type: TYPE_INT32
    dims: [ -1 ]
  }
]
default_model_filename: "end2end.engine"
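To make the dims convention concrete, here is a small sketch of the check a request shape has to pass against this configuration: the batch dimension comes first and is bounded by max_batch_size, and -1 matches any size. The function shape_matches is a hypothetical helper written for this post, not part of Triton, and it assumes max_batch_size is greater than 0.

```python
def shape_matches(shape, dims, max_batch_size):
    """Return True if a request tensor shape is compatible with config.pbtxt.

    shape: full shape sent by the client, batch dimension included.
    dims: the `dims` list from config.pbtxt (no batch dimension), -1 = any size.
    max_batch_size: the value from config.pbtxt, assumed > 0 here.
    """
    if len(shape) != len(dims) + 1:
        return False
    batch, rest = shape[0], shape[1:]
    if not 1 <= batch <= max_batch_size:
        return False
    return all(d == -1 or d == s for d, s in zip(dims, rest))


if __name__ == "__main__":
    dims = [3, -1, -1]
    print(shape_matches([1, 3, 427, 640], dims, 8))  # True
    print(shape_matches([9, 3, 427, 640], dims, 8))  # False: batch > max_batch_size
    print(shape_matches([3, 427, 640], dims, 8))     # False: batch dimension missing
```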
Framework to platform mapping:
Framework | platform |
---|---|
TensorRT | tensorrt_plan |
TensorFlow SavedModel | tensorflow_savedmodel |
TensorFlow GraphDef | tensorflow_graphdef |
ONNX | onnxruntime_onnx |
Torch | pytorch_libtorch |
Input/output data_type mapping:
Model Config | TensorRT | TensorFlow | ONNX Runtime | PyTorch | API | NumPy |
---|---|---|---|---|---|---|
TYPE_BOOL | kBOOL | DT_BOOL | BOOL | kBool | BOOL | bool |
TYPE_UINT8 | | DT_UINT8 | UINT8 | kByte | UINT8 | uint8 |
TYPE_UINT16 | | DT_UINT16 | UINT16 | | UINT16 | uint16 |
TYPE_UINT32 | | DT_UINT32 | UINT32 | | UINT32 | uint32 |
TYPE_UINT64 | | DT_UINT64 | UINT64 | | UINT64 | uint64 |
TYPE_INT8 | kINT8 | DT_INT8 | INT8 | kChar | INT8 | int8 |
TYPE_INT16 | | DT_INT16 | INT16 | kShort | INT16 | int16 |
TYPE_INT32 | kINT32 | DT_INT32 | INT32 | kInt | INT32 | int32 |
TYPE_INT64 | | DT_INT64 | INT64 | kLong | INT64 | int64 |
TYPE_FP16 | kHALF | DT_HALF | FLOAT16 | | FP16 | float16 |
TYPE_FP32 | kFLOAT | DT_FLOAT | FLOAT | kFloat | FP32 | float32 |
TYPE_FP64 | | DT_DOUBLE | DOUBLE | kDouble | FP64 | float64 |
TYPE_STRING | | DT_STRING | STRING | | BYTES | dtype(object) |
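On the client side, the columns that matter most are NumPy and API, because the datatype string in a request must agree with the array being sent. The lookup below is a sketch that simply copies those two columns; the names NUMPY_TO_TRITON and triton_dtype are made up for this post. The tritonclient package ships a comparable helper, tritonclient.utils.np_to_triton_dtype, which is probably the better choice in real code.

```python
# NumPy dtype name -> datatype string used in Triton's inference API,
# copied from the "NumPy" and "API" columns of the table above.
NUMPY_TO_TRITON = {
    "bool": "BOOL",
    "uint8": "UINT8",
    "uint16": "UINT16",
    "uint32": "UINT32",
    "uint64": "UINT64",
    "int8": "INT8",
    "int16": "INT16",
    "int32": "INT32",
    "int64": "INT64",
    "float16": "FP16",
    "float32": "FP32",
    "float64": "FP64",
    "object": "BYTES",
}


def triton_dtype(numpy_dtype_name):
    """Translate a NumPy dtype name (e.g. str(arr.dtype)) to the API datatype."""
    try:
        return NUMPY_TO_TRITON[numpy_dtype_name]
    except KeyError:
        raise ValueError("no Triton datatype for NumPy dtype %r" % numpy_dtype_name)


if __name__ == "__main__":
    print(triton_dtype("float32"))  # FP32
    print(triton_dtype("int32"))    # INT32
```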
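3️⃣ Starting the server
🔸 Start the container and the server in one command:
--gpus all makes all GPUs available to the container
-v /home/xbsj/gaoying/triton/model_rep:/model_rep maps the local model directory into the container
8000 is the HTTP port, 8001 the gRPC port and 8002 the metrics port
The image (triton:2201 in the command below, or an NGC image such as nvcr.io/nvidia/tritonserver:21.11-py3): remember to change the version to your own.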
docker run --gpus all -p8000:8000 -p8001:8001 -p8002:8002 -v /home/xbsj/gaoying/triton/model_rep:/model_rep -v /home/xbsj/gaoying/triton/plugin_rep:/plugin_rep --env LD_PRELOAD=/plugin_rep/libmmdeploy_tensorrt_ops.so triton:2201 tritonserver --model-repository=/model_rep
🔸 Alternatively, enter the container first and then start the server
docker run --gpus=all --network=host --shm-size=2g -v /home/xbsj/gaoying/triton/model_rep/:/models -it nvcr.io/nvidia/tritonserver:21.04-py3 # enter the container
./bin/tritonserver --model-repository=/models # start triton
docker run --gpus=all --network=host -v /home/xbsj/gaoying/triton/model_rep:/opt/ml/model -it triton:2104 # enter the container
./bin/tritonserver --model-repository=/opt/ml/model # start triton (the path must match the -v mount target)
Testing the server from a client
1️⃣ Testing from the command line
Check whether the server is ready by running this on the host:
curl -v localhost:8000/v2/health/ready
Output on success:
*   Trying 127.0.0.1...
* TCP_NODELAY set
* Connected to localhost (127.0.0.1) port 8000 (#0)
> GET /v2/health/ready HTTP/1.1
> Host: localhost:8000
> User-Agent: curl/7.58.0
> Accept: */*
>
< HTTP/1.1 200 OK
< Content-Length: 0
< Content-Type: text/plain
<
* Connection #0 to host localhost left intact
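The same readiness check can be done from Python with only the standard library, which is handy in scripts that need to wait for the server before sending requests. This is a sketch: server_ready is a name invented here, and the only thing it relies on is the /v2/health/ready endpoint used by the curl command above returning HTTP 200 when the server is ready.

```python
import urllib.error
import urllib.request


def server_ready(base_url="http://localhost:8000", timeout=2.0):
    """Return True if GET <base_url>/v2/health/ready answers with HTTP 200."""
    try:
        with urllib.request.urlopen(base_url + "/v2/health/ready", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx status all count as "not ready".
        return False


if __name__ == "__main__":
    print("ready" if server_ready() else "not ready")
```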
2️⃣ Testing with the triton client
🔸 grpc
Result: ten iterations of faster rcnn r50 over gRPC took 1.0688064098358154 seconds.
import os
import time

import numpy as np
import tritonclient.grpc as grpcclient
from PIL import Image


def client_init(url="localhost:8001",
                ssl=False, private_key=None, root_certificates=None, certificate_chain=None,
                verbose=False):
    triton_client = grpcclient.InferenceServerClient(
        url=url,
        verbose=verbose,
        ssl=ssl,
        root_certificates=root_certificates,
        private_key=private_key,
        certificate_chain=certificate_chain)
    return triton_client


def infer_faster_rcnn_r50_trt_grpc(triton_client, model_name, input_name='input', dets='dets', labels='labels',
                                   compression_algorithm=None):
    inputs = []
    outputs = []
    # Declare the input tensor (name, shape, datatype)
    inputs.append(grpcclient.InferInput(input_name, [1, 3, 427, 640], "FP32"))
    # Fill the input tensor with data
    root_dir = os.getcwd()
    img_path = os.path.join(root_dir, 'demo.jpg')  # put an image named demo.jpg in the working directory
    img = np.array(Image.open(img_path))
    img = img.astype(np.float32)
    img = img.transpose((2, 0, 1))  # HWC -> CHW
    img = np.expand_dims(img, axis=0)  # (1, 3, 427, 640)
    inputs[0].set_data_from_numpy(img)
    # Declare the requested outputs
    outputs.append(grpcclient.InferRequestedOutput(dets))
    outputs.append(grpcclient.InferRequestedOutput(labels))
    results = triton_client.infer(
        model_name=model_name,
        inputs=inputs,
        outputs=outputs,
        compression_algorithm=compression_algorithm
        # client_timeout=0.1
    )
    # print('=' * 50)
    print(results)
    # print('=' * 50)
    # # Convert the outputs to numpy arrays
    # print(results.as_numpy(dets))
    # print('=' * 50)
    # print(results.as_numpy(labels))
    # print('=' * 50)


if __name__ == '__main__':
    client = client_init()
    st = time.time()
    for i in range(10):
        infer_faster_rcnn_r50_trt_grpc(triton_client=client, model_name='faster_rcnn_r50_trt')
    print("grpc faster rcnn r50, 10 iterations took: {}".format(time.time() - st))
🔸 http
Result: ten iterations of faster rcnn r50 over HTTP took 1.1643376350402832 seconds.
import os
import time

import gevent.ssl
import numpy as np
import tritonclient.http as httpclient
from PIL import Image


def client_init(url="localhost:8000",
                ssl=False, key_file=None, cert_file=None, ca_certs=None, insecure=False,
                verbose=False):
    if ssl:
        ssl_options = {}
        if key_file is not None:
            ssl_options['keyfile'] = key_file
        if cert_file is not None:
            ssl_options['certfile'] = cert_file
        if ca_certs is not None:
            ssl_options['ca_certs'] = ca_certs
        ssl_context_factory = None
        if insecure:
            ssl_context_factory = gevent.ssl._create_unverified_context
        triton_client = httpclient.InferenceServerClient(
            url=url,
            verbose=verbose,
            ssl=True,
            ssl_options=ssl_options,
            insecure=insecure,
            ssl_context_factory=ssl_context_factory)
    else:
        triton_client = httpclient.InferenceServerClient(
            url=url, verbose=verbose)
    return triton_client


def infer_faster_rcnn_r50_trt_http(triton_client, model_name='faster_rcnn_r50_trt',
                                   input_name='input', output0='dets', output1='labels',
                                   request_compression_algorithm=None,
                                   response_compression_algorithm=None):
    inputs = []
    outputs = []
    # Declare the input tensor (name, shape, datatype)
    inputs.append(httpclient.InferInput(input_name, [1, 3, 427, 640], "FP32"))
    # Fill the input tensor with data
    root_dir = os.getcwd()
    img_path = os.path.join(root_dir, 'demo.jpg')  # put an image named demo.jpg in the working directory
    img = np.array(Image.open(img_path))
    img = img.astype(np.float32)
    img = img.transpose((2, 0, 1))  # HWC -> CHW
    img = np.expand_dims(img, axis=0)  # (1, 3, 427, 640)
    inputs[0].set_data_from_numpy(img)
    # output0 and output1 are the output names from config.pbtxt
    outputs.append(httpclient.InferRequestedOutput(output0, binary_data=False))
    outputs.append(httpclient.InferRequestedOutput(output1, binary_data=False))
    results = triton_client.infer(
        model_name=model_name,
        inputs=inputs,
        outputs=outputs,
        request_compression_algorithm=request_compression_algorithm,
        response_compression_algorithm=response_compression_algorithm)
    # print('=' * 50)
    print(results)
    # print('=' * 50)
    # # Convert the outputs to numpy arrays
    # print(results.as_numpy(output0))
    # print('=' * 50)
    # print(results.as_numpy(output1))
    # print('=' * 50)


if __name__ == '__main__':
    triton_client = client_init()
    st = time.time()
    for i in range(10):
        infer_faster_rcnn_r50_trt_http(triton_client)
    print("http faster rcnn r50, 10 iterations took: {}".format(time.time() - st))
3️⃣ Testing with requests
Result: ten iterations of faster rcnn r50 through requests took 3.843385934829712 seconds.
import os
import time

import numpy as np
import requests
from PIL import Image


def infer_demo_torch_http():
    url = 'http://localhost:8000/v2/models/demo_torch/versions/1/infer'
    data = {
        "inputs": [{
            "name": "input__0",
            "shape": [2, 3],
            "datatype": "INT64",
            "data": [[1, 2, 3], [4, 5, 6]]
        }],
        "outputs": [{"name": "output__0"}, {"name": "output__1"}]
    }
    headers = {'Content-Type': 'application/json'}
    res = requests.post(url, json=data, headers=headers).json()
    print(res)


def infer_demo_onnx_http():
    url = 'http://localhost:8000/v2/models/demo_onnx/versions/1/infer'
    data = {
        "inputs": [{
            "name": "INPUT0",
            "shape": [8, 2],
            "datatype": "FP32",
            "data": [[0.1] * 2 for _ in range(8)]
        }, {
            "name": "INPUT1",
            "shape": [8, 2],
            "datatype": "INT32",
            "data": [[1] * 2 for _ in range(8)]
        }],
        "outputs": [{"name": "OUTPUT0"}, {"name": "OUTPUT1"}]
    }
    headers = {'Content-Type': 'application/json'}
    res = requests.post(url, json=data, headers=headers).json()
    print(res)


def infer_faster_rcnn_r50_onnx_http():
    root_dir = os.getcwd()
    img_path = os.path.join(root_dir, 'demo.jpg')
    img = np.array(Image.open(img_path))
    img = img.astype(np.float32)
    img = img.transpose((2, 0, 1))  # HWC -> CHW
    img = np.expand_dims(img, axis=0)  # (1, 3, 427, 640)
    # img = np.repeat(img, repeats=2, axis=0)  # (2, 3, 427, 640)
    img = img.tolist()  # JSON cannot carry numpy arrays, so convert to nested lists
    url = 'http://localhost:8000/v2/models/faster_rcnn_r50_onnx/versions/1/infer'
    data = {
        "inputs": [{
            "name": "input",
            "shape": [1, 3, 427, 640],
            "datatype": "FP32",
            "data": img
        }, ],
        "outputs": [{"name": "dets"}, {"name": "labels"}]
    }
    headers = {'Content-Type': 'application/json'}
    res = requests.post(url, json=data, headers=headers).json()
    print(res)


def infer_faster_rcnn_r50_trt_http():
    root_dir = os.getcwd()
    img_path = os.path.join(root_dir, 'demo.jpg')
    img = np.array(Image.open(img_path))
    img = img.astype(np.float32)
    img = img.transpose((2, 0, 1))  # HWC -> CHW
    img = np.expand_dims(img, axis=0)  # (1, 3, 427, 640)
    img = img.tolist()  # JSON cannot carry numpy arrays, so convert to nested lists
    url = 'http://localhost:8000/v2/models/faster_rcnn_r50_trt/versions/1/infer'
    data = {
        "inputs": [{
            "name": "input",
            "shape": [1, 3, 427, 640],
            "datatype": "FP32",
            "data": img
        }, ],
        "outputs": [{"name": "dets"}, {"name": "labels"}]
    }
    headers = {'Content-Type': 'application/json'}
    res = requests.post(url, json=data, headers=headers).json()
    print(res)


if __name__ == "__main__":
    print('=' * 50)
    print('| Infer demo_torch')
    print('_' * 20)
    infer_demo_torch_http()
    print('=' * 50)
    print('| Infer demo_onnx')
    print('_' * 20)
    infer_demo_onnx_http()
    print('=' * 50)
    print('| Infer faster_rcnn_r50_onnx')
    print('_' * 20)
    infer_faster_rcnn_r50_onnx_http()
    print('=' * 50)
    print('| Infer faster_rcnn_r50_trt')
    print('_' * 20)
    st = time.time()
    for _ in range(10):
        infer_faster_rcnn_r50_trt_http()
    print("requests faster rcnn r50, 10 iterations took: {}".format(time.time() - st))
    print('=' * 50)
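A note on the three timings: each is the total for ten calls, and each call also loads and converts demo.jpg (plus the tolist() conversion in the requests version), so they are only a rough comparison. A slightly more careful measurement would time each call separately and skip a warm-up call. The helper below is a sketch of that idea; time_calls is a name made up here, and the sleep is a stand-in for a real inference call.

```python
import statistics
import time


def time_calls(fn, n=10, warmup=1):
    """Call fn() `warmup` times untimed, then `n` times timed.

    Returns a list with the duration of each timed call, in seconds.
    """
    for _ in range(warmup):
        fn()
    durations = []
    for _ in range(n):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return durations


if __name__ == "__main__":
    # Stand-in workload; replace with e.g. infer_faster_rcnn_r50_trt_http.
    samples = time_calls(lambda: time.sleep(0.01), n=10)
    print("total  : {:.4f} s".format(sum(samples)))
    print("median : {:.4f} s".format(statistics.median(samples)))
    print("max    : {:.4f} s".format(max(samples)))
```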
Load-testing Triton
First prepare the input data, input.json:
{
    "inputs": [{
        "name": "input__0",
        "shape": [2, 3],
        "datatype": "INT64",
        "data": [[1, 2, 3], [4, 5, 6]]
    }],
    "outputs": [{"name": "output__0"}, {"name": "output__1"}]
}
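For larger inputs it is easier to generate this file than to type it. The sketch below builds the same request body and checks that the number of values agrees with the declared shape; build_infer_request and _flatten are names invented for this post.

```python
import json
from math import prod


def _flatten(values):
    """Flatten arbitrarily nested lists into one flat list."""
    out = []
    for v in values:
        if isinstance(v, list):
            out.extend(_flatten(v))
        else:
            out.append(v)
    return out


def build_infer_request(name, shape, datatype, data, output_names):
    """Build the JSON body for POST /v2/models/<model>/versions/<v>/infer."""
    count = len(_flatten(data))
    if count != prod(shape):
        raise ValueError("shape %s needs %d values, got %d" % (shape, prod(shape), count))
    return {
        "inputs": [{"name": name, "shape": list(shape), "datatype": datatype, "data": data}],
        "outputs": [{"name": n} for n in output_names],
    }


if __name__ == "__main__":
    body = build_infer_request(
        "input__0", [2, 3], "INT64", [[1, 2, 3], [4, 5, 6]], ["output__0", "output__1"])
    print(json.dumps(body, indent=4))
```

Redirecting the output to a file gives the input.json used by the load test.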
Install the package that provides ab:
sudo apt install apache2-utils
Load-test command:
ab -k -c 5 -n 500 -p input.json http://localhost:8000/v2/models/demo/versions/1/infer
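The command keeps 5 requests in flight at a time (-c 5) until 500 requests have been sent in total (-n 500), reusing connections through HTTP keep-alive (-k). The POST body is input.json (-p), and the target is version 1 of the demo model.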
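If ab is not available, roughly the same pattern (a fixed number of requests spread over a few concurrent workers) can be sketched in Python with a thread pool. This is not a replacement for ab: it does not model keep-alive, and load_test is a made-up helper. The fake_request function is a stand-in so the sketch runs without a server; in practice it would POST input.json to the infer URL and return whether the call succeeded.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def load_test(send_request, total=500, concurrency=5):
    """Call send_request() `total` times using `concurrency` worker threads.

    send_request must return True on success.
    Returns (number_of_successes, elapsed_seconds).
    """
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        results = list(pool.map(lambda _: send_request(), range(total)))
    elapsed = time.perf_counter() - start
    return sum(1 for r in results if r), elapsed


if __name__ == "__main__":
    # Stand-in for a real HTTP call so the sketch runs without a server.
    def fake_request():
        time.sleep(0.001)
        return True

    ok, elapsed = load_test(fake_request, total=500, concurrency=5)
    print("{} / 500 succeeded in {:.2f} s ({:.1f} req/s)".format(ok, elapsed, 500 / elapsed))
```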
Triton error collection:
⚠️ INVALID_ARGUMENT: getPluginCreator could not find plugin TRTBatchedNMS version 1
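A TensorRT model converted inside the mmdeploy docker cannot be loaded in the triton docker and fails with the errors below. (Triton's log output is long, and at first I could not tell which part was the actual error. A tip: the lines that start with E are the errors.)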
E0630 01:31:22.566631 1 logging.cc:43] INVALID_ARGUMENT: getPluginCreator could not find plugin TRTBatchedNMS version 1
E0630 01:31:22.566657 1 logging.cc:43] safeDeserializationUtils.cpp (322) - Serialization Error in load: 0 (Cannot deserialize plugin since corresponding IPluginCreator not found in Plugin Registry)
E0630 01:31:22.566739 1 logging.cc:43] INVALID_STATE: std::exception
E0630 01:31:22.572629 1 logging.cc:43] INVALID_CONFIG: Deserialize the cuda engine failed.
E0630 01:31:22.587565 1 model_repository_manager.cc:1215] failed to load 'faster_rcnn_r50_tensorrt' version 1: Internal: unable to create TensorRT engine
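Picking out the E lines can be automated when the log is long. The sketch below assumes the line layout seen above (severity letter, date, time, thread id, source file and line, then the message) and keeps only lines with severity E or F; error_lines and LOG_LINE are names chosen for this post. The sample text reuses two of the log lines shown above.

```python
import re

# Layout assumed from the log above:
# severity letter + MMDD, time, thread id, file:line, "]", message
LOG_LINE = re.compile(r"^([IWEF])(\d{4}) (\d{2}:\d{2}:\d{2}\.\d+)\s+(\d+)\s+(\S+)\] (.*)$")


def error_lines(log_text):
    """Return (source, message) for every log line whose severity is E or F."""
    found = []
    for line in log_text.splitlines():
        m = LOG_LINE.match(line.strip())
        if m and m.group(1) in "EF":
            found.append((m.group(5), m.group(6)))
    return found


if __name__ == "__main__":
    sample = "\n".join([
        "E0630 01:31:22.566631 1 logging.cc:43] INVALID_ARGUMENT: getPluginCreator could not find plugin TRTBatchedNMS version 1",
        "(any other output is ignored)",
        "E0630 01:31:22.572629 1 logging.cc:43] INVALID_CONFIG: Deserialize the cuda engine failed.",
    ])
    for source, message in error_lines(sample):
        print(source, "->", message)
```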
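🔸 Method 1 (recommended)
Reference: the blog post "YOLO model deployment: TensorRT model acceleration + Triton server model deployment"
Run the command below directly (adjust the paths and image to your own setup):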
docker run --gpus all -p8000:8000 -p8001:8001 -p8002:8002 -v /home/xbsj/gaoying/triton/model_rep:/model_rep -v /home/xbsj/gaoying/triton/plugin_rep:/plugin_rep --env LD_PRELOAD=/plugin_rep/libmmdeploy_tensorrt_ops.so triton:2104 tritonserver --model-repository=/model_rep
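As far as I understand it, this works because LD_PRELOAD makes the dynamic loader load libmmdeploy_tensorrt_ops.so before tritonserver starts, so the custom TensorRT plugins in it (such as TRTBatchedNMS) are already registered when the engine is deserialized. Since the command is long and the paths differ per machine, it can be convenient to assemble it in a script. The sketch below only builds and prints the command; triton_command is a made-up helper and the paths are the ones used in this post.

```python
import shlex


def triton_command(image, model_dir, plugin_dir, plugin_name="libmmdeploy_tensorrt_ops.so"):
    """Build the `docker run` argument list used in Method 1.

    model_dir and plugin_dir are host paths; they are mounted at /model_rep
    and /plugin_rep inside the container, as in the command above.
    """
    return [
        "docker", "run", "--gpus", "all",
        "-p8000:8000", "-p8001:8001", "-p8002:8002",
        "-v", "{}:/model_rep".format(model_dir),
        "-v", "{}:/plugin_rep".format(plugin_dir),
        "--env", "LD_PRELOAD=/plugin_rep/{}".format(plugin_name),
        image, "tritonserver", "--model-repository=/model_rep",
    ]


if __name__ == "__main__":
    cmd = triton_command(
        "triton:2104",
        "/home/xbsj/gaoying/triton/model_rep",
        "/home/xbsj/gaoying/triton/plugin_rep",
    )
    # Print a copy-pasteable shell command; pass `cmd` to subprocess.run to execute it.
    print(shlex.join(cmd))
```

One thing worth checking if the plugin still fails to load: the library probably needs to be built against the same TensorRT version that the Triton image ships.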
🔸 Method 2
Source of this fix: end2end.engine to Triton · Issue #465 · open-mmlab/mmdeploy (github.com)
Steps (I tried this and it did not work; I probably did something wrong):
1️⃣ Copy /root/workspace/mmdeploy/build/lib/libmmdeploy_tensorrt_ops.so
from the mmdeploy docker image into /opt/tritonserver/lib/ in the triton docker image
docker run --gpus=all --network=host -v /home/xbsj/gaoying/triton/model_rep:/opt/ml/model -it triton:2104 # run on the host: enter the triton container without starting the server
docker ps # run on the host: look up the id of the triton container
docker cp /data/imagetd/xbsj/gaoying//mmdeploy_out/libmmdeploy_tensorrt_ops.so 7725e367f0f0:/opt/tritonserver/lib/libmmdeploy_tensorrt_ops.so # copy the file, host -> triton container
2️⃣ Add LD_PRELOAD=libmmdeploy_tensorrt_ops.so near the end of /bin/serve, just before the tritonserver command.
vim /bin/serve
Add the following at line 105:
LD_PRELOAD=libmmdeploy_tensorrt_ops.so
Start the server:
./bin/tritonserver --model-store=/models
⚠️ ImportError: cannot import name 'ORTWrapper' from 'mmdeploy.backend.onnxruntime' (/data/imagetd/xbsj/gaoying/mmdeploy/mmdeploy/backend/onnxruntime/__init__.py)
Source of this fix: Bug using ORTwrapper · Issue #37 · open-mmlab/mmdeploy (github.com)
🔸 Fix
In mmdeploy/codebase/mmdet/core/post_processing/bbox_nms.py::select_nms_index,
change return batched_dets, batched_labels
to return batched_dets[:, 0:-1, :], batched_labels[:, 0:-1]
This may fix the bug.
Then run
python setup.py install
and convert the model again afterwards.
⚠️ Some tactics do not have sufficient workspace memory to run. Increasing workspace size may increase performance, please check verbose output.
For a fix, see: Bug using ORTwrapper · Issue #37 · open-mmlab/mmdeploy (github.com)