【LLMOps】Deploying QWen with Triton + TensorRT-LLM

Background

TensorRT-LLM is NVIDIA's official inference acceleration framework for large language models; at the moment it only provides customized acceleration for certain GPU models. The recently released Chat with RTX also runs local inference on top of TensorRT-LLM.

TensorRT-LLM supports PagedAttention, FlashAttention, SafeTensors, and other features, and some community benchmarks claim its throughput exceeds that of vLLM.

Preparation

  • GPU: A800
  • QWen-7B pretrained model

For the build image, it is best to build the latest one yourself. I tried the image provided by NVIDIA and found its version lagged behind; it also produced all kinds of incompatibilities, which are easy to mistake for problems with your own setup.

Getting Started

Converting the Weights

First, the QWen model has to be converted into the .engine weight format that TensorRT supports.

Environment Setup

Download the official TensorRT-LLM code: https://github.com/NVIDIA/TensorRT-LLM.git

Then edit TensorRT-LLM/docker/Dockerfile.multi so that it reads as follows:

 1 # Multi-stage Dockerfile
 2 ARG BASE_IMAGE=nvcr.io/nvidia/pytorch
 3 ARG BASE_TAG=23.10-py3
 4 
 5 FROM ${BASE_IMAGE}:${BASE_TAG} as base
 6 
 7 # https://www.gnu.org/software/bash/manual/html_node/Bash-Startup-Files.html
 8 # The default values come from `nvcr.io/nvidia/pytorch`
 9 ENV BASH_ENV=${BASH_ENV:-/etc/bash.bashrc}
10 ENV ENV=${ENV:-/etc/shinit_v2}
11 SHELL ["/bin/bash", "-c"]
12 
13 FROM base as devel
14 
15 COPY docker/common/install_base.sh install_base.sh
16 RUN bash ./install_base.sh && rm install_base.sh
17 
18 COPY cmake-3.24.4-linux-x86_64.tar.gz /tmp
19 COPY docker/common/install_cmake.sh install_cmake.sh
20 RUN bash ./install_cmake.sh && rm install_cmake.sh
21 
22 COPY docker/common/install_ccache.sh install_ccache.sh
23 RUN bash ./install_ccache.sh && rm install_ccache.sh
24 
25 # Download & install internal TRT release
26 ARG TRT_VER CUDA_VER CUDNN_VER NCCL_VER CUBLAS_VER
27 COPY docker/common/install_tensorrt.sh install_tensorrt.sh
28 RUN bash ./install_tensorrt.sh \
29     --TRT_VER=${TRT_VER} \
30     --CUDA_VER=${CUDA_VER} \
31     --CUDNN_VER=${CUDNN_VER} \
32     --NCCL_VER=${NCCL_VER} \
33     --CUBLAS_VER=${CUBLAS_VER} && \
34     rm install_tensorrt.sh
35 
36 # Install latest Polygraphy
37 COPY docker/common/install_polygraphy.sh install_polygraphy.sh
38 RUN bash ./install_polygraphy.sh && rm install_polygraphy.sh
39 
40 # Install mpi4py
41 COPY docker/common/install_mpi4py.sh install_mpi4py.sh
42 RUN bash ./install_mpi4py.sh && rm install_mpi4py.sh
43 
44 # Install PyTorch
45 ARG TORCH_INSTALL_TYPE="skip"
46 COPY docker/common/install_pytorch.sh install_pytorch.sh
47 RUN bash ./install_pytorch.sh $TORCH_INSTALL_TYPE && rm install_pytorch.sh
48 
49 FROM devel as wheel
50 WORKDIR /src/tensorrt_llm
51 COPY benchmarks benchmarks
52 COPY cpp cpp
53 COPY benchmarks benchmarks
54 COPY scripts scripts
55 COPY tensorrt_llm tensorrt_llm
56 COPY 3rdparty 3rdparty
57 COPY setup.py requirements.txt requirements-dev.txt ./
58 
59 RUN pip config set global.index-url https://mirrors.aliyun.com/pypi/simple/
60 
61 ARG BUILD_WHEEL_ARGS="--clean --trt_root /usr/local/tensorrt"
62 RUN python3 scripts/build_wheel.py ${BUILD_WHEEL_ARGS}
63 
64 FROM devel as release
65 
66 WORKDIR /app/tensorrt_llm
67 COPY --from=wheel /src/tensorrt_llm/build/tensorrt_llm*.whl .
68 COPY --from=wheel /src/tensorrt_llm/cpp/include/ include/
69 RUN pip install tensorrt_llm*.whl --extra-index-url https://pypi.nvidia.com && \
70     rm tensorrt_llm*.whl
71 COPY README.md ./
72 COPY examples examples

The main change is adding a pip mirror at line 59 (line 18 also copies in a pre-downloaded CMake tarball).

cd TensorRT-LLM/docker
make build

Run the commands above to build the image. In my case, the finished image is named tensorrt-llm:v3.

Starting the Container

docker run -it --gpus '"device=1"' --name trt-llm -v /home:/home tensorrt-llm:v3 bash
docker exec -it trt-llm bash

Running the Conversion

Inside the container:

cd examples/qwen
pip config set global.index-url https://pypi.tuna.tsinghua.edu.cn/simple
pip install -r requirements.txt

A tensorrt version conflict will be reported during installation; it can be ignored.

Run the conversion:

python3 build.py --hf_model_dir /home/Qwen-7b/ \
    --dtype bfloat16 \
    --paged_kv_cache \
    --use_gpt_attention_plugin bfloat16 \
    --enable_context_fmha \
    --use_gemm_plugin bfloat16 \
    --use_inflight_batching \
    --remove_input_padding \
    --output /home/trt_engines_qwen7b_bf16

Test:

python3 ../run.py --input_text "请你讲述一个故事" --max_output_len=64 --tokenizer_dir /home/Qwen-7b/ --engine_dir=/home/trt_engines_qwen7b_bf16

The test output is as follows:

/usr/local/lib/python3.10/dist-packages/tensorrt_llm/runtime/generation.py:881: UserWarning: The PyTorch API of nested tensors is in prototype stage and will change in the near future. (Triggered internally at /opt/pytorch/pytorch/aten/src/ATen/NestedTensorImpl.cpp:178.)
 torch.nested.nested_tensor(split_ids_list,
Input [Text 0]: "<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
hello<|im_end|>
<|im_start|>assistant
"
Output [Text 0 Beam 0]: "Hello! How can I help you today? Is there something you would like to talk about or ask me a question? I'm here to assist you with any information or advice you might need."

Inference

Building the Image

Download the Triton backend code: https://github.com/triton-inference-server/tensorrtllm_backend

There are some pitfalls here that I forgot to write down while building, so I will skip them. The final image: triton-trt-llm:v3.0
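
For reference, the upstream tensorrtllm_backend README documented roughly the following build procedure at the time. The Dockerfile path and the image tag below are assumptions taken from that README, not the exact steps used in this post, and may differ between versions:

cd tensorrtllm_backend
# pull in the TensorRT-LLM submodule that the backend is compiled against
git lfs install
git submodule update --init --recursive
# build the Triton + TensorRT-LLM backend image from the repo's Dockerfile
DOCKER_BUILDKIT=1 docker build -t triton-trt-llm:v3.0 -f dockerfile/Dockerfile.trt_llm_backend .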

Starting the Service

Go into the directory and run the following.

Copy tensorrtllm_backend/all_models/inflight_batcher_llm into /home/tensorrtllm_backend/model_repository.
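
A minimal sketch of that copy, assuming the backend repo was cloned under /home/tensorrtllm_backend (adjust the paths to your own layout):

cd /home/tensorrtllm_backend
mkdir -p model_repository
# copy the example ensemble / preprocessing / tensorrt_llm / postprocessing models
cp -r all_models/inflight_batcher_llm/* model_repository/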

python3 tools/fill_template.py -i /home/tensorrtllm_backend/model_repository/tensorrt_llm/config.pbtxt \
    triton_max_batch_size:64,decoupled_mode:False,max_beam_width:1,engine_dir:/tensorrtllm_backend/model_repository/tensorrt_llm/1,max_tokens_in_paged_kv_cache:2560,max_attention_window_size:2560,kv_cache_free_gpu_mem_fraction:0.5,exclude_input_in_output:True,enable_kv_cache_reuse:False,batching_strategy:inflight_batching,max_queue_delay_microseconds:600

In addition, the following parameters were set (one way of applying them is sketched after this list):

  • batch_scheduler_policy set to guaranteed_no_evict
  • enable_trt_overlap set to False
  • max_num_sequences set to the same value as the batch size
  • normalize_log_probs set to False
  • gpt_model_type set to V1
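
A sketch of one way to apply these values with the same fill_template.py script, assuming the corresponding ${placeholders} exist in your copy of tensorrt_llm/config.pbtxt (in some versions of the backend, a few of them must instead be edited directly in the parameters blocks of that file):

python3 tools/fill_template.py -i /home/tensorrtllm_backend/model_repository/tensorrt_llm/config.pbtxt \
    batch_scheduler_policy:guaranteed_no_evict,enable_trt_overlap:False,max_num_sequences:64,normalize_log_probs:False,gpt_model_type:V1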

Additionally, modify model.py in both the preprocessing and postprocessing models (around line 81) to add self.tokenizer.eos_token = "<|endoftext|>".
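
Note that the preprocessing and postprocessing configs also need their tokenizer settings filled in. Below is a sketch with fill_template.py, assuming the placeholder names used in the example configs of that era (the names vary between versions, and tokenizer_dir must be a path that is visible from inside the serving container):

python3 tools/fill_template.py -i /home/tensorrtllm_backend/model_repository/preprocessing/config.pbtxt \
    tokenizer_dir:/home/Qwen-7b/,triton_max_batch_size:64,preprocessing_instance_count:1
python3 tools/fill_template.py -i /home/tensorrtllm_backend/model_repository/postprocessing/config.pbtxt \
    tokenizer_dir:/home/Qwen-7b/,triton_max_batch_size:64,postprocessing_instance_count:1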

Start the triton-trt-llm container:

docker run --rm -it --gpus '"device=1"' --shm-size=2g --ulimit memlock=-1 --ulimit stack=67108864 -p 18000:8000 -v /home/triton-trtllm/:/tensorrtllm_backend triton-trt-llm:v3.0 bash

Launch the server:

pip install tiktoken
cd /tensorrtllm_backend/tensorrtllm_backend
# --world_size is the number of GPUs you want to use for serving
python3 scripts/launch_triton_server.py --world_size=1 --model_repo=/tensorrtllm_backend/model_repository
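
Once the server starts, readiness can be checked with Triton's standard health endpoint (through the 18000 port mapping from the host; use port 8000 from inside the container):

curl -s -o /dev/null -w "%{http_code}\n" http://localhost:18000/v2/health/ready
# 200 means all models in the repository loaded successfully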

Calling the API

curl --location 'http://localhost:18000/v2/models/ensemble/generate' \
--header 'Content-Type: application/json' \
--data '{
   "text_input": "What is machine learning?",
   "max_tokens": 64,
   "bad_words": "",
   "stop_words": ""
}'

Performance

In actual testing on the A800, throughput was roughly half that of vLLM, and response time showed no obvious improvement either. Perhaps the A800 really does differ a lot from the A100.

 

Other:

A list of acceleration mirrors available in China: https://www.nenufm.com/dorthl/291/
