[LLMOps] Deploying Qwen with Triton + TensorRT-LLM
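Background
TensorRT-LLM is NVIDIA's official inference acceleration framework for large language models. At present it provides tailored acceleration only for certain GPU models. The recently released Chat with RTX also runs its local inference on TensorRT-LLM.
TensorRT-LLM supports techniques such as PagedAttention, FlashAttention, and SafeTensors, and some community benchmarks claim that its throughput exceeds that of vLLM.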
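Preparation
- GPU: A800
- Qwen-7B pretrained model
It is best to build the latest image yourself. I tried the image provided by NVIDIA and found that its version lags behind. Using it also led to a range of incompatibilities that are easy to mistake for errors in your own steps.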
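Getting started
Converting the weights
First, the Qwen model has to be converted into weight files in the .engine format that TensorRT supports.
Building the environment
Download the official TensorRT-LLM code: https://github.com/NVIDIA/TensorRT-LLM.git
Then edit TensorRT-LLM/docker/Dockerfile.multi so that it reads as follows: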
# Multi-stage Dockerfile
ARG BASE_IMAGE=nvcr.io/nvidia/pytorch
ARG BASE_TAG=23.10-py3

FROM ${BASE_IMAGE}:${BASE_TAG} as base

# https://www.gnu.org/software/bash/manual/html_node/Bash-Startup-Files.html
# The default values come from `nvcr.io/nvidia/pytorch`
ENV BASH_ENV=${BASH_ENV:-/etc/bash.bashrc}
ENV ENV=${ENV:-/etc/shinit_v2}
SHELL ["/bin/bash", "-c"]

FROM base as devel

COPY docker/common/install_base.sh install_base.sh
RUN bash ./install_base.sh && rm install_base.sh

COPY cmake-3.24.4-linux-x86_64.tar.gz /tmp
COPY docker/common/install_cmake.sh install_cmake.sh
RUN bash ./install_cmake.sh && rm install_cmake.sh

COPY docker/common/install_ccache.sh install_ccache.sh
RUN bash ./install_ccache.sh && rm install_ccache.sh

# Download & install internal TRT release
ARG TRT_VER CUDA_VER CUDNN_VER NCCL_VER CUBLAS_VER
COPY docker/common/install_tensorrt.sh install_tensorrt.sh
RUN bash ./install_tensorrt.sh \
    --TRT_VER=${TRT_VER} \
    --CUDA_VER=${CUDA_VER} \
    --CUDNN_VER=${CUDNN_VER} \
    --NCCL_VER=${NCCL_VER} \
    --CUBLAS_VER=${CUBLAS_VER} && \
    rm install_tensorrt.sh

# Install latest Polygraphy
COPY docker/common/install_polygraphy.sh install_polygraphy.sh
RUN bash ./install_polygraphy.sh && rm install_polygraphy.sh

# Install mpi4py
COPY docker/common/install_mpi4py.sh install_mpi4py.sh
RUN bash ./install_mpi4py.sh && rm install_mpi4py.sh

# Install PyTorch
ARG TORCH_INSTALL_TYPE="skip"
COPY docker/common/install_pytorch.sh install_pytorch.sh
RUN bash ./install_pytorch.sh $TORCH_INSTALL_TYPE && rm install_pytorch.sh

FROM devel as wheel
WORKDIR /src/tensorrt_llm
COPY benchmarks benchmarks
COPY cpp cpp
COPY scripts scripts
COPY tensorrt_llm tensorrt_llm
COPY 3rdparty 3rdparty
COPY setup.py requirements.txt requirements-dev.txt ./

RUN pip config set global.index-url https://mirrors.aliyun.com/pypi/simple/

ARG BUILD_WHEEL_ARGS="--clean --trt_root /usr/local/tensorrt"
RUN python3 scripts/build_wheel.py ${BUILD_WHEEL_ARGS}

FROM devel as release

WORKDIR /app/tensorrt_llm
COPY --from=wheel /src/tensorrt_llm/build/tensorrt_llm*.whl .
COPY --from=wheel /src/tensorrt_llm/cpp/include/ include/
RUN pip install tensorrt_llm*.whl --extra-index-url https://pypi.nvidia.com && \
    rm tensorrt_llm*.whl
COPY README.md ./
COPY examples examples

The main change is the added RUN pip config set global.index-url line in the wheel stage, which points pip at a mirror.
cd TensorRT-LLM/docker
make build
Run the commands above to build the image. In my case, the resulting image is named tensorrt-llm:v3.
Starting the container
docker run -it --gpus '"device=1"' --name trt-llm -v /home:/home tensorrt-llm:v3 bash
docker exec -it trt-llm bash
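Converting the weights
Inside the container, run:
cd examples/qwen
pip config set global.index-url https://pypi.tuna.tsinghua.edu.cn/simple
pip install -r requirements.txt
A tensorrt version conflict is reported along the way; it can be ignored.
Run the conversion: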
python3 build.py --hf_model_dir /home/Qwen-7b/ \
    --dtype bfloat16 \
    --paged_kv_cache \
    --use_gpt_attention_plugin bfloat16 \
    --enable_context_fmha \
    --use_gemm_plugin bfloat16 \
    --use_inflight_batching \
    --remove_input_padding \
    --output /home/trt_engines_qwen7b_bf16
Test:
python3 ../run.py --input_text "请你讲述一个故事" --max_output_len=64 --tokenizer_dir /home/Qwen-7b/ --engine_dir=/home/trt_engines_qwen7b_bf16
Sample output (this capture shows the input hello rather than the prompt in the command above):
/usr/local/lib/python3.10/dist-packages/tensorrt_llm/runtime/generation.py:881: UserWarning: The PyTorch API of nested tensors is in prototype stage and will change in the near future. (Triggered internally at /opt/pytorch/pytorch/aten/src/ATen/NestedTensorImpl.cpp:178.)
  torch.nested.nested_tensor(split_ids_list,
Input [Text 0]: "<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
hello<|im_end|>
<|im_start|>assistant
"
Output [Text 0 Beam 0]: "Hello! How can I help you today? Is there something you would like to talk about or ask me a question? I'm here to assist you with any information or advice you might need."
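The Input line in the log shows that the prompt is wrapped in Qwen's ChatML-style markers before it reaches the engine. The following is a minimal sketch of that wrapping, not the code used by run.py. The helper name build_chatml_prompt is hypothetical, and the newline placement is an assumption based on the usual Qwen chat layout, since the captured log does not preserve whitespace reliably.

```python
def build_chatml_prompt(user_text: str,
                        system_text: str = "You are a helpful assistant.") -> str:
    """Wrap a single user message in the ChatML layout seen in the log.

    Assumption: each role header is followed by a newline and each turn
    ends with <|im_end|> plus a newline.
    """
    return (
        f"<|im_start|>system\n{system_text}<|im_end|>\n"
        f"<|im_start|>user\n{user_text}<|im_end|>\n"
        "<|im_start|>assistant\n"
    )


if __name__ == "__main__":
    # Show the prompt that corresponds to the "hello" input in the log.
    print(build_chatml_prompt("hello"))
```

Leaving the assistant turn open at the end is what lets the model continue with its reply.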
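Inference
Building the image
Download the Triton backend code: https://github.com/triton-inference-server/tensorrtllm_backend
There are pitfalls in this step, but I did not record them while building, so they are skipped here. The final image is triton-trt-llm:v3.0.
Starting the service
Change into the directory and run the following.
Copy tensorrtllm_backend/all_models/inflight_batcher_llm to /home/tensorrtllm_backend/model_repository.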
python3 tools/fill_template.py -i /home/tensorrtllm_backend/model_repository/tensorrt_llm/config.pbtxt triton_max_batch_size:64,decoupled_mode:False,max_beam_width:1,engine_dir:/tensorrtllm_backend/model_repository/tensorrt_llm/1,max_tokens_in_paged_kv_cache:2560,max_attention_window_size:2560,kv_cache_free_gpu_mem_fraction:0.5,exclude_input_in_output:True,enable_kv_cache_reuse:False,batching_strategy:inflight_batching,max_queue_delay_microseconds:600
In that config.pbtxt:
- set batch_scheduler_policy to guaranteed_no_evict
- set enable_trt_overlap to False
- set max_num_sequences to the same value as the batch size
- set normalize_log_probs to False
- set gpt-model-type to v1
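The long key:value string passed to fill_template.py above is easy to mistype. As a rough sketch, it can be assembled from a dict instead. The keys and values below are copied from the command in this post, while format_substitutions and SETTINGS are hypothetical names introduced only for illustration. The sketch only builds the string; it does not call fill_template.py.

```python
def format_substitutions(values: dict) -> str:
    """Join settings into the comma-separated key:value form used above."""
    parts = []
    for key, value in values.items():
        text = str(value)
        # A comma inside a value would be read as a separator, so reject it.
        if "," in text:
            raise ValueError(f"value for {key!r} must not contain a comma")
        parts.append(f"{key}:{text}")
    return ",".join(parts)


# Values taken from the fill_template.py command in this post.
SETTINGS = {
    "triton_max_batch_size": 64,
    "decoupled_mode": False,
    "max_beam_width": 1,
    "engine_dir": "/tensorrtllm_backend/model_repository/tensorrt_llm/1",
    "max_tokens_in_paged_kv_cache": 2560,
    "max_attention_window_size": 2560,
    "kv_cache_free_gpu_mem_fraction": 0.5,
    "exclude_input_in_output": True,
    "enable_kv_cache_reuse": False,
    "batching_strategy": "inflight_batching",
    "max_queue_delay_microseconds": 600,
}

if __name__ == "__main__":
    print(format_substitutions(SETTINGS))
```

Keeping the settings in one dict also makes it easier to compare runs that use different batch sizes or KV cache fractions.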
In addition:
Edit model.py in both preprocessing and postprocessing: at around line 81, add self.tokenizer.eos_token = "<|endoftext|>"
Start the triton-trt-llm container
docker run --rm -it --gpus '"device=1"' --shm-size=2g --ulimit memlock=-1 --ulimit stack=67108864 -p 18000:8000 -v /home/triton-trtllm/:/tensorrtllm_backend triton-trt-llm:v3.0 bash
Starting the service
pip install tiktoken
cd /tensorrtllm_backend/tensorrtllm_backend
# --world_size is the number of GPUs you want to use for serving
python3 scripts/launch_triton_server.py --world_size=1 --model_repo=/tensorrtllm_backend/model_repository
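Loading the engine can take a while, so it helps to wait until the server reports ready before sending requests. The sketch below polls Triton's standard /v2/health/ready endpoint using only the Python standard library. The helper names http_ready and wait_until_ready are hypothetical, and the host port 18000 is an assumption taken from the -p 18000:8000 mapping in the docker run command above.

```python
import time
import urllib.error
import urllib.request


def http_ready(url: str, timeout: float = 2.0) -> bool:
    """Return True if a GET on url answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx status: not ready yet.
        return False


def wait_until_ready(url: str, attempts: int = 30, delay: float = 2.0,
                     probe=http_ready, sleep=time.sleep) -> bool:
    """Call probe(url) up to attempts times, sleeping delay seconds between tries."""
    for attempt in range(attempts):
        if probe(url):
            return True
        if attempt < attempts - 1:
            sleep(delay)
    return False
```

From the host, wait_until_ready("http://localhost:18000/v2/health/ready") would return True once the server is ready, or False after all attempts have failed. The probe and sleep parameters exist only so the loop can be exercised without a running server.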
Calling the API
curl --location 'http://localhost:18000/v2/models/ensemble/generate' \
--header 'Content-Type: application/json' \
--data '{
    "text_input": "What is machine learning?",
    "max_tokens": 64,
    "bad_words": "",
    "stop_words": ""
}'
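The same request can be sent from Python. This is a minimal sketch using only the standard library, with the URL and JSON fields copied from the curl command above. build_generate_request and generate are hypothetical helper names. The post does not show the response body, so the sketch simply returns the decoded JSON without assuming any field names.

```python
import json
import urllib.request

# Host port 18000 comes from the -p 18000:8000 mapping used in this post.
GENERATE_URL = "http://localhost:18000/v2/models/ensemble/generate"


def build_generate_request(text_input: str, max_tokens: int = 64,
                           url: str = GENERATE_URL) -> urllib.request.Request:
    """Build the POST request with the same JSON body as the curl example."""
    payload = {
        "text_input": text_input,
        "max_tokens": max_tokens,
        "bad_words": "",
        "stop_words": "",
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(text_input: str, max_tokens: int = 64,
             url: str = GENERATE_URL, timeout: float = 60.0) -> dict:
    """Send the request and return the decoded JSON response as a dict."""
    req = build_generate_request(text_input, max_tokens, url)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Calling generate("What is machine learning?") against a running server returns the parsed response. Building the request in a separate function makes the payload easy to inspect without any network access.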
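Performance
In actual tests on the A800, throughput was about half that of vLLM, and response time (RT) showed no noticeable reduction either. The A800 may differ substantially from the A100.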
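The post does not describe how throughput and RT were measured. As an illustration only, the sketch below shows one common way to summarize a load test from per-request latencies and output token counts. The function name summarize and the choice of metrics are assumptions, not the author's method.

```python
from statistics import mean


def summarize(latencies_s: list, output_tokens: list, wall_time_s: float) -> dict:
    """Summarize a load test.

    latencies_s: per-request response time in seconds
    output_tokens: per-request number of generated tokens
    wall_time_s: total elapsed time of the whole test in seconds
    """
    if not latencies_s or len(latencies_s) != len(output_tokens):
        raise ValueError("need one latency and one token count per request")
    if wall_time_s <= 0:
        raise ValueError("wall_time_s must be positive")
    return {
        "requests_per_s": len(latencies_s) / wall_time_s,
        "tokens_per_s": sum(output_tokens) / wall_time_s,
        "mean_latency_s": mean(latencies_s),
        "max_latency_s": max(latencies_s),
    }


if __name__ == "__main__":
    # Two made-up requests, purely to show the call shape.
    print(summarize([1.0, 3.0], [64, 64], wall_time_s=4.0))
```

Running the same client, prompts, and concurrency against both servers and comparing tokens_per_s and mean_latency_s gives a like-for-like comparison.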
Other:
A collection of mirror sites for use in mainland China: https://www.nenufm.com/dorthl/291/