[TRT-LLM] TRT-LLM Deployment Workflow
1. Build the TensorRT-LLM C++ libraries
cd TensorRT-LLM/cpp/build
export TRT_LIB_DIR=/usr/local/tensorrt
export TRT_INCLUDE_DIR=/usr/local/tensorrt/include/
cmake .. -DTRT_LIB_DIR=/usr/local/tensorrt -DTRT_INCLUDE_DIR=/usr/local/tensorrt/include -DBUILD_TESTS=OFF -DCMAKE_BUILD_TYPE=RELEASE
make -j16
2. Build and install the Python package
Copy the built C++ shared libraries into the Python package's lib directory:
cp -rP TensorRT-LLM/cpp/build/lib/*.so lib/
python setup.py build
python setup.py bdist_wheel
pip install dist/tensorrt_llm-0.5.0-py3-none-any.whl -i https://pypi.tuna.tsinghua.edu.cn/simple
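As a quick sanity check after installing the wheel (a minimal sketch, assuming it was installed into the active Python environment), the package can be imported and its version printed:
```python
# Sanity check that the freshly built wheel is importable.
# Assumes the wheel above was installed into the current environment.
import tensorrt_llm

print(tensorrt_llm.__version__)  # expected to match the wheel version (0.5.0 here)
```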
3. Build the TRT engine
python3 hf_qwen_convert.py --smoothquant=0.5 --in-file /workspace/models/models-hf/Qwen-7B-Chat/ --dataset-cache-dir /workspace/models/cnn_dailymail/ # smooth_quant 0.5
python3 build.py --hf_model_dir /workspace/models/models-hf/Qwen-7B-Chat/ # fp16 model
python3 build.py --use_weight_only --weight_only_precision=int8 --hf_model_dir /workspace/models/models-hf/Qwen-7B-Chat/ # int8 weight-only model
python3 api.py --tokenizer_dir /workspace/models/models-hf/Qwen-7B-Chat/
python3 cli_chat.py --tokenizer_dir /workspace/models/models-hf/Qwen-7B-Chat/
4. Build the Triton service
Model conversion and engine generation:
python3 hf_qwen_convert.py --smoothquant=0.5 --in-file /workspace/models/models-hf/Qwen-7B-Chat/ --dataset-cache-dir /workspace/models/cnn_dailymail/
python3 build.py --use_smooth_quant --per_token --per_channel --use_inflight_batching --paged_kv_cache --remove_input_padding --hf_model_dir /workspace/models/models-hf/Qwen-7B-Chat/ --output_dir qwen-7b-smooth-int8
Copy model files and adjust parameters
- Copy the .so files built under cpp into /opt/tritonserver:
cp -rP /opt/tritonserver/backends/tensorrtllm/lib/*.so.* /opt/tritonserver/backends/tensorrtllm/
cp -rP /opt/tritonserver/backends/tensorrtllm/*.so /opt/tritonserver/backends/tensorrtllm/
- Edit the pbtxt files under triton_model_repo (mainly the model paths and model type), and adjust the model config.json under the engine directory.
Set the tokenizer path and type used by the preprocessing and postprocessing models (a quick load check follows the block below):
parameters {
  key: "tokenizer_dir"
  value: {
    string_value: "/tensorrtllm_backend/triton_model_repo/tensorrt_llm/qwen_7b_chat"
  }
}
parameters {
  key: "tokenizer_type"
  value: {
    string_value: "auto"
  }
}
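To confirm that the tokenizer_dir set above is actually loadable before wiring it into Triton, a quick check with transformers can help (a minimal sketch; run it inside the container where that path exists, or point it at the local Qwen-7B-Chat directory):
```python
# Quick check that the tokenizer directory referenced in the pbtxt loads correctly.
# Qwen tokenizers require trust_remote_code=True; the path mirrors the pbtxt value above.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "/tensorrtllm_backend/triton_model_repo/tensorrt_llm/qwen_7b_chat",
    trust_remote_code=True,
)
print(tokenizer("你好"))  # should print token ids for the prompt
```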
Modify the model path and parameters:
Streaming output:
model_transaction_policy {
  decoupled: true
}
In-flight batching:
parameters: {
  key: "gpt_model_type"
  value: {
    string_value: "inflight_fused_batching" # alternative: "v1"
  }
}
Model path:
parameters: {
  key: "gpt_model_path"
  value: {
    string_value: "/tensorrtllm_backend/triton_model_repo/tensorrt_llm/1"
  }
}
Add the following parameters under plugin_config in the engine's config.json (a patch sketch follows):
  "use_context_fmha_for_generation": true,
  "use_paged_context_fmha": true,
- Start tritonserver:
tritonserver --model-repository triton_model_repo
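Once tritonserver is up, readiness can be checked from Python with the tritonclient package (a minimal sketch assuming the default HTTP port 8000 on the local host):
```python
# Check that the Triton server and the tensorrt_llm model are ready.
# Assumes the default HTTP port 8000 on the local host.
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")
print(client.is_server_ready())               # True once the server is up
print(client.is_model_ready("tensorrt_llm"))  # True once the engine is loaded
```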
5. Launch with Docker
docker run --gpus all --name trition_qwen -itd -v/home/pan/Code/TRT-hackathon/Qwen-7B-Chat-TensorRT-LLM:/workspace/Qwen-7B -v /home/pan/Public/Models:/workspace/models -p8086:8000 -p8061:8001 -p8062:8002 -w /workspace/Qwen-7B local.io/library/trtllm-torch:v1.0.1 tritonserver --model-repository triton_model_repo
Local client access:
python3 triton_client/inflight_batcher_llm_client.py --url 192.168.100.222:8061 --tokenizer_dir ~/Public/Models/models-hf/Qwen-7B-Chat/
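Besides the repo's inflight_batcher_llm_client.py, a plain tritonclient call against the ensemble model can serve as a quick test. This is only a sketch: the tensor names and dtypes (text_input, max_tokens, text_output) follow the default tensorrtllm_backend ensemble config and are assumptions here, extra required inputs (e.g. bad_words, stop_words) may be needed depending on the config, and a non-streaming ensemble call requires the tensorrt_llm model not to be in decoupled mode.
```python
# Minimal gRPC client against the "ensemble" model (non-streaming).
# Tensor names/dtypes are assumptions based on the default tensorrtllm_backend
# ensemble config.pbtxt; adjust them to match the actual triton_model_repo.
import numpy as np
import tritonclient.grpc as grpcclient

# gRPC port 8061 is mapped to the container's 8001 in the docker run above.
client = grpcclient.InferenceServerClient(url="192.168.100.222:8061")

text = np.array([["你好,请介绍一下你自己。".encode("utf-8")]], dtype=object)
max_tokens = np.array([[512]], dtype=np.int32)

inputs = [
    grpcclient.InferInput("text_input", [1, 1], "BYTES"),
    grpcclient.InferInput("max_tokens", [1, 1], "INT32"),
]
inputs[0].set_data_from_numpy(text)
inputs[1].set_data_from_numpy(max_tokens)

result = client.infer(model_name="ensemble", inputs=inputs)
print(result.as_numpy("text_output"))
```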