[TRT-LLM] TRT-LLM Deployment Workflow

1. Build the TRT-LLM C++ libraries

cd TensorRT-LLM/cpp/build
export TRT_LIB_DIR=/usr/local/tensorrt
export TRT_INCLUDE_DIR=/usr/local/tensorrt/include/
cmake .. -DTRT_LIB_DIR=/usr/local/tensorrt -DTRT_INCLUDE_DIR=/usr/local/tensorrt/include -DBUILD_TESTS=OFF -DCMAKE_BUILD_TYPE=RELEASE 
make -j16
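
A quick sanity check that the build produced the shared libraries (a minimal sketch; the path assumes the build tree used above):

```python
from pathlib import Path

# List the shared libraries produced by the C++ build; adjust the path if your
# checkout lives somewhere else.
build_lib = Path("TensorRT-LLM/cpp/build/lib")
libs = sorted(build_lib.glob("*.so*"))
print("\n".join(str(p) for p in libs) if libs else "no .so files found, check the build log")
```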

2. Build and install the Python package

Copy the compiled C++ shared libraries into the Python package's lib folder:

cp -rP TensorRT-LLM/cpp/build/lib/*.so lib/
python setup.py build
python setup.py bdist_wheel
pip install dist/tensorrt_llm-0.5.0-py3-none-any.whl  -i https://pypi.tuna.tsinghua.edu.cn/simple
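
A quick check that the wheel installed and its native libraries load:

```python
# If the .so files copied into lib/ are missing or mismatched, this import fails
# with a linker / missing-symbol error.
import tensorrt_llm

print(tensorrt_llm.__version__)  # expected: 0.5.0
```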

3. Build the TensorRT engines

python3 hf_qwen_convert.py --smoothquant=0.5 --in-file /workspace/models/models-hf/Qwen-7B-Chat/ --dataset-cache-dir /workspace/models/cnn_dailymail/  # SmoothQuant, alpha 0.5
python3 build.py --hf_model_dir /workspace/models/models-hf/Qwen-7B-Chat/  # FP16 engine
python3 build.py --use_weight_only --weight_only_precision=int8 --hf_model_dir /workspace/models/models-hf/Qwen-7B-Chat/  # INT8 weight-only engine

Quick local tests with the repo's API and CLI demos:

python3 api.py --tokenizer_dir /workspace/models/models-hf/Qwen-7B-Chat/
python3 cli_chat.py --tokenizer_dir /workspace/models/models-hf/Qwen-7B-Chat/

4. Build the Triton service

Model conversion and engine generation:

python3 hf_qwen_convert.py --smoothquant=0.5 --in-file /workspace/models/models-hf/Qwen-7B-Chat/ --dataset-cache-dir /workspace/models/cnn_dailymail/

python3 build.py --use_smooth_quant --per_token --per_channel --use_inflight_batching --paged_kv_cache --remove_input_padding --hf_model_dir /workspace/models/models-hf/Qwen-7B-Chat/ --output_dir qwen-7b-smooth-int8

Copy the model files and adjust the parameters

  1. Copy the .so files built under cpp into the tensorrtllm backend directory of /opt/tritonserver:
cp -rP /opt/tritonserver/backends/tensorrtllm/lib/*.so.* /opt/tritonserver/backends/tensorrtllm/
cp -rP /opt/tritonserver/backends/tensorrtllm/lib/*.so /opt/tritonserver/backends/tensorrtllm/
  2. Edit the pbtxt files under triton_model_repo, mainly the model path and model type, and fix up the config.json in the engine directory.

Set the tokenizer path and type in the preprocessing and postprocessing configs:

parameters {
  key: "tokenizer_dir"
  value: {
	string_value: "/tensorrtllm_backend/triton_model_repo/tensorrt_llm/qwen_7b_chat"
  }
}
parameters {
  key: "tokenizer_type"
  value: {
	string_value: "auto"
  }
}
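
For reference, tokenizer_type "auto" makes the preprocessing/postprocessing models load the tokenizer with Transformers' AutoTokenizer. A minimal sketch of the equivalent call (Qwen tokenizers need trust_remote_code=True; the path mirrors the pbtxt above):

```python
from transformers import AutoTokenizer

# Same directory as tokenizer_dir above; it must contain the Qwen tokenizer
# files (tokenizer_config.json, qwen.tiktoken, tokenization_qwen.py, ...).
tokenizer = AutoTokenizer.from_pretrained(
    "/tensorrtllm_backend/triton_model_repo/tensorrt_llm/qwen_7b_chat",
    trust_remote_code=True,
)
print(tokenizer("你好,世界")["input_ids"])
```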

Modify the model path and parameters:

Streaming output (decoupled mode):

model_transaction_policy {
  decoupled: true  
}

In-flight batching:
parameters: {
  key: "gpt_model_type"
  value: {
	string_value: "inflight_fused_batching"  # set to "V1" to fall back to the non-batching path
  }
}

Model path:
parameters: {
  key: "gpt_model_path"
  value: {
	string_value: "/tensorrtllm_backend/triton_model_repo/tensorrt_llm/1"
  }
}

Add the following parameters under plugin_config in the engine's config.json:
"use_context_fmha_for_generation": true,
"use_paged_context_fmha": true,

  3. Start tritonserver (a readiness check follows below):
    tritonserver --model-repository triton_model_repo
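
Once it is up, Triton's standard HTTP health endpoint can be polled to confirm the server and all models are ready (HTTP port 8000 inside the container):

```python
import requests

# 200 means the server and every model in the repository loaded successfully.
resp = requests.get("http://localhost:8000/v2/health/ready", timeout=5)
print("ready" if resp.status_code == 200 else f"not ready (HTTP {resp.status_code})")
```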

5. Start with Docker

docker run --gpus all --name triton_qwen -itd -v /home/pan/Code/TRT-hackathon/Qwen-7B-Chat-TensorRT-LLM:/workspace/Qwen-7B -v /home/pan/Public/Models:/workspace/models -p 8086:8000 -p 8061:8001 -p 8062:8002 -w /workspace/Qwen-7B local.io/library/trtllm-torch:v1.0.1 tritonserver --model-repository triton_model_repo

Access from a local client (gRPC on host port 8061, mapped to the container's 8001):
python3 triton_client/inflight_batcher_llm_client.py --url 192.168.100.222:8061 --tokenizer_dir ~/Public/Models/models-hf/Qwen-7B-Chat/
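
Besides the bundled script, here is a minimal tritonclient sketch for the decoupled (streaming) setup. The tensor names (text_input, max_tokens, bad_words, stop_words, stream, text_output) follow the stock tensorrtllm_backend ensemble config and may differ in this repo's config.pbtxt, so treat them as assumptions:

```python
import numpy as np
import tritonclient.grpc as grpcclient

URL = "192.168.100.222:8061"  # host gRPC port mapped by the docker run above

def on_response(result, error):
    # Called once per streamed chunk when the model runs in decoupled mode.
    if error is not None:
        print("error:", error)
    else:
        print(result.as_numpy("text_output"))

def make_input(name, data, dtype):
    tensor = grpcclient.InferInput(name, list(data.shape), dtype)
    tensor.set_data_from_numpy(data)
    return tensor

inputs = [
    make_input("text_input", np.array([["你好,请介绍一下你自己。"]], dtype=object), "BYTES"),
    make_input("max_tokens", np.array([[128]], dtype=np.int32), "INT32"),
    make_input("bad_words", np.array([[""]], dtype=object), "BYTES"),
    make_input("stop_words", np.array([[""]], dtype=object), "BYTES"),
    make_input("stream", np.array([[True]], dtype=bool), "BOOL"),
]

client = grpcclient.InferenceServerClient(url=URL)
client.start_stream(callback=on_response)
client.async_stream_infer(model_name="ensemble", inputs=inputs, request_id="1")
# Half-close the stream; remaining responses are delivered to the callback first.
client.stop_stream()
client.close()
```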
