LMDeploy Quantization and Deployment of LLM & LVM in Practice: InternLM Practical Camp Season 2, Lesson 5 Assignment
InternLM Practical Camp Season 2, Lesson 5 Assignment
This page covers all the steps for the Lesson 5 assignment of the Practical Camp (Season 2). For background on model quantization and deployment, see the study notes.
Assignment Requirements
Basic Assignment
Complete the following tasks and record the process with screenshots:
- Set up the lmdeploy runtime environment
- Download the internlm2-chat-1.8b model
- Chat with the model from the command line
Advanced Assignment
Complete the following tasks and record the process with screenshots:
- Set the maximum KV Cache ratio to 0.4, enable W4A16 quantization, and chat with the model from the command line.
- Launch lmdeploy as an API server with W4A16 quantization and the KV Cache ratio set to 0.4, then chat with the model using both the command-line client and the Gradio web client.
- Run the internlm2-chat-1.8b model through Python code integration, with W4A16 quantization and the KV Cache ratio set to 0.4.
- Run the llava visual multimodal model's Gradio demo with LMDeploy
- Deploy the LMDeploy Web Demo to OpenXLab (the OpenXLab cuda 12.2 image is not ready yet; skip this for now and come back in a week)
Quantizing an LLM with LMDeploy
Creating the Environment
The cuda11.7-conda image has compatibility issues with newer versions of lmdeploy, so we need to create a new dev machine from the cuda12.2-conda image, with a 10% A100 GPU.
Also, unlike the previous assignments, the environment built here with studio-conda is based on the prebuilt pytorch-2.1.2 environment rather than internlm-base. Apart from PyTorch it is essentially a bare environment, which means that for local use you can simply create an empty conda environment with python=3.10.
studio-conda -t lmdeploy -o pytorch-2.1.2
Click to view the full package list of the pytorch-2.1.2 environment
# packages in environment at /root/.conda/envs/lmdeploy:
#
# Name Version Build Channel
_libgcc_mutex 0.1 main defaults
_openmp_mutex 5.1 1_gnu defaults
asttokens 2.4.1 pypi_0 pypi
blas 1.0 mkl defaults
brotli-python 1.0.9 py310h6a678d5_7 defaults
bzip2 1.0.8 h5eee18b_5 defaults
ca-certificates 2024.3.11 h06a4308_0 defaults
certifi 2024.2.2 py310h06a4308_0 defaults
charset-normalizer 2.0.4 pyhd3eb1b0_0 defaults
comm 0.2.2 pypi_0 pypi
cuda-cudart 12.1.105 0 nvidia
cuda-cupti 12.1.105 0 nvidia
cuda-libraries 12.1.0 0 nvidia
cuda-nvrtc 12.1.105 0 nvidia
cuda-nvtx 12.1.105 0 nvidia
cuda-opencl 12.4.127 0 nvidia
cuda-runtime 12.1.0 0 nvidia
debugpy 1.8.1 pypi_0 pypi
decorator 5.1.1 pypi_0 pypi
einops 0.7.0 pypi_0 pypi
exceptiongroup 1.2.0 pypi_0 pypi
executing 2.0.1 pypi_0 pypi
ffmpeg 4.3 hf484d3e_0 pytorch
filelock 3.13.1 py310h06a4308_0 defaults
freetype 2.12.1 h4a9f257_0 defaults
gmp 6.2.1 h295c915_3 defaults
gmpy2 2.1.2 py310heeb90bb_0 defaults
gnutls 3.6.15 he1e5248_0 defaults
idna 3.4 py310h06a4308_0 defaults
intel-openmp 2023.1.0 hdb19cb5_46306 defaults
ipykernel 6.29.4 pypi_0 pypi
ipython 8.23.0 pypi_0 pypi
jedi 0.19.1 pypi_0 pypi
jinja2 3.1.3 py310h06a4308_0 defaults
jpeg 9e h5eee18b_1 defaults
jupyter-client 8.6.1 pypi_0 pypi
jupyter-core 5.7.2 pypi_0 pypi
lame 3.100 h7b6447c_0 defaults
lcms2 2.12 h3be6417_0 defaults
ld_impl_linux-64 2.38 h1181459_1 defaults
lerc 3.0 h295c915_0 defaults
libcublas 12.1.0.26 0 nvidia
libcufft 11.0.2.4 0 nvidia
libcufile 1.9.0.20 0 nvidia
libcurand 10.3.5.119 0 nvidia
libcusolver 11.4.4.55 0 nvidia
libcusparse 12.0.2.55 0 nvidia
libdeflate 1.17 h5eee18b_1 defaults
libffi 3.4.4 h6a678d5_0 defaults
libgcc-ng 11.2.0 h1234567_1 defaults
libgomp 11.2.0 h1234567_1 defaults
libiconv 1.16 h7f8727e_2 defaults
libidn2 2.3.4 h5eee18b_0 defaults
libjpeg-turbo 2.0.0 h9bf148f_0 pytorch
libnpp 12.0.2.50 0 nvidia
libnvjitlink 12.1.105 0 nvidia
libnvjpeg 12.1.1.14 0 nvidia
libpng 1.6.39 h5eee18b_0 defaults
libstdcxx-ng 11.2.0 h1234567_1 defaults
libtasn1 4.19.0 h5eee18b_0 defaults
libtiff 4.5.1 h6a678d5_0 defaults
libunistring 0.9.10 h27cfd23_0 defaults
libuuid 1.41.5 h5eee18b_0 defaults
libwebp-base 1.3.2 h5eee18b_0 defaults
llvm-openmp 14.0.6 h9e868ea_0 defaults
lz4-c 1.9.4 h6a678d5_0 defaults
markupsafe 2.1.3 py310h5eee18b_0 defaults
matplotlib-inline 0.1.6 pypi_0 pypi
mkl 2023.1.0 h213fc3f_46344 defaults
mkl-service 2.4.0 py310h5eee18b_1 defaults
mkl_fft 1.3.8 py310h5eee18b_0 defaults
mkl_random 1.2.4 py310hdb19cb5_0 defaults
mpc 1.1.0 h10f8cd9_1 defaults
mpfr 4.0.2 hb69a4c5_1 defaults
mpmath 1.3.0 py310h06a4308_0 defaults
ncurses 6.4 h6a678d5_0 defaults
nest-asyncio 1.6.0 pypi_0 pypi
nettle 3.7.3 hbbd107a_1 defaults
networkx 3.1 py310h06a4308_0 defaults
numpy 1.26.4 py310h5f9d8c6_0 defaults
numpy-base 1.26.4 py310hb5e798b_0 defaults
openh264 2.1.1 h4ff587b_0 defaults
openjpeg 2.4.0 h3ad879b_0 defaults
openssl 3.0.13 h7f8727e_0 defaults
packaging 24.0 pypi_0 pypi
parso 0.8.4 pypi_0 pypi
pexpect 4.9.0 pypi_0 pypi
pillow 10.2.0 py310h5eee18b_0 defaults
pip 23.3.1 py310h06a4308_0 defaults
platformdirs 4.2.0 pypi_0 pypi
prompt-toolkit 3.0.43 pypi_0 pypi
protobuf 5.26.1 pypi_0 pypi
psutil 5.9.8 pypi_0 pypi
ptyprocess 0.7.0 pypi_0 pypi
pure-eval 0.2.2 pypi_0 pypi
pygments 2.17.2 pypi_0 pypi
pysocks 1.7.1 py310h06a4308_0 defaults
python 3.10.14 h955ad1f_0 defaults
python-dateutil 2.9.0.post0 pypi_0 pypi
pytorch 2.1.2 py3.10_cuda12.1_cudnn8.9.2_0 pytorch
pytorch-cuda 12.1 ha16c6d3_5 pytorch
pytorch-mutex 1.0 cuda pytorch
pyyaml 6.0.1 py310h5eee18b_0 defaults
pyzmq 25.1.2 pypi_0 pypi
readline 8.2 h5eee18b_0 defaults
requests 2.31.0 py310h06a4308_1 defaults
setuptools 68.2.2 py310h06a4308_0 defaults
six 1.16.0 pypi_0 pypi
sqlite 3.41.2 h5eee18b_0 defaults
stack-data 0.6.3 pypi_0 pypi
sympy 1.12 py310h06a4308_0 defaults
tbb 2021.8.0 hdb19cb5_0 defaults
tk 8.6.12 h1ccaba5_0 defaults
torchaudio 2.1.2 py310_cu121 pytorch
torchtriton 2.1.0 py310 pytorch
torchvision 0.16.2 py310_cu121 pytorch
tornado 6.4 pypi_0 pypi
traitlets 5.14.2 pypi_0 pypi
typing_extensions 4.9.0 py310h06a4308_1 defaults
tzdata 2024a h04d1e81_0 defaults
urllib3 2.1.0 py310h06a4308_1 defaults
wcwidth 0.2.13 pypi_0 pypi
wheel 0.41.2 py310h06a4308_0 defaults
xz 5.4.6 h5eee18b_0 defaults
yaml 0.2.5 h7b6447c_0 defaults
zlib 1.2.13 h5eee18b_0 defaults
zstd 1.5.5 hc292b87_0 defaults
Next, activate the virtual environment we just created and install lmdeploy 0.3.0, then wait for the installation to finish.
conda activate lmdeploy
pip install lmdeploy[all]==0.3.0
Click to view the full package list of the lmdeploy environment
# packages in environment at /root/.conda/envs/lmdeploy:
#
# Name Version Build Channel
_libgcc_mutex 0.1 main defaults
_openmp_mutex 5.1 1_gnu defaults
accelerate 0.29.1 pypi_0 pypi
addict 2.4.0 pypi_0 pypi
aiofiles 23.2.1 pypi_0 pypi
aiohttp 3.9.3 pypi_0 pypi
aiosignal 1.3.1 pypi_0 pypi
altair 5.3.0 pypi_0 pypi
annotated-types 0.6.0 pypi_0 pypi
anyio 4.3.0 pypi_0 pypi
asttokens 2.4.1 pypi_0 pypi
async-timeout 4.0.3 pypi_0 pypi
attrs 23.2.0 pypi_0 pypi
blas 1.0 mkl defaults
brotli-python 1.0.9 py310h6a678d5_7 defaults
bzip2 1.0.8 h5eee18b_5 defaults
ca-certificates 2024.3.11 h06a4308_0 defaults
certifi 2024.2.2 py310h06a4308_0 defaults
charset-normalizer 2.0.4 pyhd3eb1b0_0 defaults
click 8.1.7 pypi_0 pypi
comm 0.2.2 pypi_0 pypi
contourpy 1.2.1 pypi_0 pypi
cuda-cudart 12.1.105 0 nvidia
cuda-cupti 12.1.105 0 nvidia
cuda-libraries 12.1.0 0 nvidia
cuda-nvrtc 12.1.105 0 nvidia
cuda-nvtx 12.1.105 0 nvidia
cuda-opencl 12.4.127 0 nvidia
cuda-runtime 12.1.0 0 nvidia
cycler 0.12.1 pypi_0 pypi
datasets 2.18.0 pypi_0 pypi
debugpy 1.8.1 pypi_0 pypi
decorator 5.1.1 pypi_0 pypi
dill 0.3.8 pypi_0 pypi
einops 0.7.0 pypi_0 pypi
exceptiongroup 1.2.0 pypi_0 pypi
executing 2.0.1 pypi_0 pypi
fastapi 0.110.1 pypi_0 pypi
ffmpeg 4.3 hf484d3e_0 pytorch
ffmpy 0.3.2 pypi_0 pypi
filelock 3.13.1 py310h06a4308_0 defaults
fire 0.6.0 pypi_0 pypi
fonttools 4.51.0 pypi_0 pypi
freetype 2.12.1 h4a9f257_0 defaults
frozenlist 1.4.1 pypi_0 pypi
fsspec 2024.2.0 pypi_0 pypi
gmp 6.2.1 h295c915_3 defaults
gmpy2 2.1.2 py310heeb90bb_0 defaults
gnutls 3.6.15 he1e5248_0 defaults
gradio 3.50.2 pypi_0 pypi
gradio-client 0.6.1 pypi_0 pypi
grpcio 1.62.1 pypi_0 pypi
h11 0.14.0 pypi_0 pypi
httpcore 1.0.5 pypi_0 pypi
httpx 0.27.0 pypi_0 pypi
huggingface-hub 0.22.2 pypi_0 pypi
idna 3.4 py310h06a4308_0 defaults
importlib-metadata 7.1.0 pypi_0 pypi
importlib-resources 6.4.0 pypi_0 pypi
intel-openmp 2023.1.0 hdb19cb5_46306 defaults
ipykernel 6.29.4 pypi_0 pypi
ipython 8.23.0 pypi_0 pypi
jedi 0.19.1 pypi_0 pypi
jinja2 3.1.3 py310h06a4308_0 defaults
jpeg 9e h5eee18b_1 defaults
jsonschema 4.21.1 pypi_0 pypi
jsonschema-specifications 2023.12.1 pypi_0 pypi
jupyter-client 8.6.1 pypi_0 pypi
jupyter-core 5.7.2 pypi_0 pypi
kiwisolver 1.4.5 pypi_0 pypi
lame 3.100 h7b6447c_0 defaults
lcms2 2.12 h3be6417_0 defaults
ld_impl_linux-64 2.38 h1181459_1 defaults
lerc 3.0 h295c915_0 defaults
libcublas 12.1.0.26 0 nvidia
libcufft 11.0.2.4 0 nvidia
libcufile 1.9.0.20 0 nvidia
libcurand 10.3.5.119 0 nvidia
libcusolver 11.4.4.55 0 nvidia
libcusparse 12.0.2.55 0 nvidia
libdeflate 1.17 h5eee18b_1 defaults
libffi 3.4.4 h6a678d5_0 defaults
libgcc-ng 11.2.0 h1234567_1 defaults
libgomp 11.2.0 h1234567_1 defaults
libiconv 1.16 h7f8727e_2 defaults
libidn2 2.3.4 h5eee18b_0 defaults
libjpeg-turbo 2.0.0 h9bf148f_0 pytorch
libnpp 12.0.2.50 0 nvidia
libnvjitlink 12.1.105 0 nvidia
libnvjpeg 12.1.1.14 0 nvidia
libpng 1.6.39 h5eee18b_0 defaults
libstdcxx-ng 11.2.0 h1234567_1 defaults
libtasn1 4.19.0 h5eee18b_0 defaults
libtiff 4.5.1 h6a678d5_0 defaults
libunistring 0.9.10 h27cfd23_0 defaults
libuuid 1.41.5 h5eee18b_0 defaults
libwebp-base 1.3.2 h5eee18b_0 defaults
llvm-openmp 14.0.6 h9e868ea_0 defaults
lmdeploy 0.3.0 pypi_0 pypi
lz4-c 1.9.4 h6a678d5_0 defaults
markdown-it-py 3.0.0 pypi_0 pypi
markupsafe 2.1.3 py310h5eee18b_0 defaults
matplotlib 3.8.4 pypi_0 pypi
matplotlib-inline 0.1.6 pypi_0 pypi
mdurl 0.1.2 pypi_0 pypi
mkl 2023.1.0 h213fc3f_46344 defaults
mkl-service 2.4.0 py310h5eee18b_1 defaults
mkl_fft 1.3.8 py310h5eee18b_0 defaults
mkl_random 1.2.4 py310hdb19cb5_0 defaults
mmengine-lite 0.10.3 pypi_0 pypi
mpc 1.1.0 h10f8cd9_1 defaults
mpfr 4.0.2 hb69a4c5_1 defaults
mpmath 1.3.0 py310h06a4308_0 defaults
multidict 6.0.5 pypi_0 pypi
multiprocess 0.70.16 pypi_0 pypi
ncurses 6.4 h6a678d5_0 defaults
nest-asyncio 1.6.0 pypi_0 pypi
nettle 3.7.3 hbbd107a_1 defaults
networkx 3.1 py310h06a4308_0 defaults
numpy 1.26.4 py310h5f9d8c6_0 defaults
numpy-base 1.26.4 py310hb5e798b_0 defaults
nvidia-cublas-cu12 12.4.5.8 pypi_0 pypi
nvidia-cuda-runtime-cu12 12.4.127 pypi_0 pypi
nvidia-curand-cu12 10.3.5.147 pypi_0 pypi
nvidia-nccl-cu12 2.21.5 pypi_0 pypi
openh264 2.1.1 h4ff587b_0 defaults
openjpeg 2.4.0 h3ad879b_0 defaults
openssl 3.0.13 h7f8727e_0 defaults
orjson 3.10.0 pypi_0 pypi
packaging 24.0 pypi_0 pypi
pandas 2.2.1 pypi_0 pypi
parso 0.8.4 pypi_0 pypi
peft 0.9.0 pypi_0 pypi
pexpect 4.9.0 pypi_0 pypi
pillow 10.2.0 py310h5eee18b_0 defaults
pip 23.3.1 py310h06a4308_0 defaults
platformdirs 4.2.0 pypi_0 pypi
prompt-toolkit 3.0.43 pypi_0 pypi
protobuf 4.25.3 pypi_0 pypi
psutil 5.9.8 pypi_0 pypi
ptyprocess 0.7.0 pypi_0 pypi
pure-eval 0.2.2 pypi_0 pypi
pyarrow 15.0.2 pypi_0 pypi
pyarrow-hotfix 0.6 pypi_0 pypi
pybind11 2.12.0 pypi_0 pypi
pydantic 2.6.4 pypi_0 pypi
pydantic-core 2.16.3 pypi_0 pypi
pydub 0.25.1 pypi_0 pypi
pygments 2.17.2 pypi_0 pypi
pynvml 11.5.0 pypi_0 pypi
pyparsing 3.1.2 pypi_0 pypi
pysocks 1.7.1 py310h06a4308_0 defaults
python 3.10.14 h955ad1f_0 defaults
python-dateutil 2.9.0.post0 pypi_0 pypi
python-multipart 0.0.9 pypi_0 pypi
python-rapidjson 1.16 pypi_0 pypi
pytorch 2.1.2 py3.10_cuda12.1_cudnn8.9.2_0 pytorch
pytorch-cuda 12.1 ha16c6d3_5 pytorch
pytorch-mutex 1.0 cuda pytorch
pytz 2024.1 pypi_0 pypi
pyyaml 6.0.1 py310h5eee18b_0 defaults
pyzmq 25.1.2 pypi_0 pypi
readline 8.2 h5eee18b_0 defaults
referencing 0.34.0 pypi_0 pypi
regex 2023.12.25 pypi_0 pypi
requests 2.31.0 py310h06a4308_1 defaults
rich 13.7.1 pypi_0 pypi
rpds-py 0.18.0 pypi_0 pypi
safetensors 0.4.2 pypi_0 pypi
semantic-version 2.10.0 pypi_0 pypi
sentencepiece 0.2.0 pypi_0 pypi
setuptools 68.2.2 py310h06a4308_0 defaults
shortuuid 1.0.13 pypi_0 pypi
six 1.16.0 pypi_0 pypi
sniffio 1.3.1 pypi_0 pypi
sqlite 3.41.2 h5eee18b_0 defaults
stack-data 0.6.3 pypi_0 pypi
starlette 0.37.2 pypi_0 pypi
sympy 1.12 py310h06a4308_0 defaults
tbb 2021.8.0 hdb19cb5_0 defaults
termcolor 2.4.0 pypi_0 pypi
tiktoken 0.6.0 pypi_0 pypi
tk 8.6.12 h1ccaba5_0 defaults
tokenizers 0.15.2 pypi_0 pypi
tomli 2.0.1 pypi_0 pypi
toolz 0.12.1 pypi_0 pypi
torchaudio 2.1.2 py310_cu121 pytorch
torchtriton 2.1.0 py310 pytorch
torchvision 0.16.2 py310_cu121 pytorch
tornado 6.4 pypi_0 pypi
tqdm 4.66.2 pypi_0 pypi
traitlets 5.14.2 pypi_0 pypi
transformers 4.38.2 pypi_0 pypi
transformers-stream-generator 0.0.5 pypi_0 pypi
tritonclient 2.44.0 pypi_0 pypi
typing_extensions 4.9.0 py310h06a4308_1 defaults
tzdata 2024.1 pypi_0 pypi
urllib3 2.1.0 py310h06a4308_1 defaults
uvicorn 0.29.0 pypi_0 pypi
wcwidth 0.2.13 pypi_0 pypi
websockets 11.0.3 pypi_0 pypi
wheel 0.41.2 py310h06a4308_0 defaults
xxhash 3.4.1 pypi_0 pypi
xz 5.4.6 h5eee18b_0 defaults
yaml 0.2.5 h7b6447c_0 defaults
yapf 0.40.2 pypi_0 pypi
yarl 1.9.4 pypi_0 pypi
zipp 3.18.1 pypi_0 pypi
zlib 1.2.13 h5eee18b_0 defaults
zstd 1.5.5 hc292b87_0 defaults
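Before moving on to the model, a quick sanity check that the environment sees the GPU and that the expected lmdeploy version imports cleanly can save debugging later. This is a minimal sketch; the __version__ attribute is assumed to be exposed by the package, and pip show lmdeploy works just as well:
import torch
import lmdeploy

print("lmdeploy:", lmdeploy.__version__)            # expect 0.3.0
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))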
Downloading the Model
As before, create a symlink for the internlm2-chat-1_8b model. The linked path differs slightly from the previous hands-on sessions.
cd ~
ln -s /root/share/new_models/Shanghai_AI_Laboratory/internlm2-chat-1_8b /root/
Inference Test Before Quantization
This code is essentially the same as the first demo in assignment 2: load the model, then call model.chat() to get its output. The main purpose of this step is to check that the model responds normally and to get a feel for its inference speed.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# load the tokenizer and the fp16 model onto the GPU
tokenizer = AutoTokenizer.from_pretrained("/root/internlm2-chat-1_8b", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("/root/internlm2-chat-1_8b", torch_dtype=torch.float16, trust_remote_code=True).cuda()
model = model.eval()

# two quick chats to confirm the model responds normally
inp = "hello"
print("[INPUT]", inp)
response, history = model.chat(tokenizer, inp, history=[])
print("[OUTPUT]", response)

inp = "please provide three suggestions about time management"
print("[INPUT]", inp)
response, history = model.chat(tokenizer, inp, history=history)
print("[OUTPUT]", response)
The output is as follows:
GPU usage:
With a small modification to the code above, we can measure the model's speed before compression:
# python benchmark_transformer.py
import torch
import datetime
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("/root/internlm2-chat-1_8b", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("/root/internlm2-chat-1_8b", torch_dtype=torch.float16, trust_remote_code=True).cuda()
model = model.eval()

# warmup
inp = "hello"
for i in range(5):
    print("Warm up...[{}/5]".format(i+1))
    response, history = model.chat(tokenizer, inp, history=[])

# test speed
inp = "请介绍一下你自己。"
times = 10
total_words = 0
start_time = datetime.datetime.now()
for i in range(times):
    response, history = model.chat(tokenizer, inp, history=history)
    total_words += len(response)
end_time = datetime.datetime.now()

delta_time = end_time - start_time
delta_time = delta_time.seconds + delta_time.microseconds / 1000000.0
speed = total_words / delta_time
print("Speed: {:.4f} words/s".format(speed))
Remember this figure, 16.4092 words/s; we will compare against it later.
Chatting with the Model: lmdeploy chat
With the lmdeploy chat command you can talk to the model directly from the command line, and inference is noticeably faster:
lmdeploy chat /root/internlm2-chat-1_8b
GPU usage:
In lmdeploy, you need to press Enter twice to submit input to the model; typing "exit" and pressing Enter twice exits the conversation.
The command takes many arguments; you can check the help with lmdeploy chat -h, which outputs:
usage: lmdeploy chat [-h] [--backend {pytorch,turbomind}] [--trust-remote-code]
[--meta-instruction META_INSTRUCTION] [--cap {completion,infilling,chat,python}]
[--adapters [ADAPTERS ...]] [--tp TP] [--model-name MODEL_NAME]
[--session-len SESSION_LEN] [--max-batch-size MAX_BATCH_SIZE]
[--cache-max-entry-count CACHE_MAX_ENTRY_COUNT] [--model-format {hf,llama,awq}]
[--quant-policy QUANT_POLICY] [--rope-scaling-factor ROPE_SCALING_FACTOR]
model_path
Chat with pytorch or turbomind engine.
positional arguments:
model_path The path of a model. it could be one of the following options: - i) a local
directory path of a turbomind model which is converted by `lmdeploy convert`
command or download from ii) and iii). - ii) the model_id of a lmdeploy-
quantized model hosted inside a model repo on huggingface.co, such as
"internlm/internlm-chat-20b-4bit", "lmdeploy/llama2-chat-70b-4bit", etc. -
iii) the model_id of a model hosted inside a model repo on huggingface.co,
such as "internlm/internlm-chat-7b", "qwen/qwen-7b-chat ", "baichuan-
inc/baichuan2-7b-chat" and so on. Type: str
options:
-h, --help show this help message and exit
--backend {pytorch,turbomind}
Set the inference backend. Default: turbomind. Type: str
--trust-remote-code Trust remote code for loading hf models. Default: True
--meta-instruction META_INSTRUCTION
System prompt for ChatTemplateConfig. Deprecated. Please use --chat-template
instead. Default: None. Type: str
--cap {completion,infilling,chat,python}
The capability of a model. Deprecated. Please use --chat-template instead.
Default: chat. Type: str
PyTorch engine arguments:
--adapters [ADAPTERS ...]
Used to set path(s) of lora adapter(s). One can input key-value pairs in
xxx=yyy format for multiple lora adapters. If only have one adapter, one can
only input the path of the adapter.. Default: None. Type: str
--tp TP GPU number used in tensor parallelism. Should be 2^n. Default: 1. Type: int
--model-name MODEL_NAME
The name of the to-be-deployed model, such as llama-7b, llama-13b, vicuna-7b
and etc. You can run `lmdeploy list` to get the supported model names.
Default: None. Type: str
--session-len SESSION_LEN
The max session length of a sequence. Default: None. Type: int
--max-batch-size MAX_BATCH_SIZE
Maximum batch size. Default: 128. Type: int
--cache-max-entry-count CACHE_MAX_ENTRY_COUNT
The percentage of gpu memory occupied by the k/v cache. Default: 0.8. Type:
float
TurboMind engine arguments:
--tp TP GPU number used in tensor parallelism. Should be 2^n. Default: 1. Type: int
--model-name MODEL_NAME
The name of the to-be-deployed model, such as llama-7b, llama-13b, vicuna-7b
and etc. You can run `lmdeploy list` to get the supported model names.
Default: None. Type: str
--session-len SESSION_LEN
The max session length of a sequence. Default: None. Type: int
--max-batch-size MAX_BATCH_SIZE
Maximum batch size. Default: 128. Type: int
--cache-max-entry-count CACHE_MAX_ENTRY_COUNT
The percentage of gpu memory occupied by the k/v cache. Default: 0.8. Type:
float
--model-format {hf,llama,awq}
The format of input model. `hf` meaning `hf_llama`, `llama` meaning
`meta_llama`, `awq` meaning the quantized model by awq. Default: None. Type:
str
--quant-policy QUANT_POLICY
Whether to use kv int8. Default: 0. Type: int
--rope-scaling-factor ROPE_SCALING_FACTOR
Rope scaling factor. Default: 0.0. Type: float
Note the --cache-max-entry-count argument, which controls the maximum fraction of the remaining GPU memory that the KV cache may occupy; the default ratio is 0.8. This means the later tasks only require changing this one parameter.
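The same knob is exposed in the Python API used later in this post: TurbomindEngineConfig.cache_max_entry_count corresponds to the --cache-max-entry-count flag. A minimal sketch (the 0.4 value mirrors the assignment requirement):
from lmdeploy import pipeline, TurbomindEngineConfig

# cap the KV cache at 40% of the free GPU memory instead of the default 0.8
engine_cfg = TurbomindEngineConfig(cache_max_entry_count=0.4)
pipe = pipeline('/root/internlm2-chat-1_8b', backend_config=engine_cfg)
print(pipe(['hello'])[0].text)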
Model Quantization and Calibration: lmdeploy lite
Before quantizing, install the einops library:
pip install einops==0.7.0
Then run
lmdeploy lite auto_awq \
/root/internlm2-chat-1_8b \
--calib-dataset 'ptb' \
--calib-samples 128 \
--calib-seqlen 1024 \
--w-bits 4 \
--w-group-size 128 \
--work-dir /root/internlm2-chat-1_8b-4bit
to complete the model quantization. The command applies the AWQ algorithm to quantize the model weights to 4 bits. The TurboMind inference engine ships efficient 4-bit CUDA kernels, delivering more than 2.4x the performance of FP16. This step takes a very long time. When quantization finishes, the new HF-format model is saved to the /root/internlm2-chat-1_8b-4bit directory.
Click to view the output of this command
(lmdeploy) root@intern-studio-160311:~# lmdeploy lite auto_awq \
> /root/internlm2-chat-1_8b \
> --calib-dataset 'ptb' \
> --calib-samples 128 \
> --calib-seqlen 1024 \
> --w-bits 4 \
> --w-group-size 128 \
> --work-dir /root/internlm2-chat-1_8b-4bit
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████| 2/2 [00:35<00:00, 17.60s/it]
Move model.tok_embeddings to GPU.
Move model.layers.0 to CPU.
Move model.layers.1 to CPU.
Move model.layers.2 to CPU.
Move model.layers.3 to CPU.
Move model.layers.4 to CPU.
Move model.layers.5 to CPU.
Move model.layers.6 to CPU.
Move model.layers.7 to CPU.
Move model.layers.8 to CPU.
Move model.layers.9 to CPU.
Move model.layers.10 to CPU.
Move model.layers.11 to CPU.
Move model.layers.12 to CPU.
Move model.layers.13 to CPU.
Move model.layers.14 to CPU.
Move model.layers.15 to CPU.
Move model.layers.16 to CPU.
Move model.layers.17 to CPU.
Move model.layers.18 to CPU.
Move model.layers.19 to CPU.
Move model.layers.20 to CPU.
Move model.layers.21 to CPU.
Move model.layers.22 to CPU.
Move model.layers.23 to CPU.
Move model.norm to GPU.
Move output to CPU.
Loading calibrate dataset ...
/root/.conda/envs/lmdeploy/lib/python3.10/site-packages/datasets/load.py:1461: FutureWarning: The repository for ptb_text_only contains custom code which must be executed to correctly load the dataset. You can inspect the repository content at <https://hf.co/datasets/ptb_text_only>
You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this dataset from the next major release of `datasets`.
warnings.warn(
Downloading builder script: 6.50kB [00:00, 24.9MB/s]
Downloading readme: 4.21kB [00:00, 19.9MB/s]
Downloading data: 5.10MB [01:05, 78.1kB/s]
Downloading data: 400kB [00:00, 402kB/s]
Downloading data: 450kB [00:09, 48.3kB/s]
Generating train split: 100%|██████████████████████████████████████████████████| 42068/42068 [00:00<00:00, 88086.81 examples/s]
Generating test split: 100%|████████████████████████████████████████████████████| 3761/3761 [00:00<00:00, 100075.98 examples/s]
Generating validation split: 100%|██████████████████████████████████████████████| 3370/3370 [00:00<00:00, 100399.93 examples/s]
model.layers.0, samples: 128, max gpu memory: 2.25 GB
model.layers.1, samples: 128, max gpu memory: 2.75 GB
model.layers.2, samples: 128, max gpu memory: 2.75 GB
model.layers.3, samples: 128, max gpu memory: 2.75 GB
model.layers.4, samples: 128, max gpu memory: 2.75 GB
model.layers.5, samples: 128, max gpu memory: 2.75 GB
model.layers.6, samples: 128, max gpu memory: 2.75 GB
model.layers.7, samples: 128, max gpu memory: 2.75 GB
model.layers.8, samples: 128, max gpu memory: 2.75 GB
model.layers.9, samples: 128, max gpu memory: 2.75 GB
model.layers.10, samples: 128, max gpu memory: 2.75 GB
model.layers.11, samples: 128, max gpu memory: 2.75 GB
model.layers.12, samples: 128, max gpu memory: 2.75 GB
model.layers.13, samples: 128, max gpu memory: 2.75 GB
model.layers.14, samples: 128, max gpu memory: 2.75 GB
model.layers.15, samples: 128, max gpu memory: 2.75 GB
model.layers.16, samples: 128, max gpu memory: 2.75 GB
model.layers.17, samples: 128, max gpu memory: 2.75 GB
model.layers.18, samples: 128, max gpu memory: 2.75 GB
model.layers.19, samples: 128, max gpu memory: 2.75 GB
model.layers.20, samples: 128, max gpu memory: 2.75 GB
model.layers.21, samples: 128, max gpu memory: 2.75 GB
model.layers.22, samples: 128, max gpu memory: 2.75 GB
model.layers.23, samples: 128, max gpu memory: 2.75 GB
model.layers.0 smooth weight done.
model.layers.1 smooth weight done.
model.layers.2 smooth weight done.
model.layers.3 smooth weight done.
model.layers.4 smooth weight done.
model.layers.5 smooth weight done.
model.layers.6 smooth weight done.
model.layers.7 smooth weight done.
model.layers.8 smooth weight done.
model.layers.9 smooth weight done.
model.layers.10 smooth weight done.
model.layers.11 smooth weight done.
model.layers.12 smooth weight done.
model.layers.13 smooth weight done.
model.layers.14 smooth weight done.
model.layers.15 smooth weight done.
model.layers.16 smooth weight done.
model.layers.17 smooth weight done.
model.layers.18 smooth weight done.
model.layers.19 smooth weight done.
model.layers.20 smooth weight done.
model.layers.21 smooth weight done.
model.layers.22 smooth weight done.
model.layers.23 smooth weight done.
model.layers.0.attention.wqkv weight packed.
model.layers.0.attention.wo weight packed.
model.layers.0.feed_forward.w1 weight packed.
model.layers.0.feed_forward.w3 weight packed.
model.layers.0.feed_forward.w2 weight packed.
model.layers.1.attention.wqkv weight packed.
model.layers.1.attention.wo weight packed.
model.layers.1.feed_forward.w1 weight packed.
model.layers.1.feed_forward.w3 weight packed.
model.layers.1.feed_forward.w2 weight packed.
model.layers.2.attention.wqkv weight packed.
model.layers.2.attention.wo weight packed.
model.layers.2.feed_forward.w1 weight packed.
model.layers.2.feed_forward.w3 weight packed.
model.layers.2.feed_forward.w2 weight packed.
model.layers.3.attention.wqkv weight packed.
model.layers.3.attention.wo weight packed.
model.layers.3.feed_forward.w1 weight packed.
model.layers.3.feed_forward.w3 weight packed.
model.layers.3.feed_forward.w2 weight packed.
model.layers.4.attention.wqkv weight packed.
model.layers.4.attention.wo weight packed.
model.layers.4.feed_forward.w1 weight packed.
model.layers.4.feed_forward.w3 weight packed.
model.layers.4.feed_forward.w2 weight packed.
model.layers.5.attention.wqkv weight packed.
model.layers.5.attention.wo weight packed.
model.layers.5.feed_forward.w1 weight packed.
model.layers.5.feed_forward.w3 weight packed.
model.layers.5.feed_forward.w2 weight packed.
model.layers.6.attention.wqkv weight packed.
model.layers.6.attention.wo weight packed.
model.layers.6.feed_forward.w1 weight packed.
model.layers.6.feed_forward.w3 weight packed.
model.layers.6.feed_forward.w2 weight packed.
model.layers.7.attention.wqkv weight packed.
model.layers.7.attention.wo weight packed.
model.layers.7.feed_forward.w1 weight packed.
model.layers.7.feed_forward.w3 weight packed.
model.layers.7.feed_forward.w2 weight packed.
model.layers.8.attention.wqkv weight packed.
model.layers.8.attention.wo weight packed.
model.layers.8.feed_forward.w1 weight packed.
model.layers.8.feed_forward.w3 weight packed.
model.layers.8.feed_forward.w2 weight packed.
model.layers.9.attention.wqkv weight packed.
model.layers.9.attention.wo weight packed.
model.layers.9.feed_forward.w1 weight packed.
model.layers.9.feed_forward.w3 weight packed.
model.layers.9.feed_forward.w2 weight packed.
model.layers.10.attention.wqkv weight packed.
model.layers.10.attention.wo weight packed.
model.layers.10.feed_forward.w1 weight packed.
model.layers.10.feed_forward.w3 weight packed.
model.layers.10.feed_forward.w2 weight packed.
model.layers.11.attention.wqkv weight packed.
model.layers.11.attention.wo weight packed.
model.layers.11.feed_forward.w1 weight packed.
model.layers.11.feed_forward.w3 weight packed.
model.layers.11.feed_forward.w2 weight packed.
model.layers.12.attention.wqkv weight packed.
model.layers.12.attention.wo weight packed.
model.layers.12.feed_forward.w1 weight packed.
model.layers.12.feed_forward.w3 weight packed.
model.layers.12.feed_forward.w2 weight packed.
model.layers.13.attention.wqkv weight packed.
model.layers.13.attention.wo weight packed.
model.layers.13.feed_forward.w1 weight packed.
model.layers.13.feed_forward.w3 weight packed.
model.layers.13.feed_forward.w2 weight packed.
model.layers.14.attention.wqkv weight packed.
model.layers.14.attention.wo weight packed.
model.layers.14.feed_forward.w1 weight packed.
model.layers.14.feed_forward.w3 weight packed.
model.layers.14.feed_forward.w2 weight packed.
model.layers.15.attention.wqkv weight packed.
model.layers.15.attention.wo weight packed.
model.layers.15.feed_forward.w1 weight packed.
model.layers.15.feed_forward.w3 weight packed.
model.layers.15.feed_forward.w2 weight packed.
model.layers.16.attention.wqkv weight packed.
model.layers.16.attention.wo weight packed.
model.layers.16.feed_forward.w1 weight packed.
model.layers.16.feed_forward.w3 weight packed.
model.layers.16.feed_forward.w2 weight packed.
model.layers.17.attention.wqkv weight packed.
model.layers.17.attention.wo weight packed.
model.layers.17.feed_forward.w1 weight packed.
model.layers.17.feed_forward.w3 weight packed.
model.layers.17.feed_forward.w2 weight packed.
model.layers.18.attention.wqkv weight packed.
model.layers.18.attention.wo weight packed.
model.layers.18.feed_forward.w1 weight packed.
model.layers.18.feed_forward.w3 weight packed.
model.layers.18.feed_forward.w2 weight packed.
model.layers.19.attention.wqkv weight packed.
model.layers.19.attention.wo weight packed.
model.layers.19.feed_forward.w1 weight packed.
model.layers.19.feed_forward.w3 weight packed.
model.layers.19.feed_forward.w2 weight packed.
model.layers.20.attention.wqkv weight packed.
model.layers.20.attention.wo weight packed.
model.layers.20.feed_forward.w1 weight packed.
model.layers.20.feed_forward.w3 weight packed.
model.layers.20.feed_forward.w2 weight packed.
model.layers.21.attention.wqkv weight packed.
model.layers.21.attention.wo weight packed.
model.layers.21.feed_forward.w1 weight packed.
model.layers.21.feed_forward.w3 weight packed.
model.layers.21.feed_forward.w2 weight packed.
model.layers.22.attention.wqkv weight packed.
model.layers.22.attention.wo weight packed.
model.layers.22.feed_forward.w1 weight packed.
model.layers.22.feed_forward.w3 weight packed.
model.layers.22.feed_forward.w2 weight packed.
model.layers.23.attention.wqkv weight packed.
model.layers.23.attention.wo weight packed.
model.layers.23.feed_forward.w1 weight packed.
model.layers.23.feed_forward.w3 weight packed.
model.layers.23.feed_forward.w2 weight packed.
Let's look at the details of the quantized model:
And compare with the model before quantization:
The model is considerably smaller.
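To put a number on "considerably smaller", a small helper that sums the on-disk size of each directory works too (a sketch using the paths from above; the first path is the symlink created earlier):
import os

def dir_size_gb(path):
    """Total size of all files under path, in GiB."""
    total = 0
    for root, _, files in os.walk(path):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total / 1024 ** 3

print("fp16  :", round(dir_size_gb("/root/internlm2-chat-1_8b"), 2), "GiB")
print("w4a16 :", round(dir_size_gb("/root/internlm2-chat-1_8b-4bit"), 2), "GiB")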
Next, run the W4A16-quantized model with the chat feature:
lmdeploy chat /root/internlm2-chat-1_8b-4bit --model-format awq
The output is no different from before quantization, though this time I happened to stumble on an internal-use token:
More parameters of LMDeploy's lite feature can be viewed with the lmdeploy lite -h command:
(lmdeploy) root@intern-studio-160311:~# lmdeploy lite -h
usage: lmdeploy lite [-h] {auto_awq,calibrate,kv_qparams,smooth_quant} ...
Compressing and accelerating LLMs with lmdeploy.lite module
options:
-h, --help show this help message and exit
Commands:
This group has the following commands:
{auto_awq,calibrate,kv_qparams,smooth_quant}
auto_awq Perform weight quantization using AWQ algorithm.
calibrate Perform calibration on a given dataset.
kv_qparams Export key and value stats.
smooth_quant Perform w8a8 quantization using SmoothQuant.
Assignment checkpoint: set the maximum KV Cache ratio to 0.4, enable W4A16 quantization, and chat with the model from the command line
We completed the W4A16 quantization of the model above, so this time we just pass a cache-max-entry-count of 0.4 when running:
lmdeploy chat /root/internlm2-chat-1_8b-4bit --model-format awq --cache-max-entry-count 0.4
Deploying an LLM with LMDeploy
Setting Up an API Server: lmdeploy serve api_server
Launch lmdeploy as an API server:
lmdeploy serve api_server \
/root/internlm2-chat-1_8b-4bit \
--model-format awq \
--quant-policy 0 \
--server-name 0.0.0.0 \
--server-port 23333 \
--tp 1 \
--cache-max-entry-count 0.4
Here, model-format and quant-policy match the settings used for quantized inference; server-name and server-port are the IP address and port the API server listens on; tp is the degree of tensor parallelism (number of GPUs). Note that the assignment requires W4A16 quantization with the KV Cache ratio set to 0.4, so the model path, model format, and cache-max-entry-count all have to be set explicitly; the command above already sets them.
You can run lmdeploy serve api_server -h to see more arguments and usage (these look much the same as before...):
(lmdeploy) (base) root@intern-studio-160311:~# lmdeploy serve api_server -h
usage: lmdeploy serve api_server [-h] [--server-name SERVER_NAME] [--server-port SERVER_PORT]
[--allow-origins ALLOW_ORIGINS [ALLOW_ORIGINS ...]] [--allow-credentials]
[--allow-methods ALLOW_METHODS [ALLOW_METHODS ...]]
[--allow-headers ALLOW_HEADERS [ALLOW_HEADERS ...]] [--qos-config-path QOS_CONFIG_PATH]
[--backend {pytorch,turbomind}]
[--log-level {CRITICAL,FATAL,ERROR,WARN,WARNING,INFO,DEBUG,NOTSET}]
[--api-keys [API_KEYS ...]] [--ssl] [--meta-instruction META_INSTRUCTION]
[--chat-template CHAT_TEMPLATE] [--cap {completion,infilling,chat,python}]
[--adapters [ADAPTERS ...]] [--tp TP] [--model-name MODEL_NAME] [--session-len SESSION_LEN]
[--max-batch-size MAX_BATCH_SIZE] [--cache-max-entry-count CACHE_MAX_ENTRY_COUNT]
[--cache-block-seq-len CACHE_BLOCK_SEQ_LEN] [--model-format {hf,llama,awq}]
[--quant-policy QUANT_POLICY] [--rope-scaling-factor ROPE_SCALING_FACTOR]
model_path
Serve LLMs with restful api using fastapi.
positional arguments:
model_path The path of a model. it could be one of the following options: - i) a local directory path of a
turbomind model which is converted by `lmdeploy convert` command or download from ii) and iii). - ii)
the model_id of a lmdeploy-quantized model hosted inside a model repo on huggingface.co, such as
"internlm/internlm-chat-20b-4bit", "lmdeploy/llama2-chat-70b-4bit", etc. - iii) the model_id of a
model hosted inside a model repo on huggingface.co, such as "internlm/internlm-chat-7b",
"qwen/qwen-7b-chat ", "baichuan-inc/baichuan2-7b-chat" and so on. Type: str
options:
-h, --help show this help message and exit
--server-name SERVER_NAME
Host ip for serving. Default: 0.0.0.0. Type: str
--server-port SERVER_PORT
Server port. Default: 23333. Type: int
--allow-origins ALLOW_ORIGINS [ALLOW_ORIGINS ...]
A list of allowed origins for cors. Default: ['*']. Type: str
--allow-credentials Whether to allow credentials for cors. Default: False
--allow-methods ALLOW_METHODS [ALLOW_METHODS ...]
A list of allowed http methods for cors. Default: ['*']. Type: str
--allow-headers ALLOW_HEADERS [ALLOW_HEADERS ...]
A list of allowed http headers for cors. Default: ['*']. Type: str
--qos-config-path QOS_CONFIG_PATH
Qos policy config path. Default: . Type: str
--backend {pytorch,turbomind}
Set the inference backend. Default: turbomind. Type: str
--log-level {CRITICAL,FATAL,ERROR,WARN,WARNING,INFO,DEBUG,NOTSET}
Set the log level. Default: ERROR. Type: str
--api-keys [API_KEYS ...]
Optional list of space separated API keys. Default: None. Type: str
--ssl Enable SSL. Requires OS Environment variables 'SSL_KEYFILE' and 'SSL_CERTFILE'. Default: False
--meta-instruction META_INSTRUCTION
System prompt for ChatTemplateConfig. Deprecated. Please use --chat-template instead. Default: None.
Type: str
--chat-template CHAT_TEMPLATE
A JSON file or string that specifies the chat template configuration. Please refer to
https://lmdeploy.readthedocs.io/en/latest/advance/chat_template.html for the specification. Default:
None. Type: str
--cap {completion,infilling,chat,python}
The capability of a model. Deprecated. Please use --chat-template instead. Default: chat. Type: str
PyTorch engine arguments:
--adapters [ADAPTERS ...]
Used to set path(s) of lora adapter(s). One can input key-value pairs in xxx=yyy format for multiple
lora adapters. If only have one adapter, one can only input the path of the adapter.. Default: None.
Type: str
--tp TP GPU number used in tensor parallelism. Should be 2^n. Default: 1. Type: int
--model-name MODEL_NAME
The name of the to-be-deployed model, such as llama-7b, llama-13b, vicuna-7b and etc. You can run
`lmdeploy list` to get the supported model names. Default: None. Type: str
--session-len SESSION_LEN
The max session length of a sequence. Default: None. Type: int
--max-batch-size MAX_BATCH_SIZE
Maximum batch size. Default: 128. Type: int
--cache-max-entry-count CACHE_MAX_ENTRY_COUNT
The percentage of gpu memory occupied by the k/v cache. Default: 0.8. Type: float
--cache-block-seq-len CACHE_BLOCK_SEQ_LEN
The length of the token sequence in a k/v block. For Turbomind Engine, if the GPU compute capability
is >= 8.0, it should be a multiple of 32, otherwise it should be a multiple of 64. For Pytorch
Engine, if Lora Adapter is specified, this parameter will be ignored. Default: 64. Type: int
TurboMind engine arguments:
--tp TP GPU number used in tensor parallelism. Should be 2^n. Default: 1. Type: int
--model-name MODEL_NAME
The name of the to-be-deployed model, such as llama-7b, llama-13b, vicuna-7b and etc. You can run
`lmdeploy list` to get the supported model names. Default: None. Type: str
--session-len SESSION_LEN
The max session length of a sequence. Default: None. Type: int
--max-batch-size MAX_BATCH_SIZE
Maximum batch size. Default: 128. Type: int
--cache-max-entry-count CACHE_MAX_ENTRY_COUNT
The percentage of gpu memory occupied by the k/v cache. Default: 0.8. Type: float
--cache-block-seq-len CACHE_BLOCK_SEQ_LEN
The length of the token sequence in a k/v block. For Turbomind Engine, if the GPU compute capability
is >= 8.0, it should be a multiple of 32, otherwise it should be a multiple of 64. For Pytorch
Engine, if Lora Adapter is specified, this parameter will be ignored. Default: 64. Type: int
--model-format {hf,llama,awq}
The format of input model. `hf` meaning `hf_llama`, `llama` meaning `meta_llama`, `awq` meaning the
quantized model by awq. Default: None. Type: str
--quant-policy QUANT_POLICY
Whether to use kv int8. Default: 0. Type: int
--rope-scaling-factor ROPE_SCALING_FACTOR
Rope scaling factor. Default: 0.0. Type: float
You can also open http://localhost:23333 directly and view the Swagger documentation of the API for detailed usage instructions:
Of course, this first requires forwarding port 23333 to your local machine over SSH:
ssh -CNg -L 23333:127.0.0.1:23333 root@ssh.intern-ai.org.cn -p <你的ssh端口号>
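With the port forwarded (or directly on the dev machine), the server can also be called over plain HTTP. A minimal sketch against the OpenAI-compatible /v1/chat/completions route listed in the Swagger UI; the request fields follow the OpenAI schema, so treat the details as assumptions and check the Swagger docs for your version:
import requests

base = "http://localhost:23333"

# ask the server which model id it is serving
model_id = requests.get(f"{base}/v1/models").json()["data"][0]["id"]

resp = requests.post(
    f"{base}/v1/chat/completions",
    json={
        "model": model_id,
        "messages": [{"role": "user", "content": "请介绍一下你自己。"}],
    },
    timeout=60,
)
print(resp.json()["choices"][0]["message"]["content"])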
Connecting to the API Server from the Command Line: lmdeploy serve api_client
Run the command-line client:
lmdeploy serve api_client http://localhost:23333
Once it is running, you can chat with the model directly in the terminal:
The server-side output at this point is:
Resource usage:
Connecting to the Server via Gradio: lmdeploy serve gradio
Use Gradio as the frontend to start a web server. Open a new terminal on the remote dev machine and run:
lmdeploy serve gradio http://localhost:23333 \
--server-name 0.0.0.0 \
--server-port 6006
Then forward port 6006 locally:
ssh -CNg -L 6006:127.0.0.1:6006 root@ssh.intern-ai.org.cn -p <你的ssh端口号>
Open a browser and visit http://127.0.0.1:6006 to chat with the model:
The server-side output at this point is:
The Gradio-side output is:
Resource usage:
Python Code Integration: lmdeploy.pipeline
Create the file /root/pipeline_kv.py:
from lmdeploy import pipeline, TurbomindEngineConfig

# The assignment requires W4A16 quantization with the KV Cache ratio set to 0.4,
# so this differs slightly from the tutorial.
# Set the KV Cache ratio to 0.4
backend_config = TurbomindEngineConfig(cache_max_entry_count=0.4)

# Point to the W4A16-quantized model
model_path = '/root/internlm2-chat-1_8b-4bit'
pipe = pipeline(model_path, backend_config=backend_config)

# Run the pipeline. Passing a list of prompts lets lmdeploy batch the inputs and return multiple responses.
response = pipe(['Hi, pls intro yourself', '上海是', 'please provide three suggestions about time management'])
print(response)
Run it and the results come back quickly:
The answers are the same as before; there is no noticeable difference.
In the code above, the backend_config argument of pipeline() is optional.
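Besides the engine-level backend_config, per-request decoding can be tuned as well: lmdeploy exposes a GenerationConfig that pipe() accepts through gen_config. A sketch based on the 0.3.0 pipeline API; the sampling values below are illustrative only:
from lmdeploy import pipeline, GenerationConfig, TurbomindEngineConfig

pipe = pipeline('/root/internlm2-chat-1_8b-4bit',
                backend_config=TurbomindEngineConfig(cache_max_entry_count=0.4))

# gen_config controls decoding for this call only
gen_cfg = GenerationConfig(max_new_tokens=256, top_p=0.8, temperature=0.7)
print(pipe(['请介绍一下你自己。'], gen_config=gen_cfg)[0].text)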
Comparing Inference Speed
We already measured the speed before compression; now let's measure LMDeploy's inference speed. Create a Python file benchmark_lmdeploy.py with the following content:
import datetime
from lmdeploy import pipeline

pipe = pipeline('/root/internlm2-chat-1_8b')

# warmup
inp = "hello"
for i in range(5):
    print("Warm up...[{}/5]".format(i+1))
    response = pipe([inp])

# test speed
inp = "请介绍一下你自己。"
times = 10
total_words = 0
start_time = datetime.datetime.now()
for i in range(times):
    response = pipe([inp])
    total_words += len(response[0].text)
end_time = datetime.datetime.now()

delta_time = end_time - start_time
delta_time = delta_time.seconds + delta_time.microseconds / 1000000.0
speed = total_words / delta_time
print("Speed: {:.4f} words/s".format(speed))
The result is:
Compared with the Transformers baseline measured earlier, this comes out about 7x faster.
If we replace pipeline('/root/internlm2-chat-1_8b') with pipeline('/root/internlm2-chat-1_8b-4bit'), the result is:
Compared with the Transformers baseline, this comes out about 17x faster.
Quantized Deployment of llava with LMDeploy
Setting Up the Environment
Install the llava dependency into the lmdeploy environment we created:
pip install git+https://github.com/haotian-liu/LLaVA.git@4e2277a060da264c4f21b364c867cc622c945874
Running llava from Python
Create a new file /root/pipeline_llava.py with the following content:
from lmdeploy import pipeline
from lmdeploy.vl import load_image

# Off the dev machine, use: pipe = pipeline('liuhaotian/llava-v1.6-vicuna-7b')
pipe = pipeline('/share/new_models/liuhaotian/llava-v1.6-vicuna-7b')

# Download a tiger image from GitHub
image = load_image('https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg')
response = pipe(('describe this image', image))
print(response)
Run it:
Resource usage stays pinned at 80%:
However, when I switch to a different image, it produces no output at all at runtime.
Running llava with Gradio
Create a new file /root/pipeline_llava_gradio.py with the following content:
import gradio as gr
from lmdeploy import pipeline

# Off the dev machine, use: pipe = pipeline('liuhaotian/llava-v1.6-vicuna-7b')
pipe = pipeline('/share/new_models/liuhaotian/llava-v1.6-vicuna-7b')

def model(image, text):
    if image is None:
        return [(text, "请上传一张图片。")]
    else:
        response = pipe((text, image)).text
        return [(text, response)]

demo = gr.Interface(fn=model, inputs=[gr.Image(type="pil"), gr.Textbox()], outputs=gr.Chatbot())
demo.launch()
Likewise, forward port 7860 over SSH:
ssh -CNg -L 7860:127.0.0.1:7860 root@ssh.intern-ai.org.cn -p <你的ssh端口>
Then open http://127.0.0.1:7860 in a browser.
However, the model produces no output at all.
Let's switch back to the image from the tutorial:
And the output is normal again.
Well, that's that.