Hands-on LMDeploy Quantized Deployment of LLM & LVM – InternLM Practical Camp (Season 2) Lesson 5 Assignment

InternLM Practical Camp (Season 2) Lesson 5 Assignment

This page records all the steps for the Lesson 5 assignment of the second season of the practical camp. For background on model quantization and deployment, see the study notes.

Assignment Requirements

Basic Assignment

Complete the following tasks and record the process with screenshots:

  • Set up the lmdeploy runtime environment
  • Download the internlm-chat-1.8b model
  • Chat with the model from the command line

Advanced Assignment

Complete the following tasks and record the process with screenshots:

  • Set the maximum KV Cache memory ratio to 0.4, enable W4A16 quantization, and chat with the model from the command line.
  • Start lmdeploy as an API server with W4A16 quantization enabled and the KV Cache ratio set to 0.4, then chat with the model using both the command-line client and the Gradio web client.
  • With W4A16 quantization and the KV Cache ratio set to 0.4, run the internlm2-chat-1.8b model through Python code integration.
  • Run the llava visual multimodal model as a Gradio demo with LMDeploy.
  • Deploy the LMDeploy web demo to OpenXLab (the OpenXLab cuda 12.2 image is not ready yet; skip this for now and come back to it in a week).

Quantizing an LLM with LMDeploy

Creating the Environment

The cuda11.7-conda image has compatibility issues with newer versions of lmdeploy, so we need to create a new dev machine from the cuda12.2-conda image, choosing a 10% A100 GPU allocation.

Also, unlike the previous assignments, the environment built here with studio-conda is based on the pytorch-2.1.2 "prebuilt environment" rather than the earlier internlm-base. It is essentially an empty environment, which means that for local use you can simply create a fresh conda environment with python=3.10.

studio-conda -t lmdeploy -o pytorch-2.1.2
The full package list of the pytorch-2.1.2 environment:
# packages in environment at /root/.conda/envs/lmdeploy:
#
# Name                    Version                   Build  Channel
_libgcc_mutex             0.1                        main    defaults
_openmp_mutex             5.1                       1_gnu    defaults
asttokens                 2.4.1                    pypi_0    pypi
blas                      1.0                         mkl    defaults
brotli-python             1.0.9           py310h6a678d5_7    defaults
bzip2                     1.0.8                h5eee18b_5    defaults
ca-certificates           2024.3.11            h06a4308_0    defaults
certifi                   2024.2.2        py310h06a4308_0    defaults
charset-normalizer        2.0.4              pyhd3eb1b0_0    defaults
comm                      0.2.2                    pypi_0    pypi
cuda-cudart               12.1.105                      0    nvidia
cuda-cupti                12.1.105                      0    nvidia
cuda-libraries            12.1.0                        0    nvidia
cuda-nvrtc                12.1.105                      0    nvidia
cuda-nvtx                 12.1.105                      0    nvidia
cuda-opencl               12.4.127                      0    nvidia
cuda-runtime              12.1.0                        0    nvidia
debugpy                   1.8.1                    pypi_0    pypi
decorator                 5.1.1                    pypi_0    pypi
einops                    0.7.0                    pypi_0    pypi
exceptiongroup            1.2.0                    pypi_0    pypi
executing                 2.0.1                    pypi_0    pypi
ffmpeg                    4.3                  hf484d3e_0    pytorch
filelock                  3.13.1          py310h06a4308_0    defaults
freetype                  2.12.1               h4a9f257_0    defaults
gmp                       6.2.1                h295c915_3    defaults
gmpy2                     2.1.2           py310heeb90bb_0    defaults
gnutls                    3.6.15               he1e5248_0    defaults
idna                      3.4             py310h06a4308_0    defaults
intel-openmp              2023.1.0         hdb19cb5_46306    defaults
ipykernel                 6.29.4                   pypi_0    pypi
ipython                   8.23.0                   pypi_0    pypi
jedi                      0.19.1                   pypi_0    pypi
jinja2                    3.1.3           py310h06a4308_0    defaults
jpeg                      9e                   h5eee18b_1    defaults
jupyter-client            8.6.1                    pypi_0    pypi
jupyter-core              5.7.2                    pypi_0    pypi
lame                      3.100                h7b6447c_0    defaults
lcms2                     2.12                 h3be6417_0    defaults
ld_impl_linux-64          2.38                 h1181459_1    defaults
lerc                      3.0                  h295c915_0    defaults
libcublas                 12.1.0.26                     0    nvidia
libcufft                  11.0.2.4                      0    nvidia
libcufile                 1.9.0.20                      0    nvidia
libcurand                 10.3.5.119                    0    nvidia
libcusolver               11.4.4.55                     0    nvidia
libcusparse               12.0.2.55                     0    nvidia
libdeflate                1.17                 h5eee18b_1    defaults
libffi                    3.4.4                h6a678d5_0    defaults
libgcc-ng                 11.2.0               h1234567_1    defaults
libgomp                   11.2.0               h1234567_1    defaults
libiconv                  1.16                 h7f8727e_2    defaults
libidn2                   2.3.4                h5eee18b_0    defaults
libjpeg-turbo             2.0.0                h9bf148f_0    pytorch
libnpp                    12.0.2.50                     0    nvidia
libnvjitlink              12.1.105                      0    nvidia
libnvjpeg                 12.1.1.14                     0    nvidia
libpng                    1.6.39               h5eee18b_0    defaults
libstdcxx-ng              11.2.0               h1234567_1    defaults
libtasn1                  4.19.0               h5eee18b_0    defaults
libtiff                   4.5.1                h6a678d5_0    defaults
libunistring              0.9.10               h27cfd23_0    defaults
libuuid                   1.41.5               h5eee18b_0    defaults
libwebp-base              1.3.2                h5eee18b_0    defaults
llvm-openmp               14.0.6               h9e868ea_0    defaults
lz4-c                     1.9.4                h6a678d5_0    defaults
markupsafe                2.1.3           py310h5eee18b_0    defaults
matplotlib-inline         0.1.6                    pypi_0    pypi
mkl                       2023.1.0         h213fc3f_46344    defaults
mkl-service               2.4.0           py310h5eee18b_1    defaults
mkl_fft                   1.3.8           py310h5eee18b_0    defaults
mkl_random                1.2.4           py310hdb19cb5_0    defaults
mpc                       1.1.0                h10f8cd9_1    defaults
mpfr                      4.0.2                hb69a4c5_1    defaults
mpmath                    1.3.0           py310h06a4308_0    defaults
ncurses                   6.4                  h6a678d5_0    defaults
nest-asyncio              1.6.0                    pypi_0    pypi
nettle                    3.7.3                hbbd107a_1    defaults
networkx                  3.1             py310h06a4308_0    defaults
numpy                     1.26.4          py310h5f9d8c6_0    defaults
numpy-base                1.26.4          py310hb5e798b_0    defaults
openh264                  2.1.1                h4ff587b_0    defaults
openjpeg                  2.4.0                h3ad879b_0    defaults
openssl                   3.0.13               h7f8727e_0    defaults
packaging                 24.0                     pypi_0    pypi
parso                     0.8.4                    pypi_0    pypi
pexpect                   4.9.0                    pypi_0    pypi
pillow                    10.2.0          py310h5eee18b_0    defaults
pip                       23.3.1          py310h06a4308_0    defaults
platformdirs              4.2.0                    pypi_0    pypi
prompt-toolkit            3.0.43                   pypi_0    pypi
protobuf                  5.26.1                   pypi_0    pypi
psutil                    5.9.8                    pypi_0    pypi
ptyprocess                0.7.0                    pypi_0    pypi
pure-eval                 0.2.2                    pypi_0    pypi
pygments                  2.17.2                   pypi_0    pypi
pysocks                   1.7.1           py310h06a4308_0    defaults
python                    3.10.14              h955ad1f_0    defaults
python-dateutil           2.9.0.post0              pypi_0    pypi
pytorch                   2.1.2           py3.10_cuda12.1_cudnn8.9.2_0    pytorch
pytorch-cuda              12.1                 ha16c6d3_5    pytorch
pytorch-mutex             1.0                        cuda    pytorch
pyyaml                    6.0.1           py310h5eee18b_0    defaults
pyzmq                     25.1.2                   pypi_0    pypi
readline                  8.2                  h5eee18b_0    defaults
requests                  2.31.0          py310h06a4308_1    defaults
setuptools                68.2.2          py310h06a4308_0    defaults
six                       1.16.0                   pypi_0    pypi
sqlite                    3.41.2               h5eee18b_0    defaults
stack-data                0.6.3                    pypi_0    pypi
sympy                     1.12            py310h06a4308_0    defaults
tbb                       2021.8.0             hdb19cb5_0    defaults
tk                        8.6.12               h1ccaba5_0    defaults
torchaudio                2.1.2               py310_cu121    pytorch
torchtriton               2.1.0                     py310    pytorch
torchvision               0.16.2              py310_cu121    pytorch
tornado                   6.4                      pypi_0    pypi
traitlets                 5.14.2                   pypi_0    pypi
typing_extensions         4.9.0           py310h06a4308_1    defaults
tzdata                    2024a                h04d1e81_0    defaults
urllib3                   2.1.0           py310h06a4308_1    defaults
wcwidth                   0.2.13                   pypi_0    pypi
wheel                     0.41.2          py310h06a4308_0    defaults
xz                        5.4.6                h5eee18b_0    defaults
yaml                      0.2.5                h7b6447c_0    defaults
zlib                      1.2.13               h5eee18b_0    defaults
zstd                      1.5.5                hc292b87_0    defaults

Next, activate the newly created virtual environment and install lmdeploy 0.3.0; wait for the installation to finish.

conda activate lmdeploy
pip install lmdeploy[all]==0.3.0
The full package list of the lmdeploy environment:
# packages in environment at /root/.conda/envs/lmdeploy:
#
# Name                    Version                   Build  Channel
_libgcc_mutex             0.1                        main    defaults
_openmp_mutex             5.1                       1_gnu    defaults
accelerate                0.29.1                   pypi_0    pypi
addict                    2.4.0                    pypi_0    pypi
aiofiles                  23.2.1                   pypi_0    pypi
aiohttp                   3.9.3                    pypi_0    pypi
aiosignal                 1.3.1                    pypi_0    pypi
altair                    5.3.0                    pypi_0    pypi
annotated-types           0.6.0                    pypi_0    pypi
anyio                     4.3.0                    pypi_0    pypi
asttokens                 2.4.1                    pypi_0    pypi
async-timeout             4.0.3                    pypi_0    pypi
attrs                     23.2.0                   pypi_0    pypi
blas                      1.0                         mkl    defaults
brotli-python             1.0.9           py310h6a678d5_7    defaults
bzip2                     1.0.8                h5eee18b_5    defaults
ca-certificates           2024.3.11            h06a4308_0    defaults
certifi                   2024.2.2        py310h06a4308_0    defaults
charset-normalizer        2.0.4              pyhd3eb1b0_0    defaults
click                     8.1.7                    pypi_0    pypi
comm                      0.2.2                    pypi_0    pypi
contourpy                 1.2.1                    pypi_0    pypi
cuda-cudart               12.1.105                      0    nvidia
cuda-cupti                12.1.105                      0    nvidia
cuda-libraries            12.1.0                        0    nvidia
cuda-nvrtc                12.1.105                      0    nvidia
cuda-nvtx                 12.1.105                      0    nvidia
cuda-opencl               12.4.127                      0    nvidia
cuda-runtime              12.1.0                        0    nvidia
cycler                    0.12.1                   pypi_0    pypi
datasets                  2.18.0                   pypi_0    pypi
debugpy                   1.8.1                    pypi_0    pypi
decorator                 5.1.1                    pypi_0    pypi
dill                      0.3.8                    pypi_0    pypi
einops                    0.7.0                    pypi_0    pypi
exceptiongroup            1.2.0                    pypi_0    pypi
executing                 2.0.1                    pypi_0    pypi
fastapi                   0.110.1                  pypi_0    pypi
ffmpeg                    4.3                  hf484d3e_0    pytorch
ffmpy                     0.3.2                    pypi_0    pypi
filelock                  3.13.1          py310h06a4308_0    defaults
fire                      0.6.0                    pypi_0    pypi
fonttools                 4.51.0                   pypi_0    pypi
freetype                  2.12.1               h4a9f257_0    defaults
frozenlist                1.4.1                    pypi_0    pypi
fsspec                    2024.2.0                 pypi_0    pypi
gmp                       6.2.1                h295c915_3    defaults
gmpy2                     2.1.2           py310heeb90bb_0    defaults
gnutls                    3.6.15               he1e5248_0    defaults
gradio                    3.50.2                   pypi_0    pypi
gradio-client             0.6.1                    pypi_0    pypi
grpcio                    1.62.1                   pypi_0    pypi
h11                       0.14.0                   pypi_0    pypi
httpcore                  1.0.5                    pypi_0    pypi
httpx                     0.27.0                   pypi_0    pypi
huggingface-hub           0.22.2                   pypi_0    pypi
idna                      3.4             py310h06a4308_0    defaults
importlib-metadata        7.1.0                    pypi_0    pypi
importlib-resources       6.4.0                    pypi_0    pypi
intel-openmp              2023.1.0         hdb19cb5_46306    defaults
ipykernel                 6.29.4                   pypi_0    pypi
ipython                   8.23.0                   pypi_0    pypi
jedi                      0.19.1                   pypi_0    pypi
jinja2                    3.1.3           py310h06a4308_0    defaults
jpeg                      9e                   h5eee18b_1    defaults
jsonschema                4.21.1                   pypi_0    pypi
jsonschema-specifications 2023.12.1                pypi_0    pypi
jupyter-client            8.6.1                    pypi_0    pypi
jupyter-core              5.7.2                    pypi_0    pypi
kiwisolver                1.4.5                    pypi_0    pypi
lame                      3.100                h7b6447c_0    defaults
lcms2                     2.12                 h3be6417_0    defaults
ld_impl_linux-64          2.38                 h1181459_1    defaults
lerc                      3.0                  h295c915_0    defaults
libcublas                 12.1.0.26                     0    nvidia
libcufft                  11.0.2.4                      0    nvidia
libcufile                 1.9.0.20                      0    nvidia
libcurand                 10.3.5.119                    0    nvidia
libcusolver               11.4.4.55                     0    nvidia
libcusparse               12.0.2.55                     0    nvidia
libdeflate                1.17                 h5eee18b_1    defaults
libffi                    3.4.4                h6a678d5_0    defaults
libgcc-ng                 11.2.0               h1234567_1    defaults
libgomp                   11.2.0               h1234567_1    defaults
libiconv                  1.16                 h7f8727e_2    defaults
libidn2                   2.3.4                h5eee18b_0    defaults
libjpeg-turbo             2.0.0                h9bf148f_0    pytorch
libnpp                    12.0.2.50                     0    nvidia
libnvjitlink              12.1.105                      0    nvidia
libnvjpeg                 12.1.1.14                     0    nvidia
libpng                    1.6.39               h5eee18b_0    defaults
libstdcxx-ng              11.2.0               h1234567_1    defaults
libtasn1                  4.19.0               h5eee18b_0    defaults
libtiff                   4.5.1                h6a678d5_0    defaults
libunistring              0.9.10               h27cfd23_0    defaults
libuuid                   1.41.5               h5eee18b_0    defaults
libwebp-base              1.3.2                h5eee18b_0    defaults
llvm-openmp               14.0.6               h9e868ea_0    defaults
lmdeploy                  0.3.0                    pypi_0    pypi
lz4-c                     1.9.4                h6a678d5_0    defaults
markdown-it-py            3.0.0                    pypi_0    pypi
markupsafe                2.1.3           py310h5eee18b_0    defaults
matplotlib                3.8.4                    pypi_0    pypi
matplotlib-inline         0.1.6                    pypi_0    pypi
mdurl                     0.1.2                    pypi_0    pypi
mkl                       2023.1.0         h213fc3f_46344    defaults
mkl-service               2.4.0           py310h5eee18b_1    defaults
mkl_fft                   1.3.8           py310h5eee18b_0    defaults
mkl_random                1.2.4           py310hdb19cb5_0    defaults
mmengine-lite             0.10.3                   pypi_0    pypi
mpc                       1.1.0                h10f8cd9_1    defaults
mpfr                      4.0.2                hb69a4c5_1    defaults
mpmath                    1.3.0           py310h06a4308_0    defaults
multidict                 6.0.5                    pypi_0    pypi
multiprocess              0.70.16                  pypi_0    pypi
ncurses                   6.4                  h6a678d5_0    defaults
nest-asyncio              1.6.0                    pypi_0    pypi
nettle                    3.7.3                hbbd107a_1    defaults
networkx                  3.1             py310h06a4308_0    defaults
numpy                     1.26.4          py310h5f9d8c6_0    defaults
numpy-base                1.26.4          py310hb5e798b_0    defaults
nvidia-cublas-cu12        12.4.5.8                 pypi_0    pypi
nvidia-cuda-runtime-cu12  12.4.127                 pypi_0    pypi
nvidia-curand-cu12        10.3.5.147               pypi_0    pypi
nvidia-nccl-cu12          2.21.5                   pypi_0    pypi
openh264                  2.1.1                h4ff587b_0    defaults
openjpeg                  2.4.0                h3ad879b_0    defaults
openssl                   3.0.13               h7f8727e_0    defaults
orjson                    3.10.0                   pypi_0    pypi
packaging                 24.0                     pypi_0    pypi
pandas                    2.2.1                    pypi_0    pypi
parso                     0.8.4                    pypi_0    pypi
peft                      0.9.0                    pypi_0    pypi
pexpect                   4.9.0                    pypi_0    pypi
pillow                    10.2.0          py310h5eee18b_0    defaults
pip                       23.3.1          py310h06a4308_0    defaults
platformdirs              4.2.0                    pypi_0    pypi
prompt-toolkit            3.0.43                   pypi_0    pypi
protobuf                  4.25.3                   pypi_0    pypi
psutil                    5.9.8                    pypi_0    pypi
ptyprocess                0.7.0                    pypi_0    pypi
pure-eval                 0.2.2                    pypi_0    pypi
pyarrow                   15.0.2                   pypi_0    pypi
pyarrow-hotfix            0.6                      pypi_0    pypi
pybind11                  2.12.0                   pypi_0    pypi
pydantic                  2.6.4                    pypi_0    pypi
pydantic-core             2.16.3                   pypi_0    pypi
pydub                     0.25.1                   pypi_0    pypi
pygments                  2.17.2                   pypi_0    pypi
pynvml                    11.5.0                   pypi_0    pypi
pyparsing                 3.1.2                    pypi_0    pypi
pysocks                   1.7.1           py310h06a4308_0    defaults
python                    3.10.14              h955ad1f_0    defaults
python-dateutil           2.9.0.post0              pypi_0    pypi
python-multipart          0.0.9                    pypi_0    pypi
python-rapidjson          1.16                     pypi_0    pypi
pytorch                   2.1.2           py3.10_cuda12.1_cudnn8.9.2_0    pytorch
pytorch-cuda              12.1                 ha16c6d3_5    pytorch
pytorch-mutex             1.0                        cuda    pytorch
pytz                      2024.1                   pypi_0    pypi
pyyaml                    6.0.1           py310h5eee18b_0    defaults
pyzmq                     25.1.2                   pypi_0    pypi
readline                  8.2                  h5eee18b_0    defaults
referencing               0.34.0                   pypi_0    pypi
regex                     2023.12.25               pypi_0    pypi
requests                  2.31.0          py310h06a4308_1    defaults
rich                      13.7.1                   pypi_0    pypi
rpds-py                   0.18.0                   pypi_0    pypi
safetensors               0.4.2                    pypi_0    pypi
semantic-version          2.10.0                   pypi_0    pypi
sentencepiece             0.2.0                    pypi_0    pypi
setuptools                68.2.2          py310h06a4308_0    defaults
shortuuid                 1.0.13                   pypi_0    pypi
six                       1.16.0                   pypi_0    pypi
sniffio                   1.3.1                    pypi_0    pypi
sqlite                    3.41.2               h5eee18b_0    defaults
stack-data                0.6.3                    pypi_0    pypi
starlette                 0.37.2                   pypi_0    pypi
sympy                     1.12            py310h06a4308_0    defaults
tbb                       2021.8.0             hdb19cb5_0    defaults
termcolor                 2.4.0                    pypi_0    pypi
tiktoken                  0.6.0                    pypi_0    pypi
tk                        8.6.12               h1ccaba5_0    defaults
tokenizers                0.15.2                   pypi_0    pypi
tomli                     2.0.1                    pypi_0    pypi
toolz                     0.12.1                   pypi_0    pypi
torchaudio                2.1.2               py310_cu121    pytorch
torchtriton               2.1.0                     py310    pytorch
torchvision               0.16.2              py310_cu121    pytorch
tornado                   6.4                      pypi_0    pypi
tqdm                      4.66.2                   pypi_0    pypi
traitlets                 5.14.2                   pypi_0    pypi
transformers              4.38.2                   pypi_0    pypi
transformers-stream-generator 0.0.5                    pypi_0    pypi
tritonclient              2.44.0                   pypi_0    pypi
typing_extensions         4.9.0           py310h06a4308_1    defaults
tzdata                    2024.1                   pypi_0    pypi
urllib3                   2.1.0           py310h06a4308_1    defaults
uvicorn                   0.29.0                   pypi_0    pypi
wcwidth                   0.2.13                   pypi_0    pypi
websockets                11.0.3                   pypi_0    pypi
wheel                     0.41.2          py310h06a4308_0    defaults
xxhash                    3.4.1                    pypi_0    pypi
xz                        5.4.6                h5eee18b_0    defaults
yaml                      0.2.5                h7b6447c_0    defaults
yapf                      0.40.2                   pypi_0    pypi
yarl                      1.9.4                    pypi_0    pypi
zipp                      3.18.1                   pypi_0    pypi
zlib                      1.2.13               h5eee18b_0    defaults
zstd                      1.5.5                hc292b87_0    defaults

Downloading the Model

As before, create a symbolic link to the internlm2-chat-1_8b model. Note that the linked path differs slightly from the earlier hands-on sessions.

cd ~
ln -s /root/share/new_models/Shanghai_AI_Laboratory/internlm2-chat-1_8b /root/

Inference Test Before Quantization

This code is essentially the same as the first demo in the second assignment: load the model and then call model.chat() to get its output. The main purpose of this step is to check that the model responds normally and to get a feel for its inference speed.

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("/root/internlm2-chat-1_8b", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("/root/internlm2-chat-1_8b", torch_dtype=torch.float16, trust_remote_code=True).cuda()
model = model.eval()

inp = "hello"
print("[INPUT]", inp)
response, history = model.chat(tokenizer, inp, history=[])
print("[OUTPUT]", response)

inp = "please provide three suggestions about time management"
print("[INPUT]", inp)
response, history = model.chat(tokenizer, inp, history=history)
print("[OUTPUT]", response)

The output is as follows:

Run output

GPU usage:

GPU usage

With a small modification to the code above, we can measure the inference speed before compression:

# python benchmark_transformer.py
import torch
import datetime
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("/root/internlm2-chat-1_8b", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("/root/internlm2-chat-1_8b", torch_dtype=torch.float16, trust_remote_code=True).cuda()
model = model.eval()

# warmup
inp = "hello"
for i in range(5):
    print("Warm up...[{}/5]".format(i+1))
    response, history = model.chat(tokenizer, inp, history=[])

# test speed
inp = "请介绍一下你自己。"
times = 10
total_words = 0
start_time = datetime.datetime.now()
for i in range(times):
    response, history = model.chat(tokenizer, inp, history=history)
    total_words += len(response)
end_time = datetime.datetime.now()

delta_time = end_time - start_time
delta_time = delta_time.seconds + delta_time.microseconds / 1000000.0
speed = total_words / delta_time
print("Speed: {:.4f} words/s".format(speed))

Run speed

Remember this figure, 16.4092 words/s; we will compare against it later.

Chatting with the Model: lmdeploy chat

With the lmdeploy chat command you can talk to the model directly from the command line, and inference is noticeably faster:

lmdeploy chat /root/internlm2-chat-1_8b

Model output

GPU usage:

GPU usage

In lmdeploy you need to press Enter twice to submit input to the model; typing "exit" and pressing Enter twice quits the conversation.
This command has many options, which you can inspect with lmdeploy chat -h. The help output is:

usage: lmdeploy chat [-h] [--backend {pytorch,turbomind}] [--trust-remote-code]
                     [--meta-instruction META_INSTRUCTION] [--cap {completion,infilling,chat,python}]
                     [--adapters [ADAPTERS ...]] [--tp TP] [--model-name MODEL_NAME]
                     [--session-len SESSION_LEN] [--max-batch-size MAX_BATCH_SIZE]
                     [--cache-max-entry-count CACHE_MAX_ENTRY_COUNT] [--model-format {hf,llama,awq}]
                     [--quant-policy QUANT_POLICY] [--rope-scaling-factor ROPE_SCALING_FACTOR]
                     model_path

Chat with pytorch or turbomind engine.

positional arguments:
  model_path            The path of a model. it could be one of the following options: - i) a local
                        directory path of a turbomind model which is converted by `lmdeploy convert`
                        command or download from ii) and iii). - ii) the model_id of a lmdeploy-
                        quantized model hosted inside a model repo on huggingface.co, such as
                        "internlm/internlm-chat-20b-4bit", "lmdeploy/llama2-chat-70b-4bit", etc. -
                        iii) the model_id of a model hosted inside a model repo on huggingface.co,
                        such as "internlm/internlm-chat-7b", "qwen/qwen-7b-chat ", "baichuan-
                        inc/baichuan2-7b-chat" and so on. Type: str

options:
  -h, --help            show this help message and exit
  --backend {pytorch,turbomind}
                        Set the inference backend. Default: turbomind. Type: str
  --trust-remote-code   Trust remote code for loading hf models. Default: True
  --meta-instruction META_INSTRUCTION
                        System prompt for ChatTemplateConfig. Deprecated. Please use --chat-template
                        instead. Default: None. Type: str
  --cap {completion,infilling,chat,python}
                        The capability of a model. Deprecated. Please use --chat-template instead.
                        Default: chat. Type: str

PyTorch engine arguments:
  --adapters [ADAPTERS ...]
                        Used to set path(s) of lora adapter(s). One can input key-value pairs in
                        xxx=yyy format for multiple lora adapters. If only have one adapter, one can
                        only input the path of the adapter.. Default: None. Type: str
  --tp TP               GPU number used in tensor parallelism. Should be 2^n. Default: 1. Type: int
  --model-name MODEL_NAME
                        The name of the to-be-deployed model, such as llama-7b, llama-13b, vicuna-7b
                        and etc. You can run `lmdeploy list` to get the supported model names.
                        Default: None. Type: str
  --session-len SESSION_LEN
                        The max session length of a sequence. Default: None. Type: int
  --max-batch-size MAX_BATCH_SIZE
                        Maximum batch size. Default: 128. Type: int
  --cache-max-entry-count CACHE_MAX_ENTRY_COUNT
                        The percentage of gpu memory occupied by the k/v cache. Default: 0.8. Type:
                        float

TurboMind engine arguments:
  --tp TP               GPU number used in tensor parallelism. Should be 2^n. Default: 1. Type: int
  --model-name MODEL_NAME
                        The name of the to-be-deployed model, such as llama-7b, llama-13b, vicuna-7b
                        and etc. You can run `lmdeploy list` to get the supported model names.
                        Default: None. Type: str
  --session-len SESSION_LEN
                        The max session length of a sequence. Default: None. Type: int
  --max-batch-size MAX_BATCH_SIZE
                        Maximum batch size. Default: 128. Type: int
  --cache-max-entry-count CACHE_MAX_ENTRY_COUNT
                        The percentage of gpu memory occupied by the k/v cache. Default: 0.8. Type:
                        float
  --model-format {hf,llama,awq}
                        The format of input model. `hf` meaning `hf_llama`, `llama` meaning
                        `meta_llama`, `awq` meaning the quantized model by awq. Default: None. Type:
                        str
  --quant-policy QUANT_POLICY
                        Whether to use kv int8. Default: 0. Type: int
  --rope-scaling-factor ROPE_SCALING_FACTOR
                        Rope scaling factor. Default: 0.0. Type: float

Note that the --cache-max-entry-count parameter controls the maximum fraction of the remaining GPU memory that the KV cache may occupy, with a default of 0.8. This means the later assignment tasks only require changing this one parameter.

Model Quantization and Calibration: lmdeploy lite

Before quantizing, install the einops library:

pip install einops==0.7.0

Then run

lmdeploy lite auto_awq \
   /root/internlm2-chat-1_8b \
  --calib-dataset 'ptb' \
  --calib-samples 128 \
  --calib-seqlen 1024 \
  --w-bits 4 \
  --w-group-size 128 \
  --work-dir /root/internlm2-chat-1_8b-4bit

to complete the model quantization. This command uses the AWQ algorithm to quantize the model weights to 4 bits. The TurboMind inference engine provides efficient 4-bit CUDA kernels whose performance is more than 2.4x that of FP16. This step takes a very, very long time. When quantization finishes, the new HF-format model is saved to the /root/internlm2-chat-1_8b-4bit directory.

The full output of this command:
(lmdeploy) root@intern-studio-160311:~# lmdeploy lite auto_awq \
> /root/internlm2-chat-1_8b \
> --calib-dataset 'ptb' \
> --calib-samples 128 \
> --calib-seqlen 1024 \
> --w-bits 4 \
> --w-group-size 128 \
> --work-dir /root/internlm2-chat-1_8b-4bit

Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████| 2/2 [00:35<00:00, 17.60s/it]
Move model.tok_embeddings to GPU.
Move model.layers.0 to CPU.
Move model.layers.1 to CPU.
Move model.layers.2 to CPU.
Move model.layers.3 to CPU.
Move model.layers.4 to CPU.
Move model.layers.5 to CPU.
Move model.layers.6 to CPU.
Move model.layers.7 to CPU.
Move model.layers.8 to CPU.
Move model.layers.9 to CPU.
Move model.layers.10 to CPU.
Move model.layers.11 to CPU.
Move model.layers.12 to CPU.
Move model.layers.13 to CPU.
Move model.layers.14 to CPU.
Move model.layers.15 to CPU.
Move model.layers.16 to CPU.
Move model.layers.17 to CPU.
Move model.layers.18 to CPU.
Move model.layers.19 to CPU.
Move model.layers.20 to CPU.
Move model.layers.21 to CPU.
Move model.layers.22 to CPU.
Move model.layers.23 to CPU.
Move model.norm to GPU.
Move output to CPU.
Loading calibrate dataset ...
/root/.conda/envs/lmdeploy/lib/python3.10/site-packages/datasets/load.py:1461: FutureWarning: The repository for ptb_text_only contains custom code which must be executed to correctly load the dataset. You can inspect the repository content at <https://hf.co/datasets/ptb_text_only>
You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this dataset from the next major release of `datasets`.
  warnings.warn(
Downloading builder script: 6.50kB [00:00, 24.9MB/s]
Downloading readme: 4.21kB [00:00, 19.9MB/s]
Downloading data: 5.10MB [01:05, 78.1kB/s]
Downloading data: 400kB [00:00, 402kB/s]
Downloading data: 450kB [00:09, 48.3kB/s]
Generating train split: 100%|██████████████████████████████████████████████████| 42068/42068 [00:00<00:00, 88086.81 examples/s]
Generating test split: 100%|████████████████████████████████████████████████████| 3761/3761 [00:00<00:00, 100075.98 examples/s]
Generating validation split: 100%|██████████████████████████████████████████████| 3370/3370 [00:00<00:00, 100399.93 examples/s]
model.layers.0, samples: 128, max gpu memory: 2.25 GB
model.layers.1, samples: 128, max gpu memory: 2.75 GB
model.layers.2, samples: 128, max gpu memory: 2.75 GB
model.layers.3, samples: 128, max gpu memory: 2.75 GB
model.layers.4, samples: 128, max gpu memory: 2.75 GB
model.layers.5, samples: 128, max gpu memory: 2.75 GB
model.layers.6, samples: 128, max gpu memory: 2.75 GB
model.layers.7, samples: 128, max gpu memory: 2.75 GB
model.layers.8, samples: 128, max gpu memory: 2.75 GB
model.layers.9, samples: 128, max gpu memory: 2.75 GB
model.layers.10, samples: 128, max gpu memory: 2.75 GB
model.layers.11, samples: 128, max gpu memory: 2.75 GB
model.layers.12, samples: 128, max gpu memory: 2.75 GB
model.layers.13, samples: 128, max gpu memory: 2.75 GB
model.layers.14, samples: 128, max gpu memory: 2.75 GB
model.layers.15, samples: 128, max gpu memory: 2.75 GB
model.layers.16, samples: 128, max gpu memory: 2.75 GB
model.layers.17, samples: 128, max gpu memory: 2.75 GB
model.layers.18, samples: 128, max gpu memory: 2.75 GB
model.layers.19, samples: 128, max gpu memory: 2.75 GB
model.layers.20, samples: 128, max gpu memory: 2.75 GB
model.layers.21, samples: 128, max gpu memory: 2.75 GB
model.layers.22, samples: 128, max gpu memory: 2.75 GB
model.layers.23, samples: 128, max gpu memory: 2.75 GB
model.layers.0 smooth weight done.
model.layers.1 smooth weight done.
model.layers.2 smooth weight done.
model.layers.3 smooth weight done.
model.layers.4 smooth weight done.
model.layers.5 smooth weight done.
model.layers.6 smooth weight done.
model.layers.7 smooth weight done.
model.layers.8 smooth weight done.
model.layers.9 smooth weight done.
model.layers.10 smooth weight done.
model.layers.11 smooth weight done.
model.layers.12 smooth weight done.
model.layers.13 smooth weight done.
model.layers.14 smooth weight done.
model.layers.15 smooth weight done.
model.layers.16 smooth weight done.
model.layers.17 smooth weight done.
model.layers.18 smooth weight done.
model.layers.19 smooth weight done.
model.layers.20 smooth weight done.
model.layers.21 smooth weight done.
model.layers.22 smooth weight done.
model.layers.23 smooth weight done.
model.layers.0.attention.wqkv weight packed.
model.layers.0.attention.wo weight packed.
model.layers.0.feed_forward.w1 weight packed.
model.layers.0.feed_forward.w3 weight packed.
model.layers.0.feed_forward.w2 weight packed.
model.layers.1.attention.wqkv weight packed.
model.layers.1.attention.wo weight packed.
model.layers.1.feed_forward.w1 weight packed.
model.layers.1.feed_forward.w3 weight packed.
model.layers.1.feed_forward.w2 weight packed.
model.layers.2.attention.wqkv weight packed.
model.layers.2.attention.wo weight packed.
model.layers.2.feed_forward.w1 weight packed.
model.layers.2.feed_forward.w3 weight packed.
model.layers.2.feed_forward.w2 weight packed.
model.layers.3.attention.wqkv weight packed.
model.layers.3.attention.wo weight packed.
model.layers.3.feed_forward.w1 weight packed.
model.layers.3.feed_forward.w3 weight packed.
model.layers.3.feed_forward.w2 weight packed.
model.layers.4.attention.wqkv weight packed.
model.layers.4.attention.wo weight packed.
model.layers.4.feed_forward.w1 weight packed.
model.layers.4.feed_forward.w3 weight packed.
model.layers.4.feed_forward.w2 weight packed.
model.layers.5.attention.wqkv weight packed.
model.layers.5.attention.wo weight packed.
model.layers.5.feed_forward.w1 weight packed.
model.layers.5.feed_forward.w3 weight packed.
model.layers.5.feed_forward.w2 weight packed.
model.layers.6.attention.wqkv weight packed.
model.layers.6.attention.wo weight packed.
model.layers.6.feed_forward.w1 weight packed.
model.layers.6.feed_forward.w3 weight packed.
model.layers.6.feed_forward.w2 weight packed.
model.layers.7.attention.wqkv weight packed.
model.layers.7.attention.wo weight packed.
model.layers.7.feed_forward.w1 weight packed.
model.layers.7.feed_forward.w3 weight packed.
model.layers.7.feed_forward.w2 weight packed.
model.layers.8.attention.wqkv weight packed.
model.layers.8.attention.wo weight packed.
model.layers.8.feed_forward.w1 weight packed.
model.layers.8.feed_forward.w3 weight packed.
model.layers.8.feed_forward.w2 weight packed.
model.layers.9.attention.wqkv weight packed.
model.layers.9.attention.wo weight packed.
model.layers.9.feed_forward.w1 weight packed.
model.layers.9.feed_forward.w3 weight packed.
model.layers.9.feed_forward.w2 weight packed.
model.layers.10.attention.wqkv weight packed.
model.layers.10.attention.wo weight packed.
model.layers.10.feed_forward.w1 weight packed.
model.layers.10.feed_forward.w3 weight packed.
model.layers.10.feed_forward.w2 weight packed.
model.layers.11.attention.wqkv weight packed.
model.layers.11.attention.wo weight packed.
model.layers.11.feed_forward.w1 weight packed.
model.layers.11.feed_forward.w3 weight packed.
model.layers.11.feed_forward.w2 weight packed.
model.layers.12.attention.wqkv weight packed.
model.layers.12.attention.wo weight packed.
model.layers.12.feed_forward.w1 weight packed.
model.layers.12.feed_forward.w3 weight packed.
model.layers.12.feed_forward.w2 weight packed.
model.layers.13.attention.wqkv weight packed.
model.layers.13.attention.wo weight packed.
model.layers.13.feed_forward.w1 weight packed.
model.layers.13.feed_forward.w3 weight packed.
model.layers.13.feed_forward.w2 weight packed.
model.layers.14.attention.wqkv weight packed.
model.layers.14.attention.wo weight packed.
model.layers.14.feed_forward.w1 weight packed.
model.layers.14.feed_forward.w3 weight packed.
model.layers.14.feed_forward.w2 weight packed.
model.layers.15.attention.wqkv weight packed.
model.layers.15.attention.wo weight packed.
model.layers.15.feed_forward.w1 weight packed.
model.layers.15.feed_forward.w3 weight packed.
model.layers.15.feed_forward.w2 weight packed.
model.layers.16.attention.wqkv weight packed.
model.layers.16.attention.wo weight packed.
model.layers.16.feed_forward.w1 weight packed.
model.layers.16.feed_forward.w3 weight packed.
model.layers.16.feed_forward.w2 weight packed.
model.layers.17.attention.wqkv weight packed.
model.layers.17.attention.wo weight packed.
model.layers.17.feed_forward.w1 weight packed.
model.layers.17.feed_forward.w3 weight packed.
model.layers.17.feed_forward.w2 weight packed.
model.layers.18.attention.wqkv weight packed.
model.layers.18.attention.wo weight packed.
model.layers.18.feed_forward.w1 weight packed.
model.layers.18.feed_forward.w3 weight packed.
model.layers.18.feed_forward.w2 weight packed.
model.layers.19.attention.wqkv weight packed.
model.layers.19.attention.wo weight packed.
model.layers.19.feed_forward.w1 weight packed.
model.layers.19.feed_forward.w3 weight packed.
model.layers.19.feed_forward.w2 weight packed.
model.layers.20.attention.wqkv weight packed.
model.layers.20.attention.wo weight packed.
model.layers.20.feed_forward.w1 weight packed.
model.layers.20.feed_forward.w3 weight packed.
model.layers.20.feed_forward.w2 weight packed.
model.layers.21.attention.wqkv weight packed.
model.layers.21.attention.wo weight packed.
model.layers.21.feed_forward.w1 weight packed.
model.layers.21.feed_forward.w3 weight packed.
model.layers.21.feed_forward.w2 weight packed.
model.layers.22.attention.wqkv weight packed.
model.layers.22.attention.wo weight packed.
model.layers.22.feed_forward.w1 weight packed.
model.layers.22.feed_forward.w3 weight packed.
model.layers.22.feed_forward.w2 weight packed.
model.layers.23.attention.wqkv weight packed.
model.layers.23.attention.wo weight packed.
model.layers.23.feed_forward.w1 weight packed.
model.layers.23.feed_forward.w3 weight packed.
model.layers.23.feed_forward.w2 weight packed.

Let's take a look at the details of the quantized model:

Compared with the model before quantization:

Model before quantization

The model is noticeably smaller.
Next, run the W4A16-quantized model with the chat command.

lmdeploy chat /root/internlm2-chat-1_8b-4bit --model-format awq

The output is no different from the unquantized model, although this time I accidentally triggered an internal-use token:

Output

More parameters of LMDeploy's lite functionality can be viewed with the lmdeploy lite -h command.

(lmdeploy) root@intern-studio-160311:~# lmdeploy lite -h
usage: lmdeploy lite [-h] {auto_awq,calibrate,kv_qparams,smooth_quant} ...

Compressing and accelerating LLMs with lmdeploy.lite module

options:
  -h, --help            show this help message and exit

Commands:
  This group has the following commands:

  {auto_awq,calibrate,kv_qparams,smooth_quant}
    auto_awq            Perform weight quantization using AWQ algorithm.
    calibrate           Perform calibration on a given dataset.
    kv_qparams          Export key and value stats.
    smooth_quant        Perform w8a8 quantization using SmoothQuant.

Assignment Checkpoint: set the maximum KV Cache ratio to 0.4, enable W4A16 quantization, and chat with the model from the command line

We have already completed the model's W4A16 quantization above, so this time we just specify a cache-max-entry-count of 0.4 when running.

lmdeploy chat /root/internlm2-chat-1_8b-4bit --model-format awq --cache-max-entry-count 0.4

Model output

GPU memory usage

Deploying an LLM with LMDeploy

Setting Up an API Server: lmdeploy serve api_server

Start lmdeploy as an API server:

lmdeploy serve api_server \
    /root/internlm2-chat-1_8b-4bit \
    --model-format awq \
    --quant-policy 0 \
    --server-name 0.0.0.0 \
    --server-port 23333 \
    --tp 1 \
    --cache-max-entry-count 0.4

Here, model-format and quant-policy are the same parameters used for quantized inference; server-name and server-port are the IP and port the API server listens on; tp is the degree of tensor parallelism (the number of GPUs). Note that the assignment requires W4A16 quantization with the KV Cache ratio set to 0.4, so the model path, model format, and cache-max-entry-count all have to be set explicitly; the command above already has them set.

You can run lmdeploy serve api_server -h to see more parameters and usage (it looks much the same as before...):

(lmdeploy) (base) root@intern-studio-160311:~# lmdeploy serve api_server -h
usage: lmdeploy serve api_server [-h] [--server-name SERVER_NAME] [--server-port SERVER_PORT]
                                 [--allow-origins ALLOW_ORIGINS [ALLOW_ORIGINS ...]] [--allow-credentials]
                                 [--allow-methods ALLOW_METHODS [ALLOW_METHODS ...]]
                                 [--allow-headers ALLOW_HEADERS [ALLOW_HEADERS ...]] [--qos-config-path QOS_CONFIG_PATH]
                                 [--backend {pytorch,turbomind}]
                                 [--log-level {CRITICAL,FATAL,ERROR,WARN,WARNING,INFO,DEBUG,NOTSET}]
                                 [--api-keys [API_KEYS ...]] [--ssl] [--meta-instruction META_INSTRUCTION]
                                 [--chat-template CHAT_TEMPLATE] [--cap {completion,infilling,chat,python}]
                                 [--adapters [ADAPTERS ...]] [--tp TP] [--model-name MODEL_NAME] [--session-len SESSION_LEN]
                                 [--max-batch-size MAX_BATCH_SIZE] [--cache-max-entry-count CACHE_MAX_ENTRY_COUNT]
                                 [--cache-block-seq-len CACHE_BLOCK_SEQ_LEN] [--model-format {hf,llama,awq}]
                                 [--quant-policy QUANT_POLICY] [--rope-scaling-factor ROPE_SCALING_FACTOR]
                                 model_path

Serve LLMs with restful api using fastapi.

positional arguments:
  model_path            The path of a model. it could be one of the following options: - i) a local directory path of a
                        turbomind model which is converted by `lmdeploy convert` command or download from ii) and iii). - ii)
                        the model_id of a lmdeploy-quantized model hosted inside a model repo on huggingface.co, such as
                        "internlm/internlm-chat-20b-4bit", "lmdeploy/llama2-chat-70b-4bit", etc. - iii) the model_id of a
                        model hosted inside a model repo on huggingface.co, such as "internlm/internlm-chat-7b",
                        "qwen/qwen-7b-chat ", "baichuan-inc/baichuan2-7b-chat" and so on. Type: str

options:
  -h, --help            show this help message and exit
  --server-name SERVER_NAME
                        Host ip for serving. Default: 0.0.0.0. Type: str
  --server-port SERVER_PORT
                        Server port. Default: 23333. Type: int
  --allow-origins ALLOW_ORIGINS [ALLOW_ORIGINS ...]
                        A list of allowed origins for cors. Default: ['*']. Type: str
  --allow-credentials   Whether to allow credentials for cors. Default: False
  --allow-methods ALLOW_METHODS [ALLOW_METHODS ...]
                        A list of allowed http methods for cors. Default: ['*']. Type: str
  --allow-headers ALLOW_HEADERS [ALLOW_HEADERS ...]
                        A list of allowed http headers for cors. Default: ['*']. Type: str
  --qos-config-path QOS_CONFIG_PATH
                        Qos policy config path. Default: . Type: str
  --backend {pytorch,turbomind}
                        Set the inference backend. Default: turbomind. Type: str
  --log-level {CRITICAL,FATAL,ERROR,WARN,WARNING,INFO,DEBUG,NOTSET}
                        Set the log level. Default: ERROR. Type: str
  --api-keys [API_KEYS ...]
                        Optional list of space separated API keys. Default: None. Type: str
  --ssl                 Enable SSL. Requires OS Environment variables 'SSL_KEYFILE' and 'SSL_CERTFILE'. Default: False
  --meta-instruction META_INSTRUCTION
                        System prompt for ChatTemplateConfig. Deprecated. Please use --chat-template instead. Default: None.
                        Type: str
  --chat-template CHAT_TEMPLATE
                        A JSON file or string that specifies the chat template configuration. Please refer to
                        https://lmdeploy.readthedocs.io/en/latest/advance/chat_template.html for the specification. Default:
                        None. Type: str
  --cap {completion,infilling,chat,python}
                        The capability of a model. Deprecated. Please use --chat-template instead. Default: chat. Type: str

PyTorch engine arguments:
  --adapters [ADAPTERS ...]
                        Used to set path(s) of lora adapter(s). One can input key-value pairs in xxx=yyy format for multiple
                        lora adapters. If only have one adapter, one can only input the path of the adapter.. Default: None.
                        Type: str
  --tp TP               GPU number used in tensor parallelism. Should be 2^n. Default: 1. Type: int
  --model-name MODEL_NAME
                        The name of the to-be-deployed model, such as llama-7b, llama-13b, vicuna-7b and etc. You can run
                        `lmdeploy list` to get the supported model names. Default: None. Type: str
  --session-len SESSION_LEN
                        The max session length of a sequence. Default: None. Type: int
  --max-batch-size MAX_BATCH_SIZE
                        Maximum batch size. Default: 128. Type: int
  --cache-max-entry-count CACHE_MAX_ENTRY_COUNT
                        The percentage of gpu memory occupied by the k/v cache. Default: 0.8. Type: float
  --cache-block-seq-len CACHE_BLOCK_SEQ_LEN
                        The length of the token sequence in a k/v block. For Turbomind Engine, if the GPU compute capability
                        is >= 8.0, it should be a multiple of 32, otherwise it should be a multiple of 64. For Pytorch
                        Engine, if Lora Adapter is specified, this parameter will be ignored. Default: 64. Type: int

TurboMind engine arguments:
  --tp TP               GPU number used in tensor parallelism. Should be 2^n. Default: 1. Type: int
  --model-name MODEL_NAME
                        The name of the to-be-deployed model, such as llama-7b, llama-13b, vicuna-7b and etc. You can run
                        `lmdeploy list` to get the supported model names. Default: None. Type: str
  --session-len SESSION_LEN
                        The max session length of a sequence. Default: None. Type: int
  --max-batch-size MAX_BATCH_SIZE
                        Maximum batch size. Default: 128. Type: int
  --cache-max-entry-count CACHE_MAX_ENTRY_COUNT
                        The percentage of gpu memory occupied by the k/v cache. Default: 0.8. Type: float
  --cache-block-seq-len CACHE_BLOCK_SEQ_LEN
                        The length of the token sequence in a k/v block. For Turbomind Engine, if the GPU compute capability
                        is >= 8.0, it should be a multiple of 32, otherwise it should be a multiple of 64. For Pytorch
                        Engine, if Lora Adapter is specified, this parameter will be ignored. Default: 64. Type: int
  --model-format {hf,llama,awq}
                        The format of input model. `hf` meaning `hf_llama`, `llama` meaning `meta_llama`, `awq` meaning the
                        quantized model by awq. Default: None. Type: str
  --quant-policy QUANT_POLICY
                        Whether to use kv int8. Default: 0. Type: int
  --rope-scaling-factor ROPE_SCALING_FACTOR
                        Rope scaling factor. Default: 0.0. Type: float

You can also open http://localhost:23333 directly to browse the Swagger documentation of the API for detailed usage instructions:

Of course, this first requires forwarding port 23333 to your local machine over SSH:

ssh -CNg -L 23333:127.0.0.1:23333 root@ssh.intern-ai.org.cn -p <your ssh port>
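As a reference (my own addition, not part of the tutorial), here is a minimal sketch of calling the server directly over HTTP via the OpenAI-compatible routes listed on the Swagger page. It assumes the /v1/models and /v1/chat/completions endpoints exposed by lmdeploy 0.3.0's api_server and that the requests library is installed:

# http_client_demo.py - hypothetical example, not required by the assignment
import requests

base_url = "http://localhost:23333"

# Ask the server which model it is serving instead of hard-coding a name
model_id = requests.get(f"{base_url}/v1/models").json()["data"][0]["id"]

payload = {
    "model": model_id,
    "messages": [{"role": "user", "content": "please provide three suggestions about time management"}],
}
resp = requests.post(f"{base_url}/v1/chat/completions", json=payload)
print(resp.json()["choices"][0]["message"]["content"])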

Connecting to the API Server from the Command Line: lmdeploy serve api_client

Run the command-line client:

lmdeploy serve api_client http://localhost:23333

Once it is running, you can chat with the model directly in the terminal window:

Command-line chat

The server-side output at this point:

Server output

Resource usage:

Resource usage

Connecting to the Server Through Gradio: lmdeploy serve gradio

Use Gradio as the frontend and start a web server. Open a new terminal on the remote dev machine and run:

lmdeploy serve gradio http://localhost:23333 \
    --server-name 0.0.0.0 \
    --server-port 6006

Then forward port 6006 locally:

ssh -CNg -L 6006:127.0.0.1:6006 root@ssh.intern-ai.org.cn -p <your ssh port>

Open a browser and visit http://127.0.0.1:6006 to chat with the model:


The server-side output at this point:

The Gradio-side output:

Resource usage:

Python Code Integration: lmdeploy.pipeline

Create the script /root/pipeline_kv.py:

from lmdeploy import pipeline, TurbomindEngineConfig

# The assignment requires W4A16 quantization with the KV Cache ratio set to 0.4, so this differs slightly from the tutorial
# Set the KV Cache ratio to 0.4
backend_config = TurbomindEngineConfig(cache_max_entry_count=0.4)
# Point to the W4A16-quantized model
model_path = '/root/internlm2-chat-1_8b-4bit'

pipe = pipeline(model_path, backend_config=backend_config)
# Run the pipeline. Passing a list of prompts lets lmdeploy infer them together and return multiple results
response = pipe(['Hi, pls intro yourself', '上海是', 'please provide three suggestions about time management'])
print(response)

Run it and get the results; it is fast:

The answers are the same as before; there is no obvious difference.

In the code above, the backend_config argument of pipeline() is optional.
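As a side note (my own addition, not required by the assignment), sampling behaviour can also be tuned per call. The sketch below assumes the GenerationConfig class and the gen_config argument documented for lmdeploy 0.3.x; the sampling values are arbitrary examples:

from lmdeploy import pipeline, GenerationConfig, TurbomindEngineConfig

backend_config = TurbomindEngineConfig(cache_max_entry_count=0.4)
# Example sampling settings, chosen for illustration only
gen_config = GenerationConfig(top_p=0.8, temperature=0.7, max_new_tokens=256)

pipe = pipeline('/root/internlm2-chat-1_8b-4bit', backend_config=backend_config)
response = pipe(['Hi, pls intro yourself'], gen_config=gen_config)
print(response)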

Inference Speed Comparison

We measured the pre-compression speed earlier; now let's measure LMDeploy's inference speed. Create a new Python file benchmark_lmdeploy.py with the following content:

import datetime
from lmdeploy import pipeline

pipe = pipeline('/root/internlm2-chat-1_8b')

# warmup
inp = "hello"
for i in range(5):
    print("Warm up...[{}/5]".format(i+1))
    response = pipe([inp])

# test speed
inp = "请介绍一下你自己。"
times = 10
total_words = 0
start_time = datetime.datetime.now()
for i in range(times):
    response = pipe([inp])
    total_words += len(response[0].text)
end_time = datetime.datetime.now()

delta_time = end_time - start_time
delta_time = delta_time.seconds + delta_time.microseconds / 1000000.0
speed = total_words / delta_time
print("Speed: {:.4f} words/s".format(speed))

The result:

Compared with the earlier Transformers measurement, this comes out about 7x faster.

If pipeline('/root/internlm2-chat-1_8b') is replaced with pipeline('/root/internlm2-chat-1_8b-4bit'), the result is:

Compared with the same baseline, this is about 17x faster.

Quantized Deployment of llava with LMDeploy

Setting Up the Environment

Install the llava dependencies in the lmdeploy environment we created:

pip install git+https://github.com/haotian-liu/LLaVA.git@4e2277a060da264c4f21b364c867cc622c945874

Running llava from Python

Create a new script /root/pipeline_llava.py with the following content:

from lmdeploy import pipeline
from lmdeploy.vl import load_image

# pipe = pipeline('liuhaotian/llava-v1.6-vicuna-7b')  # use this line when running outside the dev machine
pipe = pipeline('/share/new_models/liuhaotian/llava-v1.6-vicuna-7b')
# Download a tiger image from GitHub
image = load_image('https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg')
response = pipe(('describe this image', image))
print(response)

Run it:

Resource usage is pinned at 80%:

However, when I swap in a different image, the script produces no output at all at runtime.
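For reference, here is a minimal sketch of how a different, locally stored image could be fed to the same pipeline (my own example; it assumes lmdeploy.vl.load_image also accepts a local file path, and the path below is a hypothetical placeholder):

from lmdeploy import pipeline
from lmdeploy.vl import load_image

pipe = pipeline('/share/new_models/liuhaotian/llava-v1.6-vicuna-7b')

# /root/my_test_image.jpg is a placeholder path for a locally stored test image
image = load_image('/root/my_test_image.jpg')
response = pipe(('describe this image', image))
print(response)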

Running llava Through Gradio

Create the script /root/pipeline_llava_gradio.py with the following content:

import gradio as gr
from lmdeploy import pipeline

# pipe = pipeline('liuhaotian/llava-v1.6-vicuna-7b')  # use this line when running outside the dev machine
pipe = pipeline('/share/new_models/liuhaotian/llava-v1.6-vicuna-7b')

def model(image, text):
    if image is None:
        return [(text, "请上传一张图片。")]
    else:
        response = pipe((text, image)).text
        return [(text, response)]

demo = gr.Interface(fn=model, inputs=[gr.Image(type="pil"), gr.Textbox()], outputs=gr.Chatbot())
demo.launch()   

Likewise, forward port 7860 over SSH:

ssh -CNg -L 7860:127.0.0.1:7860 root@ssh.intern-ai.org.cn -p <your ssh port>

Then visit http://127.0.0.1:7860 in a browser.

However, the model produced no output at all.

Let's switch back to the image from the tutorial and try again:

Now the output is normal again.

So be it, then.

posted @ 2024-04-09 18:20 vanilla阿草