InternLM Open-Source LLM Training Camp - Lecture 5 Notes
1. Background on LLM deployment
a. Challenges of deploying large models
- Hardware demands: large storage/memory footprint
- Inference: token generation speed, dynamic shapes, memory management
- Serving: throughput under high concurrency, per-user response time
b. Technical approaches:
- Model parallelism, quantization
- Transformer optimizations, inference optimizations
2. Introduction to LMDeploy
a. A full-pipeline solution for NVIDIA GPUs, covering lightweight compression, inference, and serving; it does not yet cover mobile devices. Its core is the TurboMind inference engine.
b. Why quantize?
- Lower memory usage: weights and KV cache
- Faster inference
- At inference time, decoder-only models are bottlenecked by memory access; measurements on an A100 show the bottleneck shifting from memory access to compute around batch_size=128 (see the back-of-the-envelope KV-cache estimate below)
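A rough illustration of the KV-cache pressure (my own arithmetic, assuming an InternLM-7B-like decoder with 32 layers, hidden size 4096, and an fp16 cache): each generated token stores K and V in every layer, i.e. 2 × 32 × 4096 × 2 bytes = 512 KB per token. A single 2048-token sequence therefore holds about 1 GB of KV cache, and a batch of 128 such sequences about 128 GB, which is why lowering KV-cache precision and paging it matter so much.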
c. The TurboMind inference engine
- Continuous batch inference
- Stateful inference
- Blocked K/V cache
- Highly optimized CUDA kernels
d. Inference serving via an API server
3. Hands-on
3.1 Environment setup
Check GPU resource usage:
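e.g. with NVIDIA's standard driver utility:

nvidia-smi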
Monitor GPU usage in real time:
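e.g. by polling the same utility every second:

watch -n 1 nvidia-smi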
Create a virtual environment:
/root/share/install_conda_env_internlm_base.sh lmdeploy
List the virtual environments:
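conda env list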
Activate the lmdeploy virtual environment:
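conda activate lmdeploy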
Check PyTorch from Python:
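A minimal one-liner check, e.g.:

python -c "import torch; print(torch.__version__, torch.cuda.is_available())"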
Install the latest stable lmdeploy:
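e.g. (the [all] extra pulls in the serving dependencies; the course may pin a specific v0.x release, so adjust the version as needed):

pip install 'lmdeploy[all]'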
Verify that lmdeploy is installed:
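e.g. by importing it and printing its version:

python -c "import lmdeploy; print(lmdeploy.__version__)"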
3.2 Serving deployment
a. Convert the model to TurboMind format:
Online conversion:
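With online conversion, lmdeploy loads an HF-format model and converts it on the fly at startup; a sketch (the model ID is illustrative, and the exact CLI shape varies across lmdeploy versions):

lmdeploy chat turbomind internlm/internlm-chat-7b --model-name internlm-chat-7b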
Offline conversion (convert once on disk, producing a ./workspace directory):
(lmdeploy) root@intern-studio-069640:~/ch5# lmdeploy convert internlm-chat-7b /root/share/model_repos/internlm-chat-7b
create workspace in directory ./workspace
copy triton model templates from "/root/.conda/envs/lmdeploy/lib/python3.10/site-packages/lmdeploy/serve/turbomind/triton_models" to "./workspace/triton_models"
copy service_docker_up.sh from "/root/.conda/envs/lmdeploy/lib/python3.10/site-packages/lmdeploy/serve/turbomind/service_docker_up.sh" to "./workspace"
model_name            internlm-chat-7b
model_format          None
inferred_model_format hf
model_path            /root/share/model_repos/internlm-chat-7b
tokenizer_path        /root/share/model_repos/internlm-chat-7b/tokenizer.model
output_format         fp16
WARNING: Can not find tokenizer.json. It may take long time to initialize the tokenizer.
*** splitting layers.0.attention.w_qkv.weight, shape=torch.Size([4096, 12288]), split_dim=-1, tp=1
*** splitting layers.0.attention.wo.weight, shape=torch.Size([4096, 4096]), split_dim=0, tp=1
*** splitting layers.0.attention.w_qkv.bias, shape=torch.Size([1, 12288]), split_dim=-1, tp=1
### copying layers.0.attention.wo.bias, shape=torch.Size([4096])
*** splitting layers.0.feed_forward.w1.weight, shape=torch.Size([4096, 11008]), split_dim=-1, tp=1
*** splitting layers.0.feed_forward.w3.weight, shape=torch.Size([4096, 11008]), split_dim=-1, tp=1
*** splitting layers.0.feed_forward.w2.weight, shape=torch.Size([11008, 4096]), split_dim=0, tp=1
[... identical splitting/copying messages repeat for layers 1 through 31 ...]
Convert to turbomind format: 100%|████████████████████| 32/32 [00:40<00:00, 1.28s/it]
(lmdeploy) root@intern-studio-069640:~/ch5#
Local command-line inference:
First try a local chat session:
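Chatting against the converted ./workspace from the command line, e.g.:

lmdeploy chat turbomind ./workspace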
Serve it with lmdeploy:
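A sketch of launching the RESTful API server on the converted workspace (the port is illustrative, and flag spelling differs between lmdeploy versions; some releases use underscores):

lmdeploy serve api_server ./workspace --server-name 0.0.0.0 --server-port 23333 --tp 1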
Set up a local port forward, and the API docs can then be browsed on a local MacBook:
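A typical SSH local forward; the jump host follows the InternStudio setup and the SSH port is a placeholder for your own value:

ssh -CNg -L 23333:127.0.0.1:23333 root@ssh.intern-ai.org.cn -p <your SSH port>

Then open http://localhost:23333 locally to view the FastAPI docs.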
Web demo:
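A sketch of serving a Gradio UI on top of the running API server (URL and port are illustrative; flags vary by lmdeploy version):

lmdeploy serve gradio http://0.0.0.0:23333 --server-name 0.0.0.0 --server-port 6006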
Set up the local port forward:
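Same pattern as before, now forwarding the Gradio port:

ssh -CNg -L 6006:127.0.0.1:6006 root@ssh.intern-ai.org.cn -p <your SSH port>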
Then it's ready to try!
3.3 Model quantization
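The lecture mentions two quantization targets: the KV cache and the weights (W4A16 via AWQ). A hedged sketch of the lmdeploy lite workflow; subcommand flags differ across lmdeploy versions and the paths below are illustrative, so consult lmdeploy lite --help:

# Step 1: calibrate - collect activation statistics on a small calibration set
lmdeploy lite calibrate --model /root/share/model_repos/internlm-chat-7b --calib_dataset c4 --calib_samples 128 --calib_seqlen 2048 --work_dir ./quant_output

# Step 2a: KV-cache INT8 - export K/V quantization parameters into the TurboMind workspace
lmdeploy lite kv_qparams --work_dir ./quant_output --turbomind_dir ./workspace/triton_models/weights

# Step 2b: W4A16 - quantize the weights to 4-bit with AWQ
lmdeploy lite auto_awq --model /root/share/model_repos/internlm-chat-7b --w_bits 4 --w_group_size 128 --work_dir ./quant_output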