Deploying a Chinese Llama2-70B model with vLLM and an OpenAI-compatible API on an 8x RTX 3090 cloud GPU server

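TigerBot-70b-4k-v4 inference deployment

Local model deployment (based on HuggingFace)

In our tests, loading the model takes about 129 GB of GPU memory, so at least six 3090 cards are needed (with pipeline parallelism).

To accelerate inference with vLLM (tensor parallelism), plan for eight 3090 cards or four A100-40G cards, because of how the model has to be split.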
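The card counts above can be checked with a little arithmetic. The sketch below is only a back-of-the-envelope estimate under assumptions that are not stated in the post: a nominal 70 billion parameters stored in fp16 (2 bytes each), 24 GB per 3090, and 64 attention heads (the figure commonly quoted for Llama 2 70B). The function names are made up for illustration.

```python
import math


def weight_memory_gib(n_params: float, bytes_per_param: int = 2) -> float:
    """Memory taken by the weights alone, in GiB (fp16 = 2 bytes per parameter)."""
    return n_params * bytes_per_param / 2**30


def min_gpus(required_gib: float, gib_per_gpu: float) -> int:
    """Smallest number of cards whose combined memory covers the requirement."""
    return math.ceil(required_gib / gib_per_gpu)


def valid_tensor_parallel_sizes(num_heads: int, max_gpus: int) -> list[int]:
    """Tensor-parallel sizes that split the attention heads evenly."""
    return [n for n in range(1, max_gpus + 1) if num_heads % n == 0]


# Assumed values: 70e9 parameters, 64 attention heads, 24 GB per 3090.
print(round(weight_memory_gib(70e9), 1))   # 130.4, close to the measured 129 GB
print(min_gpus(129, 24))                   # 6 cards for pipeline parallelism
print(valid_tensor_parallel_sizes(64, 8))  # [1, 2, 4, 8]
```

Six cards are enough to hold the weights, but 6 does not divide 64 heads evenly, which is consistent with the post moving up to eight 3090s (or four 40 GB A100s) for tensor parallelism.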

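Model download

At the time of writing, the model weights are hosted only on HuggingFace. On Hengyuan Cloud (gpushare), download them as follows:

Enable the Hengyuan Cloud proxy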

 export https_proxy=http://turbo.gpushare.com:30000 http_proxy=http://turbo.gpushare.com:30000

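Open the model download page

We recommend downloading the model files with wget, because it can resume interrupted downloads (the -c flag continues a partial file). A wget example:

 wget -c https://huggingface.co/TigerResearch/tigerbot-70b-chat-v4-4k/resolve/main/pytorch_model-00001-of-00015.bin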
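The example fetches only the first shard. A minimal sketch for listing the remaining ones follows; it assumes the weights are split into 15 files that all follow the naming pattern of the example URL (inferred from the "00001-of-00015" suffix, not verified against the repository). Other files in the repository, such as the config and tokenizer, are not covered here.

```python
BASE_URL = "https://huggingface.co/TigerResearch/tigerbot-70b-chat-v4-4k/resolve/main"


def shard_urls(total: int = 15) -> list[str]:
    """Build one download URL per weight shard (shard count is an assumption)."""
    return [
        f"{BASE_URL}/pytorch_model-{index:05d}-of-{total:05d}.bin"
        for index in range(1, total + 1)
    ]


# Print one resumable wget command per shard; paste them into a shell to run.
for url in shard_urls():
    print(f"wget -c {url}")
```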

Disable the Hengyuan Cloud proxy

 unset http_proxy && unset https_proxy

Installing dependencies

Clone the official GitHub repository

 git clone https://github.com/TigerResearch/TigerBot.git && cd TigerBot

Install the dependencies

 pip install -r requirements.txt

Model inference

For ordinary multi-GPU inference, an example command is:

 CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python infer.py --model_path /path/to/your/model --max_input_length 1024 --max_generate_length 1024 --streaming True

Accelerated inference with vLLM

Install vLLM

 pip install vllm

Create a new Python file for inference

from vllm import LLM, SamplingParams

# Number of GPUs to shard the model across (tensor parallelism)
num_gpus = 8  # Change this to the number of GPUs you have

# Define the prompt and the sampling parameters
prompts = """
### Instruction:
first instruction

### Instruction:
second instruction

### Response:
"""
sampling_params = SamplingParams(temperature=1, top_p=0.9, top_k=50, max_tokens=512, stop="</s>")

# Initialize the vLLM engine. vLLM places the shards on the GPUs itself,
# so the model must not be wrapped in torch.nn.DataParallel.
llm = LLM(model="/hy-tmp/tigerbot-70b-chat-v4-4k", tensor_parallel_size=num_gpus, trust_remote_code=True)

# Generate outputs
outputs = llm.generate(prompts, sampling_params)

# Print the outputs
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

Note that the prompt format differs from Llama 2's. TigerBot prompts use the following format (note the two empty lines at the top):

  
 
### Instruction:
first instruction
 
### Response:
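Building this format by hand is error-prone, so a small helper can be useful. The sketch below is illustrative only: `build_tigerbot_prompt` is a hypothetical name, and the multi-turn layout (alternating Instruction and Response blocks) is an assumption that extends the single-turn format shown above.

```python
def build_tigerbot_prompt(messages: list[dict]) -> str:
    """Turn OpenAI-style messages into a TigerBot prompt string."""
    parts = ["\n"]  # together with the next block this gives the two leading newlines
    for message in messages:
        header = "### Instruction:" if message["role"] == "user" else "### Response:"
        parts.append(f"\n{header}\n{message['content']}\n")
    parts.append("\n### Response:\n")  # leave the final response open for the model
    return "".join(parts)


# Single-turn example; the prompt text means "Who are you?"
print(repr(build_tigerbot_prompt([{"role": "user", "content": "你是谁？"}])))
```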

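Troubleshooting

Most errors during installation come from mismatched dependency versions and go away once the versions are adjusted.

flash-attn installation error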

 /home/user/miniconda3/envs/textgen/lib/python3.10/site-packages/flash_attn_2_cuda.cpython-310-x86_64-linux-gnu.so: undefined symbol: _ZNK3c106SymIntltEl

Fix: rebuild the flash-attn library

 pip uninstall flash-attn
FLASH_ATTENTION_FORCE_BUILD=TRUE pip install flash-attn

Deploying an OpenAI-compatible API

Deployment command

On the same 8x 3090 machine, a single command deploys the TigerBot model:

 python -m vllm.entrypoints.openai.api_server \
    --model="/hy-tmp/tigerbot-70b-chat-v4-4k" \
    --tensor-parallel-size 8 \
    --served-model-name "tigerbot" \
    --chat-template tiger_template.jinja \
    --host 0.0.0.0 \
    --port 8080

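The arguments mean the following:

--model: location of the model weights, either a local path or a remote repository; here the model is loaded from a local path.
--tensor-parallel-size: number of tensor-parallel shards. The machine has 8 GPUs, so it is set to 8 (note that the number of attention heads must be evenly divisible by this value).
--served-model-name: the model name exposed by the service. By default it is the same as --model, which here would be an unsightly filesystem path that also reveals server details to a potential attacker.
--host, --port: the IP address and port the API listens on.
--chat-template: a template that adapts the multi-turn message list of the OpenAI API to TigerBot's multi-turn prompt format. It must be a Jinja template; the one I wrote is: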

{{ "" }}
{% for message in messages %}
{% if message['role'] == 'user' %}
{{ "\n### Instruction:" }}
{% else %}
{{ "\n### Response:" }}
{% endif %}
{{ message['content'] }}
{% endfor %}
{{ "\n### Response:\n" }}

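The chat template here uses the same format as chat_template in HuggingFace tokenizers.

Note that this feature is fairly new: vLLM supports it only from version 0.2.3 onward. If you see the error below, upgrade vLLM right away.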

 api_server.py: error: unrecognized arguments: --chat-templat

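In the Jinja template above, keep the first line (it produces one extra \n) and do not indent any line (indentation would leak extra spaces into the prompt).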

Verifying startup

If the following messages appear, the server started successfully:

 INFO:     Started server process [49087]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8080 (Press CTRL+C to quit)

You can connect with curl to see which models are available:

 curl http://localhost:8080/v1/models

If it works, you will see a response like this:

 {"object":"list","data":[{"id":"tigerbot","object":"model","created":1701951473,"owned_by":"vllm","root":"tigerbot","parent":null,"permission":[{"id":"modelperm-e084351f42514fd88aee16661312eaea","object":"model_permission","created":1701951473,"allow_create_engine":false,"allow_sampling":true,"allow_logprobs":true,"allow_search_indices":false,"allow_view":true,"allow_fine_tuning":false,"organization":"*","group":null,"is_blocking":false}]}]}
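The same check can be scripted. The following is a minimal sketch using only the Python standard library; `model_ids` and `fetch_model_ids` are hypothetical helper names, and the base URL assumes the server from the command above is reachable on localhost:8080.

```python
import json
import urllib.request


def model_ids(payload: dict) -> list[str]:
    """Extract the model names from a /v1/models response body."""
    return [entry["id"] for entry in payload.get("data", [])]


def fetch_model_ids(base_url: str = "http://localhost:8080/v1") -> list[str]:
    """Query a running server; raises URLError if nothing is listening."""
    with urllib.request.urlopen(f"{base_url}/models", timeout=10) as response:
        return model_ids(json.load(response))


# Offline demonstration on a trimmed copy of the response shown above.
sample = json.loads('{"object":"list","data":[{"id":"tigerbot","object":"model"}]}')
print(model_ids(sample))  # ['tigerbot']
```

If "tigerbot" is missing from the list, the --served-model-name argument was probably not picked up.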

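Interacting with the API

We can send requests with curl for the model to process.

The request below follows OpenAI's completions API, with the prompt wrapped by hand in TigerBot's dialogue format (你是谁？ means "Who are you?").

Completion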

 curl http://localhost:8080/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "tigerbot",
        "prompt": "\n\n### Instruction:\n你是谁？\n\n### Response:\n",
        "max_tokens": 1024,
        "temperature": 1
    }'

A standard single-turn chat

 curl http://localhost:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "tigerbot",
        "messages": [
            {"role": "user", "content": "3+5=?"}
        ]
    }'

The response:

 {
  "id": "cmpl-002b8cd331814cb6b8dde2d70340a024",
  "object": "chat.completion",
  "created": 10628423,
  "model": "tigerbot",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": " 3+5=8"
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 9,
    "total_tokens": 16,
    "completion_tokens": 7
  }
}
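To use such a response in a program, only two fields usually matter: the reply text and the token usage. The sketch below shows one way to pull them out; the helper names are made up for illustration, and the sample is a trimmed copy of the response above.

```python
import json


def extract_reply(response: dict) -> str:
    """Return the assistant text of the first choice."""
    return response["choices"][0]["message"]["content"]


def usage_is_consistent(response: dict) -> bool:
    """Check that prompt tokens plus completion tokens equal the reported total."""
    usage = response["usage"]
    return usage["prompt_tokens"] + usage["completion_tokens"] == usage["total_tokens"]


sample = json.loads("""
{
  "object": "chat.completion",
  "model": "tigerbot",
  "choices": [
    {"index": 0,
     "message": {"role": "assistant", "content": " 3+5=8"},
     "finish_reason": "stop"}
  ],
  "usage": {"prompt_tokens": 9, "total_tokens": 16, "completion_tokens": 7}
}
""")

print(extract_reply(sample).strip())  # 3+5=8
print(usage_is_consistent(sample))    # True
```

The reply starts with a space in this sample, so stripping whitespace before display is a reasonable default.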

A multi-turn chat test (再加上4 means "then add 4")

 curl http://localhost:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "tigerbot",
        "messages": [
            {"role": "user", "content": "3+5=?"},
            {"role": "assistant", "content": "3+5=8"},
            {"role": "user", "content": "再加上4"}
        ]
    }'
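In a multi-turn conversation the client has to resend the whole history on every request, as the curl call above does. A minimal standard-library sketch of that bookkeeping follows; `build_chat_request` and `chat` are hypothetical helpers, and the base URL is assumed to point at the server started earlier.

```python
import json
import urllib.request


def build_chat_request(base_url: str, model: str, messages: list[dict]) -> urllib.request.Request:
    """Prepare a POST to /chat/completions carrying the full message history."""
    body = json.dumps({"model": model, "messages": messages}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(base_url: str, model: str, history: list[dict], user_text: str) -> list[dict]:
    """Send one user turn and return the history extended with the model's reply."""
    messages = history + [{"role": "user", "content": user_text}]
    request = build_chat_request(base_url, model, messages)
    with urllib.request.urlopen(request, timeout=60) as response:
        reply = json.load(response)["choices"][0]["message"]["content"]
    return messages + [{"role": "assistant", "content": reply}]
```

With a server running, `history = chat("http://localhost:8080/v1", "tigerbot", [], "3+5=?")` followed by `chat("http://localhost:8080/v1", "tigerbot", history, "再加上4")` would reproduce the exchange above.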

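External access

I ran the test deployment on Hengyuan Cloud.

Serve the API on port 8080 and enable Hengyuan Cloud's custom API service; it gives you a public link, which you substitute for the local address.

When I tested, the link was http://i-1.gpushare.com:30028/v1/chat/completions.

In principle, you could also expose the service through an frp tunnel.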

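Python client using the OpenAI library

The code is the same as for the official API, except that the API base URL has to be changed.

Note the api_key, which defaults to EMPTY.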

 from openai import OpenAI
# Set OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
 
# Use the internal or the external address, depending on where you connect from
openai_api_base = "http://i-1.gpushare.com:30028/v1"
 
client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)
 
completion = client.chat.completions.create(
    model="tigerbot",
    messages=[
        {"role": "user", "content": "你是谁"},
    ]
)
print("Chat response:", completion.choices[0].message.content)

vLLM load test

With a single thread, output speed is about 23 tokens per second.

With multiple threads, it can reach 320 tokens per second.
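The post does not say how these numbers were measured. One plausible way, sketched below as an assumption rather than the author's method, is to fire requests from a thread pool and divide the generated tokens by the wall-clock time. `measure_throughput` is a hypothetical helper; `send_request` stands for any function that performs one API call and returns the `completion_tokens` count of its response, and is replaced here by a stand-in that only sleeps.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def measure_throughput(send_request: Callable[[], int], n_requests: int, n_workers: int) -> tuple[int, float]:
    """Run n_requests calls on n_workers threads; return (total tokens, tokens per second)."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        futures = [pool.submit(send_request) for _ in range(n_requests)]
        total_tokens = sum(future.result() for future in futures)
    elapsed = time.perf_counter() - start
    return total_tokens, total_tokens / elapsed


def fake_request() -> int:
    """Stand-in for a real API call: waits briefly and reports 7 generated tokens."""
    time.sleep(0.01)
    return 7


tokens, rate = measure_throughput(fake_request, n_requests=10, n_workers=5)
print(tokens, f"{rate:.0f} tokens/s")
```

Setting `n_workers=1` corresponds to the single-threaded figure; larger values let vLLM batch concurrent requests, which is where the higher aggregate throughput comes from.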

Posted 2023-12-07 21:44 by AlphaInf


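Arguments accepted by infer.py:

	--model_path: path to the model
	--model_type=chat: base/chat
	--max_input_length=1024: maximum input length
	--max_generate_length=1024: maximum output length
	--rope_scaling=None: length extrapolation method (dynamic and yarn are currently supported)
	--rope_factor=8.0: extrapolation factor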
