Exploring production-grade LLM deployment with vLLM

I. Overview

Deploying large models with vLLM. Official site: https://vllm.ai  GitHub: https://github.com/vllm-project/vllm

vLLM is a fast and easy-to-use library for large language model (LLM) inference and serving.

It has the following features:

  • Fast: with 3 parallel output completions per request, vLLM's serving throughput is 8.5x–15x that of HuggingFace Transformers (HF) and 3.3x–3.5x that of HuggingFace Text Generation Inference (TGI)
  • Optimized CUDA kernels
  • Flexible and easy to use:
  • Seamless integration with popular Hugging Face models
  • High-throughput serving with various decoding algorithms, including parallel sampling and beam search (see the sketch after this list)
  • Tensor parallelism for distributed inference
  • Streaming output
  • OpenAI-compatible API server
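As referenced in the list above, here is a minimal offline-inference sketch using vLLM's LLM and SamplingParams classes (the Python API rather than the HTTP server). The model path is only illustrative; it matches the weights downloaded later in this post.

from vllm import LLM, SamplingParams

# Load the model once; the path is illustrative (downloaded later in this post)
llm = LLM(model="/root/autodl-tmp/LLM-Research/Meta-Llama-3-8B-Instruct")

# Sampling settings: temperature/top_p sampling, capped at 64 new tokens
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

# generate() batches the prompts internally and returns one result per prompt
outputs = llm.generate(["你是谁?", "1+1=?"], sampling_params)
for output in outputs:
    print(output.prompt, "->", output.outputs[0].text)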

Supported models

vLLM seamlessly supports many Hugging Face models across different architectures, including Aquila, Baichuan, BLOOM, Falcon, GPT-2, GPT BigCode, GPT-J, GPT-NeoX, InternLM, LLaMA, Mistral, MPT, OPT, Qwen, and more. (https://vllm.readthedocs.io/en/latest/models/supported_models.html)


Currently, GLM3 and Llama 3 each provide their own OpenAI-style service, so let's see what vLLM does differently.

II. Initial experiments

Installation:

pip install vllm

Download the model:

# Download the Llama-3-8B-Instruct weights from ModelScope into /root/autodl-tmp
from modelscope import snapshot_download

model_dir = snapshot_download('LLM-Research/Meta-Llama-3-8B-Instruct',
                              cache_dir='/root/autodl-tmp',
                              revision='master')

Run the code above to download the weights.

Start the service:

python -m vllm.entrypoints.openai.api_server --model  /root/autodl-tmp/LLM-Research/Meta-Llama-3-8B-Instruct  --trust-remote-code --port 6006
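Because the endpoint is OpenAI-compatible, the official openai Python client can talk to it directly. A minimal sketch, assuming the openai>=1.0 client is installed and the server above is listening on port 6006; the api_key value is a placeholder because the server was started without --api-key.

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:6006/v1",  # the vLLM server started above
    api_key="EMPTY",                      # placeholder; no --api-key was set
)

response = client.chat.completions.create(
    model="/root/autodl-tmp/LLM-Research/Meta-Llama-3-8B-Instruct",
    messages=[{"role": "user", "content": "你是谁?"}],
    max_tokens=60,
)
print(response.choices[0].message.content)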


Resource usage:


Try calling the service through Postman; the equivalent curl request is:

curl http://localhost:6006/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
        "model": "/root/autodl-tmp/LLM-Research/Meta-Llama-3-8B-Instruct",
        "max_tokens":60,
        "messages": [
            {
                "role": "user",
                "content": "你是谁?"
            }
        ]
    }'
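The feature list above also mentions streaming output. A streaming variant of the same request, again a sketch assuming the openai>=1.0 Python client and the server started above:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:6006/v1", api_key="EMPTY")

stream = client.chat.completions.create(
    model="/root/autodl-tmp/LLM-Research/Meta-Llama-3-8B-Instruct",
    messages=[{"role": "user", "content": "你是谁?"}],
    max_tokens=60,
    stream=True,  # tokens arrive incrementally as server-sent events
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)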

 

Neither this way of obtaining results nor the speed is the best.


By contrast, the model's own built-in service uses less GPU memory.


The single-request test code below runs as-is and integrates well with existing code.

import requests
import json

def get_completion(prompt):
    # POST the prompt to the locally running service and return the generated text
    headers = {'Content-Type': 'application/json'}
    data = {"prompt": prompt}
    response = requests.post(url='http://127.0.0.1:6006', headers=headers, data=json.dumps(data))
    return response.json()['response']

if __name__ == '__main__':
    print(get_completion('1+1=?'))

III. Dual-GPU experiment (brief)

Distributed inference

vLLM supports distributed tensor-parallel inference and serving, using Ray to manage the distributed runtime. Install Ray with:

pip install ray

For the distributed-inference experiment, to run a multi-GPU service, pass the --tensor-parallel-size argument when launching the server.

For example, to run the API server on 2 GPUs:

python -m vllm.entrypoints.openai.api_server --model /root/autodl-tmp/Yi-6B-Chat --dtype auto --api-key token-agiclass  --trust-remote-code --port 6006 --tensor-parallel-size 2
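The same tensor-parallel setting is also available through the offline Python API. A sketch, assuming two visible GPUs and Ray installed as above; the model path is the Yi-6B-Chat path from the command:

from vllm import LLM, SamplingParams

# tensor_parallel_size=2 shards the model weights across two GPUs
llm = LLM(model="/root/autodl-tmp/Yi-6B-Chat", tensor_parallel_size=2)

outputs = llm.generate(["你好"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)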

Multi-GPU serving is certainly a key capability, but for now I don't have enough motivation to dig into it further.

IV. Summary

From an initial read of the relevant code, vLLM takes a similar approach for the OpenAI-style interface; but, probably to support parallelism, its codebase is considerably heavier, and some incompatibilities show up.

My main takeaway for now is still to write applications on top of the existing stack. The crucial point is to understand the underlying principles, so that you can handle whatever situation arises; the ability to explore those principles is the core skill.

https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/openai/api_server.py

 

import asyncio
import importlib
import inspect
import os
from contextlib import asynccontextmanager
from http import HTTPStatus

import fastapi
import uvicorn
from fastapi import Request
from fastapi.exceptions import RequestValidationError
from fastapi.middleware.cors import CORSMiddleware
from fastapi.responses import JSONResponse, Response, StreamingResponse
from prometheus_client import make_asgi_app

import vllm
from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine
from vllm.entrypoints.openai.cli_args import make_arg_parser
from vllm.entrypoints.openai.protocol import (ChatCompletionRequest,
                                              ChatCompletionResponse,
                                              CompletionRequest, ErrorResponse)
from vllm.entrypoints.openai.serving_chat import OpenAIServingChat
from vllm.entrypoints.openai.serving_completion import OpenAIServingCompletion
from vllm.logger import init_logger
from vllm.usage.usage_lib import UsageContext

TIMEOUT_KEEP_ALIVE = 5  # seconds

openai_serving_chat: OpenAIServingChat
openai_serving_completion: OpenAIServingCompletion

logger = init_logger(__name__)


@asynccontextmanager
async def lifespan(app: fastapi.FastAPI):

    async def _force_log():
        while True:
            await asyncio.sleep(10)
            await engine.do_log_stats()

    if not engine_args.disable_log_stats:
        asyncio.create_task(_force_log())

    yield


app = fastapi.FastAPI(lifespan=lifespan)


def parse_args():
    parser = make_arg_parser()
    return parser.parse_args()


# Add prometheus asgi middleware to route /metrics requests
metrics_app = make_asgi_app()
app.mount("/metrics", metrics_app)


@app.exception_handler(RequestValidationError)
async def validation_exception_handler(_, exc):
    err = openai_serving_chat.create_error_response(message=str(exc))
    return JSONResponse(err.model_dump(), status_code=HTTPStatus.BAD_REQUEST)


@app.get("/health")
async def health() -> Response:
    """Health check."""
    await openai_serving_chat.engine.check_health()
    return Response(status_code=200)


@app.get("/v1/models")
async def show_available_models():
    models = await openai_serving_chat.show_available_models()
    return JSONResponse(content=models.model_dump())


@app.get("/version")
async def show_version():
    ver = {"version": vllm.__version__}
    return JSONResponse(content=ver)


@app.post("/v1/chat/completions")
async def create_chat_completion(request: ChatCompletionRequest,
                                 raw_request: Request):
    generator = await openai_serving_chat.create_chat_completion(
        request, raw_request)
    if isinstance(generator, ErrorResponse):
        return JSONResponse(content=generator.model_dump(),
                            status_code=generator.code)
    if request.stream:
        return StreamingResponse(content=generator,
                                 media_type="text/event-stream")
    else:
        assert isinstance(generator, ChatCompletionResponse)
        return JSONResponse(content=generator.model_dump())


@app.post("/v1/completions")
async def create_completion(request: CompletionRequest, raw_request: Request):
    generator = await openai_serving_completion.create_completion(
        request, raw_request)
    if isinstance(generator, ErrorResponse):
        return JSONResponse(content=generator.model_dump(),
                            status_code=generator.code)
    if request.stream:
        return StreamingResponse(content=generator,
                                 media_type="text/event-stream")
    else:
        return JSONResponse(content=generator.model_dump())


if __name__ == "__main__":
    args = parse_args()

    app.add_middleware(
        CORSMiddleware,
        allow_origins=args.allowed_origins,
        allow_credentials=args.allow_credentials,
        allow_methods=args.allowed_methods,
        allow_headers=args.allowed_headers,
    )

    if token := os.environ.get("VLLM_API_KEY") or args.api_key:

        @app.middleware("http")
        async def authentication(request: Request, call_next):
            root_path = "" if args.root_path is None else args.root_path
            if not request.url.path.startswith(f"{root_path}/v1"):
                return await call_next(request)
            if request.headers.get("Authorization") != "Bearer " + token:
                return JSONResponse(content={"error": "Unauthorized"},
                                    status_code=401)
            return await call_next(request)

    for middleware in args.middleware:
        module_path, object_name = middleware.rsplit(".", 1)
        imported = getattr(importlib.import_module(module_path), object_name)
        if inspect.isclass(imported):
            app.add_middleware(imported)
        elif inspect.iscoroutinefunction(imported):
            app.middleware("http")(imported)
        else:
            raise ValueError(f"Invalid middleware {middleware}. "
                             f"Must be a function or a class.")

    logger.info(f"vLLM API server version {vllm.__version__}")
    logger.info(f"args: {args}")

    if args.served_model_name is not None:
        served_model_names = args.served_model_name
    else:
        served_model_names = [args.model]

    engine_args = AsyncEngineArgs.from_cli_args(args)
    engine = AsyncLLMEngine.from_engine_args(
        engine_args, usage_context=UsageContext.OPENAI_API_SERVER)
    openai_serving_chat = OpenAIServingChat(engine, served_model_names,
                                            args.response_role,
                                            args.lora_modules,
                                            args.chat_template)
    openai_serving_completion = OpenAIServingCompletion(
        engine, served_model_names, args.lora_modules)

    app.root_path = args.root_path
    uvicorn.run(app,
                host=args.host,
                port=args.port,
                log_level=args.uvicorn_log_level,
                timeout_keep_alive=TIMEOUT_KEEP_ALIVE,
                ssl_keyfile=args.ssl_keyfile,
                ssl_certfile=args.ssl_certfile,
                ssl_ca_certs=args.ssl_ca_certs,
                ssl_cert_reqs=args.ssl_cert_reqs)

 
