litellm + FastAPI SSE API integration: a brief guide

Essentially, this wraps a streaming REST API around gemma2 using litellm + ollama. Below is a brief walkthrough.

Reference setup

litellm acts as a proxy in front of the ollama-served models while adding stronger access control; ollama hosts qwen2, gemma2, and other models.

litellm proxy configuration

litellm supports both a static config file and a dynamic DB-backed mode; below is a simple configuration.

  • Configuration
model_list:
  - model_name: qwen2
    litellm_params: # params for litellm.completion() - https://docs.litellm.ai/docs/completion/input#input---request-body
      model: ollama/qwen2:1.5b
      api_base: http://localhost:11434
      api_key: demo
      rpm: 60
  - model_name: qwen2
    litellm_params: # params for litellm.completion() - https://docs.litellm.ai/docs/completion/input#input---request-body
      model: ollama/qwen2:0.5b
      api_base: http://localhost:11434
      api_key: demo
      rpm: 60
router_settings:
  routing_strategy: usage-based-routing-v2 
# db configuration
general_settings: 
  master_key: sk-1234 
  store_model_in_db: true
  database_url: "postgresql://postgres:postgres@localhost:5432/postgres"
  • Startup
    Note that the proxy must be started with export STORE_MODEL_IN_DB='True' (this setting can also be placed in the config file). The startup command is:
litellm --config ./configv3.yaml
  • Configuring proxy models stored in the DB
    Models can be added through the proxy admin UI (log in with the master key configured above); a sketch of doing the same thing via the management API is shown below.
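
The following is only a rough sketch: it assumes the litellm proxy's documented management API (a /model/new endpoint for DB-stored models) plus the OpenAI-compatible /chat/completions route, authenticated with the master key from the config above. The gemma2 entry is a hypothetical example, not something registered in the original post.

import requests

PROXY = "http://localhost:4000"
HEADERS = {"Authorization": "Bearer sk-1234"}  # master_key from the config above

# Register an additional model in the DB-backed proxy.
# NOTE: /model/new and this payload shape are assumptions based on the litellm
# proxy docs; adjust them to the litellm version actually deployed.
resp = requests.post(
    f"{PROXY}/model/new",
    headers=HEADERS,
    json={
        "model_name": "gemma2",  # hypothetical entry, mirroring the qwen2 ones above
        "litellm_params": {
            "model": "ollama/gemma2",
            "api_base": "http://localhost:11434",
        },
    },
)
print(resp.status_code, resp.text)

# Sanity check: send a chat completion through the proxy's OpenAI-compatible route.
resp = requests.post(
    f"{PROXY}/chat/completions",
    headers=HEADERS,
    json={"model": "qwen2", "messages": [{"role": "user", "content": "hello"}]},
)
print(resp.json()["choices"][0]["message"]["content"])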

Code

The code is built on FastAPI. Streaming is shown two ways: via the sse_starlette.sse extension (EventSourceResponse) and via FastAPI's StreamingResponse. The OpenAI API integration simply uses the standard openai client pointed at the litellm proxy.

import asyncio

import openai
from fastapi import FastAPI
from fastapi.middleware.cors import CORSMiddleware
from fastapi.responses import StreamingResponse
from sse_starlette.sse import EventSourceResponse

app = FastAPI()

# Point the openai client at the litellm proxy (key issued by the proxy)
openai.api_key = "sk-fYaaDIMuxY17MOhuwAkcxA"
openai.base_url = "http://localhost:4000"
 
def sse_format(message: str):
    return f"data: {message}\n\n"
 
 
app.add_middleware(CORSMiddleware,
    allow_origins=["*"],
    allow_credentials=True,
    allow_methods=["*"],
    allow_headers=["*"])
 
 
@app.get("/stream")
async def stream_openai(prompt: str):
    async def generate():
        response = openai.chat.completions.create(
            model="gemma2",
            stream=True,
            messages = [
                {
                    "role": "user",
                    "content": prompt
                }
            ]
        )
        for chunk in response:
            choice = chunk.choices[0]
            yield (choice.model_dump_json())
            await asyncio.sleep(0.1)
    return EventSourceResponse(generate())
 
@app.get("/streamv2")  
async def openai_stream(prompt: str):  
    messages = [{"role": "user", "content": prompt}]  
 
    async def stream_response():  
        response = openai.chat.completions.create(  
            model="gemma2",  
            messages=messages,  
            stream=True  
        )
        for chunk in response:
            choice = chunk.choices[0]
            yield sse_format(choice.model_dump_json())
            # simulate per-chunk latency
            await asyncio.sleep(0.1)
 
    return StreamingResponse(stream_response(), media_type="text/event-stream")  
 
if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=8000)
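
For reference, here is a minimal client sketch that consumes the /streamv2 endpoint above, assuming the FastAPI app is running on localhost:8000; it uses httpx (not part of the original post) to read the SSE lines:

import httpx

def consume_stream(prompt: str) -> None:
    # Stream the SSE response line by line; each "data:" line carries one
    # choice object serialized by sse_format() above.
    with httpx.stream(
        "GET",
        "http://localhost:8000/streamv2",
        params={"prompt": prompt},
        timeout=None,
    ) as resp:
        for line in resp.iter_lines():
            if line.startswith("data: "):
                print(line[len("data: "):])

if __name__ == "__main__":
    consume_stream("hello")

The /stream endpoint can be read the same way; there EventSourceResponse adds the data: framing itself, which is why that handler yields the raw JSON without sse_format().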

Notes

The above is just a brief walkthrough; the core piece is the litellm configuration, the rest is straightforward.

References

litellm/proxy/proxy_server.py
https://docs.litellm.ai/docs/proxy/deploy
https://docs.litellm.ai/docs/proxy/configs
