A quick note on litellm + fastapi SSE API integration
In essence this wraps a streaming REST API around gemma2 using litellm + ollama; a short walkthrough follows.
Reference setup
litellm sits in front of the ollama-hosted models as a proxy and adds security controls (API keys, rate limits); ollama serves qwen2, gemma2, and other models.
litellm proxy configuration
litellm supports both static (file-based) configuration and a dynamic DB-backed mode; a minimal config follows.
- Config
model_list:
  - model_name: qwen2
    litellm_params: # params for litellm.completion() - https://docs.litellm.ai/docs/completion/input#input---request-body
      model: ollama/qwen2:1.5b
      api_base: http://localhost:11434
      api_key: demo
      rpm: 60
  - model_name: qwen2
    litellm_params: # params for litellm.completion() - https://docs.litellm.ai/docs/completion/input#input---request-body
      model: ollama/qwen2:0.5b
      api_base: http://localhost:11434
      api_key: demo
      rpm: 60
router_settings:
  routing_strategy: usage-based-routing-v2
# db settings
general_settings:
  master_key: sk-1234
  store_model_in_db: true
  database_url: "postgresql://postgres:postgres@localhost:5432/postgres"
- Startup
Note that startup requires export STORE_MODEL_IN_DB='True' (this can also be set in the config file, as above); the start command is:
litellm --config ./configv3.yaml
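Once the proxy is up, a quick smoke test through its OpenAI-compatible endpoint confirms routing works. This is a minimal sketch, assuming the proxy listens on localhost:4000 and reusing the master_key from the config above (in practice a generated virtual key is preferable):
from openai import OpenAI

# sk-1234 is the master_key from the config above; the litellm proxy
# exposes an OpenAI-compatible API on port 4000 by default.
client = OpenAI(api_key="sk-1234", base_url="http://localhost:4000")

resp = client.chat.completions.create(
    model="qwen2",
    messages=[{"role": "user", "content": "say hello"}],
)
print(resp.choices[0].message.content)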
- Configuring proxy models stored in the DB
See the screenshot below (the login credentials are the ones configured above).
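With DB mode enabled, virtual keys can also be issued per caller instead of handing out the master_key. A rough sketch, assuming the proxy's standard /key/generate route; the returned key is then used as the api_key in the client code below:
import requests

# Ask the litellm proxy to mint a virtual key, authorized with the master_key.
resp = requests.post(
    "http://localhost:4000/key/generate",
    headers={"Authorization": "Bearer sk-1234"},
    json={"models": ["qwen2", "gemma2"]},
)
print(resp.json()["key"])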
Code
The code is built on fastapi. Streaming is shown two ways: via the sse_starlette.sse extension (EventSourceResponse) and via the plain StreamingResponse pattern; the openai-compatible API is called through the standard openai client.
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from fastapi.middleware.cors import CORSMiddleware
from sse_starlette.sse import EventSourceResponse
import openai
import asyncio

app = FastAPI()

# Point the standard openai client at the litellm proxy (virtual key + proxy URL)
openai.api_key = "sk-fYaaDIMuxY17MOhuwAkcxA"
openai.base_url = "http://localhost:4000"

def sse_format(message: str):
    # Manual SSE framing: each event is "data: <payload>\n\n"
    return f"data: {message}\n\n"

app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],
    allow_credentials=True,
    allow_methods=["*"],
    allow_headers=["*"],
)

@app.get("/stream")
async def stream_openai(prompt: str):
    # Variant 1: sse_starlette's EventSourceResponse does the SSE framing for us
    async def generate():
        response = openai.chat.completions.create(
            model="gemma2",
            stream=True,
            messages=[
                {"role": "user", "content": prompt}
            ],
        )
        for chunk in response:
            choice = chunk.choices[0]
            yield choice.model_dump_json()
            await asyncio.sleep(0.1)
    return EventSourceResponse(generate())

@app.get("/streamv2")
async def openai_stream(prompt: str):
    messages = [{"role": "user", "content": prompt}]

    # Variant 2: plain StreamingResponse, framing each chunk by hand with sse_format
    async def stream_response():
        response = openai.chat.completions.create(
            model="gemma2",
            messages=messages,
            stream=True,
        )
        for chunk in response:
            choice = chunk.choices[0]
            yield sse_format(choice.model_dump_json())
            # simulate latency
            await asyncio.sleep(0.1)
    return StreamingResponse(stream_response(), media_type="text/event-stream")

if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=8000)
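For completeness, here is a minimal client sketch for consuming the /streamv2 endpoint, assuming the app above runs on localhost:8000; httpx is used here, but any SSE-capable client works:
import httpx

# Stream the SSE response line by line and print each data payload.
with httpx.stream(
    "GET",
    "http://localhost:8000/streamv2",
    params={"prompt": "tell me a joke"},
    timeout=None,
) as resp:
    for line in resp.iter_lines():
        if line.startswith("data: "):
            print(line[len("data: "):])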
Notes
This is just a quick walkthrough; the core piece is the litellm configuration, the rest is fairly straightforward.
References
litellm/proxy/proxy_server.py
https://docs.litellm.ai/docs/proxy/deploy
https://docs.litellm.ai/docs/proxy/configs