Deploying custom embedding models with xinference (Docker)
Notes:
- First published: 2024-08-27
- Official documentation: https://inference.readthedocs.io/zh-cn/latest/index.html
Deploying xinference with Docker
Use the following Dockerfile:
FROM nvcr.io/nvidia/pytorch:23.10-py3
# Keeps Python from generating .pyc files in the container
ENV PYTHONDONTWRITEBYTECODE=1
# Turns off buffering for easier container logging
ENV PYTHONUNBUFFERED=1
RUN python3 -m pip uninstall -y transformer-engine
RUN python3 -m pip install --upgrade pip
RUN python3 -m pip install torch==2.3.0 torchvision==0.18.0 torchaudio==2.3.0 --no-cache-dir --index-url https://download.pytorch.org/whl/cu121
# If there are network issues, you can download the torch wheel file and install it locally
# ADD torch-2.3.0+cu121-cp310-cp310-linux_x86_64.whl /root/torch-2.3.0+cu121-cp310-cp310-linux_x86_64.whl
# RUN python3 -m pip install /root/torch-2.3.0+cu121-cp310-cp310-linux_x86_64.whl
RUN python3 -m pip install packaging setuptools==69.5.1 --no-cache-dir -i https://mirror.baidu.com/pypi/simple
RUN python3 -m pip install -U ninja --no-cache-dir -i https://mirror.baidu.com/pypi/simple
RUN python3 -m pip install flash-attn==2.5.8 --no-build-isolation --no-cache-dir
RUN python3 -m pip install "xinference[all]" --no-cache-dir -i https://repo.huaweicloud.com/repository/pypi/simple
EXPOSE 80
CMD ["sh", "-c", "tail -f /dev/null"]
Build the image:
docker build -t myxinference:latest .
Note that the CMD above only keeps the container alive; after starting the container with docker run, launch the service inside it, e.g. with xinference-local --host 0.0.0.0 --port 9997.
Also, if you pull models from Hugging Face, it is recommended to use the https://hf-mirror.com/ mirror (remember to set the HF_ENDPOINT environment variable when running the container).
The following assumes the deployed service is available at http://localhost:9997.
Deploying custom embedding models
Preparing the custom embedding model JSON files
Create the folder custom_models/embedding:
mkdir -p custom_models/embedding
Then create the following model definition JSON files:
360Zhinao-search.json:
{
"model_name": "360Zhinao-search",
"dimensions": 1024,
"max_tokens": 512,
"language": ["en", "zh"],
"model_id": "qihoo360/360Zhinao-search",
"model_format": "pytorch"
}
gte-Qwen2-7B-instruct.json:
{
"model_name": "gte-Qwen2-7B-instruct",
"dimensions": 4096,
"max_tokens": 32768,
"language": ["en", "zh"],
"model_id": "Alibaba-NLP/gte-Qwen2-7B-instruct",
"model_format": "pytorch"
}
zpoint_large_embedding_zh.json:
{
"model_name": "zpoint_large_embedding_zh",
"dimensions": 1792,
"max_tokens": 512,
"language": ["zh"],
"model_id": "iampanda/zpoint_large_embedding_zh",
"model_format": "pytorch"
}
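The three definitions above can also be generated with a short Python script, which is convenient if you add more models later. This is just a convenience sketch; the file contents match the JSON shown above exactly:

```python
import json
from pathlib import Path

# Model definitions, identical to the three JSON files above
MODELS = [
    {
        "model_name": "360Zhinao-search",
        "dimensions": 1024,
        "max_tokens": 512,
        "language": ["en", "zh"],
        "model_id": "qihoo360/360Zhinao-search",
        "model_format": "pytorch",
    },
    {
        "model_name": "gte-Qwen2-7B-instruct",
        "dimensions": 4096,
        "max_tokens": 32768,
        "language": ["en", "zh"],
        "model_id": "Alibaba-NLP/gte-Qwen2-7B-instruct",
        "model_format": "pytorch",
    },
    {
        "model_name": "zpoint_large_embedding_zh",
        "dimensions": 1792,
        "max_tokens": 512,
        "language": ["zh"],
        "model_id": "iampanda/zpoint_large_embedding_zh",
        "model_format": "pytorch",
    },
]

def write_model_defs(out_dir="custom_models/embedding"):
    """Write one <model_name>.json file per definition and return the paths."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for spec in MODELS:
        path = out / f"{spec['model_name']}.json"
        path.write_text(json.dumps(spec, ensure_ascii=False, indent=2))
        paths.append(path)
    return paths

if __name__ == "__main__":
    for p in write_model_defs():
        print(p)
```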
Note: for a model that has already been downloaded locally, you can set the model_uri field instead, e.g. file:///path/to/llama-2-7b.
Registering the custom embedding models
xinference register --model-type embedding --file custom_models/embedding/360Zhinao-search.json --persist --endpoint http://localhost:9997
xinference register --model-type embedding --file custom_models/embedding/gte-Qwen2-7B-instruct.json --persist --endpoint http://localhost:9997
xinference register --model-type embedding --file custom_models/embedding/zpoint_large_embedding_zh.json --persist --endpoint http://localhost:9997
Launching the custom embedding models
xinference launch --model-type embedding --model-name gte-Qwen2-7B-instruct --model-engine transformers --model-format pytorch --endpoint http://localhost:9997
xinference launch --model-type embedding --model-name 360Zhinao-search --model-engine transformers --model-format pytorch --endpoint http://localhost:9997
xinference launch --model-type embedding --model-name zpoint_large_embedding_zh --model-engine transformers --model-format pytorch --endpoint http://localhost:9997
Launching the bge-m3 and bge-reranker-base models
bge-m3 and bge-reranker-base are commonly used embedding and reranking models, respectively.
xinference launch --model-name bge-m3 --model-type embedding --endpoint http://localhost:9997
xinference launch --model-name bge-reranker-base --model-type rerank --endpoint http://localhost:9997
Testing with curl
Embedding:
curl http://localhost:9997/v1/embeddings \
-H "Content-Type: application/json" \
-d '{
"input": "The food was delicious and the waiter...",
"model": "360Zhinao-search",
"encoding_format": "float"
}'
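The same endpoint can also be called from Python. The sketch below uses only the standard library and assumes the service from the steps above is running at http://localhost:9997; the cosine helper is included only to illustrate how the returned vectors might be compared:

```python
import json
import urllib.request

BASE_URL = "http://localhost:9997"  # the service address assumed above

def embed(texts, model="360Zhinao-search"):
    """Call the OpenAI-compatible /v1/embeddings endpoint and
    return one embedding vector per input text."""
    payload = json.dumps({"model": model, "input": texts}).encode("utf-8")
    req = urllib.request.Request(
        BASE_URL + "/v1/embeddings",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        data = json.load(resp)
    return [item["embedding"] for item in data["data"]]

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(x * x for x in b) ** 0.5
    return dot / (na * nb)

if __name__ == "__main__":
    vecs = embed(["The food was delicious and the waiter...", "菜品很美味,服务员..."])
    print(cosine(vecs[0], vecs[1]))
```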
Reranking:
curl http://localhost:9997/v1/rerank \
-H "Content-Type: application/json" \
-d '{
"model": "bge-reranker-base",
"query": "I love you",
"documents": [
"I hate you",
"I really like you",
"天空是什么颜色的",
"黑芝麻味饼干"
],
"top_n": 3
}'
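A stdlib-only Python equivalent of the rerank call is sketched below. It assumes the Cohere-style response shape that the curl call returns, i.e. {"results": [{"index": ..., "relevance_score": ...}, ...]}; the ordered_documents helper maps those results back to document texts, best match first:

```python
import json
import urllib.request

BASE_URL = "http://localhost:9997"  # the service address assumed above

def rerank(query, documents, model="bge-reranker-base", top_n=None):
    """POST to the /v1/rerank endpoint and return the parsed response dict."""
    body = {"model": model, "query": query, "documents": documents}
    if top_n is not None:
        body["top_n"] = top_n
    req = urllib.request.Request(
        BASE_URL + "/v1/rerank",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def ordered_documents(documents, response):
    """Sort the returned results by relevance_score (descending) and
    map each result's index back to the original document text."""
    results = sorted(response["results"],
                     key=lambda r: r["relevance_score"], reverse=True)
    return [documents[r["index"]] for r in results]

if __name__ == "__main__":
    docs = ["I hate you", "I really like you", "天空是什么颜色的", "黑芝麻味饼干"]
    resp = rerank("I love you", docs, top_n=3)
    print(ordered_documents(docs, resp))
```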