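Milvus Installation and Usage Tutorial

Using Milvus from Python

Versions: (Milvus == 0.10.2, pymilvus == 0.2.14)

Pull the Milvus image

(Milvus needs to be installed on Docker, and the virtual machine should preferably run Ubuntu 18.04. For installing Docker, see the Runoob tutorial; the steps below assume Docker is already installed.)

Milvus official tutorial

Pull the CPU version of the Milvus image: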

$ sudo docker pull milvusdb/milvus:0.10.2-cpu-d081520-8a2393
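  • If network restrictions prevent your host from fetching the Docker image and configuration file online, pull the image on another host, save it as a TAR file, transfer it to the local machine, and load it back as a Docker image. The official documentation provides code examples for offline transfer.
  • If pulling the image is too slow or keeps failing, see the solutions in the deployment and operations FAQ of the official documentation.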
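
For the offline case in the first bullet, the usual Docker workflow is `docker save` on a host with network access and `docker load` on the target host. The following is a minimal sketch that only builds the two command lines. The TAR filename `milvus_0.10.2_cpu.tar` and the helper name `offline_transfer_commands` are assumptions chosen for illustration, and copying the file between hosts is left out.

```python
import shlex

IMAGE = "milvusdb/milvus:0.10.2-cpu-d081520-8a2393"


def offline_transfer_commands(image=IMAGE, tar_path="milvus_0.10.2_cpu.tar"):
    """Return (save_cmd, load_cmd) as argument lists.

    tar_path is a hypothetical filename used only for illustration.
    """
    # Run on the host that can reach Docker Hub (after `docker pull`).
    save_cmd = ["sudo", "docker", "save", "-o", tar_path, image]
    # Run on the offline host once the TAR file has been copied over.
    load_cmd = ["sudo", "docker", "load", "-i", tar_path]
    return save_cmd, load_cmd


if __name__ == "__main__":
    for cmd in offline_transfer_commands():
        print(shlex.join(cmd))
```

The commands are returned as lists so they can be passed to `subprocess.run` unchanged or printed and pasted into a shell.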

Download the configuration file

$ mkdir -p /home/$USER/milvus/conf
$ cd /home/$USER/milvus/conf
$ wget https://raw.githubusercontent.com/milvus-io/milvus/0.10.2/core/conf/demo/server_config.yaml

If the configuration file cannot be downloaded with wget, you can instead create a server_config.yaml file in the /home/$USER/milvus/conf directory and copy the contents of the official server config file into it.

Start the Milvus Docker container

Start the Docker container and map local file paths into the container:

$ sudo docker run -d --name milvus_cpu_0.10.2 \
-p 19530:19530 \
-p 19121:19121 \
-v /home/$USER/milvus/db:/var/lib/milvus/db \
-v /home/$USER/milvus/conf:/var/lib/milvus/conf \
-v /home/$USER/milvus/logs:/var/lib/milvus/logs \
-v /home/$USER/milvus/wal:/var/lib/milvus/wal \
milvusdb/milvus:0.10.2-cpu-d081520-8a2393

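The parameters used in the command above are:

  • -d: run the container in the background.
  • --name: assign a name to the container.
  • -p: specify a port mapping.
  • -v: mount a host path into the container.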
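
To make the structure of the `-p` and `-v` arguments explicit, here is a small sketch that assembles the same `docker run` command from a base directory. The function name `milvus_run_command` and the example directory are hypothetical; the ports, subdirectories, container name, and image tag are the ones used in the command above.

```python
from pathlib import PurePosixPath


def milvus_run_command(base_dir,
                       name="milvus_cpu_0.10.2",
                       image="milvusdb/milvus:0.10.2-cpu-d081520-8a2393"):
    """Build the `docker run` argument list for the Milvus CPU container."""
    base = PurePosixPath(base_dir)
    cmd = ["sudo", "docker", "run", "-d", "--name", name]
    # -p host_port:container_port
    for port in (19530, 19121):
        cmd += ["-p", f"{port}:{port}"]
    # -v host_path:container_path, one mount per Milvus directory
    for sub in ("db", "conf", "logs", "wal"):
        cmd += ["-v", f"{base / sub}:/var/lib/milvus/{sub}"]
    cmd.append(image)
    return cmd


if __name__ == "__main__":
    # "/home/alice/milvus" is a placeholder for /home/$USER/milvus
    print(" ".join(milvus_run_command("/home/alice/milvus")))
```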

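Check that Milvus is running:

$ sudo docker ps

If the container is not listed, it failed to start. List all containers, including stopped ones:

$ sudo docker ps -a

After fixing the problem, restart the container with sudo docker restart <container name>.

If the Milvus service did not start properly, run the following command to inspect the error log: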

$ sudo docker logs milvus_cpu_0.10.2
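
The status check can also be scripted. The sketch below assumes the `docker` CLI is on the PATH and that the current user may call it; it asks `docker ps -a` for tab-separated name and status columns via `--format` and treats a status beginning with "Up" as running. The helper names and the sample output in the demo are illustrative, not taken from a real run.

```python
import subprocess


def parse_docker_ps(output):
    """Map container name -> status text from tab-separated `docker ps` output."""
    statuses = {}
    for line in output.splitlines():
        if not line.strip():
            continue
        name, _, status = line.partition("\t")
        statuses[name.strip()] = status.strip()
    return statuses


def is_running(statuses, name):
    """A container counts as running when its status text starts with 'Up'."""
    return statuses.get(name, "").startswith("Up")


def container_statuses():
    """Query Docker for all containers (requires Docker and sufficient permissions)."""
    out = subprocess.run(
        ["docker", "ps", "-a", "--format", "{{.Names}}\t{{.Status}}"],
        check=True, capture_output=True, text=True,
    ).stdout
    return parse_docker_ps(out)


if __name__ == "__main__":
    # Illustrative output only; call container_statuses() on a real host.
    sample = "milvus_cpu_0.10.2\tUp 2 hours\nold_container\tExited (1) 3 minutes ago\n"
    print(is_running(parse_docker_ps(sample), "milvus_cpu_0.10.2"))
```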

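Connecting from Windows 10 to Docker inside the virtual machine

Set the virtual machine's network adapter to bridged mode (switch to NAT if the VM needs Internet access).

Check the virtual machine's IP address:

ifconfig

If the command fails, install the network tools package:

sudo apt install net-tools

If Docker has started the Milvus container successfully, ifconfig should now list 4 network interfaces (the one whose name looks like random characters belongs to the Milvus container).

Copy the IP address of enp0s3.

# Restart networking
sudo /etc/init.d/networking restart
# Disable the firewall
sudo ufw disable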
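
Before installing any client library, it can help to confirm from the Windows side that the Milvus port on the VM is reachable at all. This is a minimal sketch using only the Python standard library; `port_is_open` is a hypothetical helper name, and `<vm-ip>` stands for the enp0s3 address copied above.

```python
import socket


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unresolvable hosts.
        return False


if __name__ == "__main__":
    # Replace "<vm-ip>" with the enp0s3 address of the virtual machine.
    print(port_is_open("<vm-ip>", 19530))
```

A result of False usually points at the network mode, the firewall, or the container's port mapping rather than at pymilvus.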

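Verify the connection

Install pymilvus==0.2.14

pip install pymilvus==0.2.14

Connection test code: connect using the IP address found above and the port specified when starting the container.

Simple test

from milvus import Milvus
# Replace 'localhost' with the VM's IP address when connecting from another machine
client = Milvus(host='localhost', port='19530')

Official test code (its own comments say it targets Milvus 0.11.x with pymilvus 0.3.x, not the 0.10.2 / 0.2.14 versions installed above)

# example.py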

import random
from pprint import pprint

from milvus import Milvus, DataType

# ------
# Setup:
#    First of all, you need a running Milvus(0.11.x). By default, Milvus runs on localhost on port 19530.
#    Then, you can use pymilvus(0.3.x) to connect to the server. You can change the _HOST and _PORT accordingly.
# ------
_HOST = '127.0.0.1'
_PORT = '19530'
client = Milvus(_HOST, _PORT)

# ------
# Basic create collection:
#     You already have a Milvus instance running, and pymilvus connecting to Milvus.
#     The first thing we will do is to create a collection `demo_films`. In case we've already had a collection
#     named `demo_films`, we drop it before we create.
# ------
collection_name = 'demo_films'
if collection_name in client.list_collections():
    client.drop_collection(collection_name)

# ------
# Basic create collection:
#     For a specific field, you can provide extra infos by a dictionary with `key = "params"`. If the field
#     has a type of `FLOAT_VECTOR` and `BINARY_VECTOR`, "dim" must be provided in extra infos. Otherwise
#     you can provide customized infos like `{"unit": "minutes"}` for your own needs.
#
#     In our case, the extra infos in "duration" field means the unit of "duration" field is "minutes".
#     And `auto_id` in the parameter is set to `False` so that we can provide our own unique ids.
#     For more information you can refer to the pymilvus
#     documentation (https://pymilvus.readthedocs.io/en/latest/).
# ------
collection_param = {
    "fields": [
        #  Milvus doesn't support string type now, but we are considering supporting it soon.
        #  {"name": "title", "type": DataType.STRING},
        {"name": "duration", "type": DataType.INT32, "params": {"unit": "minute"}},
        {"name": "release_year", "type": DataType.INT32},
        {"name": "embedding", "type": DataType.FLOAT_VECTOR, "params": {"dim": 8}},
    ],
    "segment_row_limit": 4096,
    "auto_id": False
}

# ------
# Basic create collection:
#     After creating collection `demo_films`, we create a partition tagged "American", meaning the films we
#     will insert are American.
# ------
client.create_collection(collection_name, collection_param)
client.create_partition(collection_name, "American")

# ------
# Basic create collection:
#     You can check the collection info and partitions we've created by `get_collection_info` and
#     `list_partitions`
# ------
print("--------get collection info--------")
collection = client.get_collection_info(collection_name)
pprint(collection)
partitions = client.list_partitions(collection_name)
print("\n----------list partitions----------")
pprint(partitions)

# ------
# Basic insert entities:
#     We have three films of The_Lord_of_the_Rings series here with their id, duration, release_year
#     and fake embeddings to be inserted. They are listed below to give you an overview of the structure.
# ------
The_Lord_of_the_Rings = [
    {
        "title": "The_Fellowship_of_the_Ring",
        "id": 1,
        "duration": 208,
        "release_year": 2001,
        "embedding": [random.random() for _ in range(8)]
    },
    {
        "title": "The_Two_Towers",
        "id": 2,
        "duration": 226,
        "release_year": 2002,
        "embedding": [random.random() for _ in range(8)]
    },
    {
        "title": "The_Return_of_the_King",
        "id": 3,
        "duration": 252,
        "release_year": 2003,
        "embedding": [random.random() for _ in range(8)]
    }
]

# ------
# Basic insert entities:
#     To insert these films into Milvus, we have to group values from the same field together like below.
#     Then these grouped data are used to create `hybrid_entities`.
# ------
ids = [k.get("id") for k in The_Lord_of_the_Rings]
durations = [k.get("duration") for k in The_Lord_of_the_Rings]
release_years = [k.get("release_year") for k in The_Lord_of_the_Rings]
embeddings = [k.get("embedding") for k in The_Lord_of_the_Rings]

hybrid_entities = [
    # Milvus doesn't support string type yet, so we cannot insert "title".
    {"name": "duration", "values": durations, "type": DataType.INT32},
    {"name": "release_year", "values": release_years, "type": DataType.INT32},
    {"name": "embedding", "values": embeddings, "type": DataType.FLOAT_VECTOR},
]

# ------
# Basic insert entities:
#     We insert the `hybrid_entities` into our collection, into partition `American`, with ids we provide.
#     If the insert succeeds, the ids we provided will be returned.
# ------
ids = client.insert(collection_name, hybrid_entities, ids, partition_tag="American")
print("\n----------insert----------")
print("Films are inserted and the ids are: {}".format(ids))


# ------
# Basic insert entities:
#     After inserting entities into the collection, we need to flush the collection to make sure it's on disk,
#     so that we are able to retrieve it.
# ------
before_flush_counts = client.count_entities(collection_name)
client.flush([collection_name])
after_flush_counts = client.count_entities(collection_name)
print("\n----------flush----------")
print("There are {} films in collection `{}` before flush".format(before_flush_counts, collection_name))
print("There are {} films in collection `{}` after flush".format(after_flush_counts, collection_name))

# ------
# Basic insert entities:
#     We can get the detail of collection statistics info by `get_collection_stats`
# ------
info = client.get_collection_stats(collection_name)
print("\n----------get collection stats----------")
pprint(info)

# ------
# Basic search entities:
#     Now that we have 3 films inserted into our collection, it's time to obtain them.
#     We can get films by ids, if milvus can't find entity for a given id, `None` will be returned.
#     In the case we provide below, we will only get 1 film with id=1 and the other is `None`
# ------
films = client.get_entity_by_id(collection_name, ids=[1, 200])
print("\n----------get entity by id = 1, id = 200----------")
for film in films:
    if film is not None:
        print(" > id: {},\n > duration: {}m,\n > release_years: {},\n > embedding: {}"
              .format(film.id, film.duration, film.release_year, film.embedding))

# ------
# Basic hybrid search entities:
#      Getting films by id is not enough, we are going to get films based on vector similarities.
#      Let's say we have a film with its `embedding` and we want to find `top3` films that are most similar
#      with it by L2 distance.
#      Other than vector similarities, we also want to obtain films that:
#        `released year` term in 2002 or 2003,
#        `duration` larger than 250 minutes.
#
#      Milvus provides Query DSL(Domain Specific Language) to support structured data filtering in queries.
#      For now milvus supports TermQuery and RangeQuery, they are structured as below.
#      For more information about the meaning and other options about "must" and "bool",
#      please refer to DSL chapter of our pymilvus documentation
#      (https://pymilvus.readthedocs.io/en/latest/).
# ------
query_embedding = [random.random() for _ in range(8)]
query_hybrid = {
    "bool": {
        "must": [
            {
                "term": {"release_year": [2002, 2003]}
            },
            {
                # "GT" for greater than
                "range": {"duration": {"GT": 250}}
            },
            {
                "vector": {
                    "embedding": {"topk": 3, "query": [query_embedding], "metric_type": "L2"}
                }
            }
        ]
    }
}

# ------
# Basic hybrid search entities:
#     And we want to get all the fields back in results, so fields = ["duration", "release_year", "embedding"].
#     If the search succeeds, results will be returned.
#     `results` have `nq`(number of queries) separate results, since we only query for 1 film, The length of
#     `results` is 1.
#     We ask for top 3 in-return, but our condition is too strict while the database is too small, so we can
#     only get 1 film, which means length of `entities` in below is also 1.
#
#     Now we've gotten the results, and known it's a 1 x 1 structure, how can we get ids, distances and fields?
#     It's very simple, for every `topk_film`, it has three properties: `id, distance and entity`.
#     All fields are stored in `entity`, so you can finally obtain these data as below:
#     And the result should be film with id = 3.
# ------
results = client.search(collection_name, query_hybrid, fields=["duration", "release_year", "embedding"])
print("\n----------search----------")
for entities in results:
    for topk_film in entities:
        current_entity = topk_film.entity
        print("- id: {}".format(topk_film.id))
        print("- distance: {}".format(topk_film.distance))

        print("- release_year: {}".format(current_entity.release_year))
        print("- duration: {}".format(current_entity.duration))
        print("- embedding: {}".format(current_entity.embedding))

# ------
# Basic delete:
#     Now let's see how to delete things in Milvus.
#     You can simply delete entities by their ids.
# ------
client.delete_entity_by_id(collection_name, ids=[1, 2])
client.flush()  # flush is important
result = client.get_entity_by_id(collection_name, ids=[1, 2])

counts_delete = sum([1 for entity in result if entity is not None])
counts_in_collection = client.count_entities(collection_name)
print("\n----------delete id = 1, id = 2----------")
print("Get {} entities by id 1, 2".format(counts_delete))
print("There are {} entities after delete films with 1, 2".format(counts_in_collection))

# ------
# Basic delete:
#     You can drop partitions we create, and drop the collection we create.
# ------
client.drop_partition(collection_name, partition_tag='American')
if collection_name in client.list_collections():
    client.drop_collection(collection_name)

# ------
# Summary:
#     Now we've gone through all basic communications pymilvus can do with Milvus server, hope it's helpful!
# ------
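
To see why the hybrid query in the example is expected to return only the film with id = 3, the structured part of the filter can be evaluated locally. The sketch below is a plain-Python emulation of the "term" and "range" clauses under "must"; it is not how Milvus evaluates the DSL internally, it only supports the "GT" operator used in the example, and the name `matches` is hypothetical.

```python
import operator

# Only the range operator that appears in the example query is emulated here.
RANGE_OPS = {"GT": operator.gt}


def matches(film, must_clauses):
    """Return True if the film satisfies every term/range clause."""
    for clause in must_clauses:
        if "term" in clause:
            for field, allowed in clause["term"].items():
                if film[field] not in allowed:
                    return False
        elif "range" in clause:
            for field, bounds in clause["range"].items():
                for op_name, bound in bounds.items():
                    if not RANGE_OPS[op_name](film[field], bound):
                        return False
        # "vector" clauses rank candidates instead of filtering them,
        # so they are skipped in this emulation.
    return True


films = [
    {"id": 1, "duration": 208, "release_year": 2001},
    {"id": 2, "duration": 226, "release_year": 2002},
    {"id": 3, "duration": 252, "release_year": 2003},
]
must = [
    {"term": {"release_year": [2002, 2003]}},
    {"range": {"duration": {"GT": 250}}},
]

print([f["id"] for f in films if matches(f, must)])  # [3]
```

Film 1 fails the term clause, film 2 passes it but is shorter than 250 minutes, so only film 3 remains for the vector ranking step.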

Reference links

https://www.runoob.com/docker/ubuntu-docker-install.html

https://milvus.io/cn/docs/v0.10.2/overview.md

https://blog.csdn.net/qq632683582/article/details/107446738

https://blog.csdn.net/weixin_40816738/article/details/90605327

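A Simple Milvus Usage Tutorial

milvus admin

Installation

docker pull milvusdb/milvus-em:latest

docker run -d -p 3000:80 milvusdb/milvus-em:latest

Running

Open a browser and enter the URL: http://localhost:3000/

pymilvus

Parameters:

topk: the k vectors most similar to the target vector, specified at search time. The valid range of top_k is (0, 2048].

nprobe: the number of vector clusters examined during a query. nprobe affects query accuracy: the larger the value, the higher the accuracy and the slower the search.

metric_type: the vector similarity metric. MetricType.IP is the inner product; MetricType.L2 is the Euclidean distance.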
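
The effect of top_k and metric_type can be illustrated without a server. The following is a brute-force sketch in plain Python, not the Milvus implementation: it scores every vector against the query, orders by the chosen metric (smaller is better for L2, larger is better for IP), and keeps the first top_k. The function names are hypothetical, and the scores are computed locally, so they are not meant to reproduce the exact numbers Milvus returns.

```python
import math


def inner_product(a, b):
    return sum(x * y for x, y in zip(a, b))


def l2_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def brute_force_search(vectors, query, top_k, metric="L2"):
    """Return [(index, score), ...] for the top_k most similar vectors."""
    if not 0 < top_k <= 2048:
        raise ValueError("top_k must be in (0, 2048]")
    if metric == "L2":
        scored = [(l2_distance(v, query), i) for i, v in enumerate(vectors)]
        scored.sort(key=lambda t: t[0])                # smaller distance = more similar
    elif metric == "IP":
        scored = [(inner_product(v, query), i) for i, v in enumerate(vectors)]
        scored.sort(key=lambda t: t[0], reverse=True)  # larger product = more similar
    else:
        raise ValueError("metric must be 'L2' or 'IP'")
    return [(i, score) for score, i in scored[:top_k]]


vectors = [[1.0, 0.0], [0.0, 1.0], [3.0, 0.0]]
query = [1.0, 0.0]
print([i for i, _ in brute_force_search(vectors, query, 2, "L2")])  # [0, 1]
print([i for i, _ in brute_force_search(vectors, query, 2, "IP")])  # [2, 0]
```

The two metrics rank the same data differently: the long vector at index 2 is far away in Euclidean terms but has the largest inner product with the query.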

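Reference code found online (it uses table-based method names such as create_table and add_vectors, so it appears to target an older pymilvus release than the 0.2.14 installed above)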

# -*- coding: utf-8 -*-

# Import the required packages
import numpy as np
from milvus import Milvus, IndexType, MetricType

# Create a Milvus client; all later operations go through this object
milvus = Milvus()

# Connect to the server; the port must match the port mapping used when starting the Docker container
milvus.connect(host='localhost', port='19530')

# Number of vectors
num_vec = 5000
# Vector dimension
vec_dim = 768

# Create the table
# Parameter meanings:
# table_name: name of the table
# dimension: vector dimension
# metric_type: vector similarity metric; MetricType.IP is the inner product, MetricType.L2 is the Euclidean distance
table_param = {'table_name': 'mytable', 'dimension':vec_dim, 'index_file_size':1024, 'metric_type':MetricType.IP}
milvus.create_table(table_param)

# Randomly generate a batch of vectors
vectors_array = np.random.rand(num_vec,vec_dim)
vectors_list = vectors_array.tolist()

# The official docs recommend calling milvus.create_index before inserting vectors so the system can build the index incrementally
# Index types: FLAT / IVFLAT / IVF_SQ8 / IVF_SQ8H; FLAT is an exact index: slow, but with 100% recall
index_param = {'index_type': IndexType.FLAT, 'nlist': 128}
milvus.create_index('mytable', index_param)

# Add the vectors to the table created above
# ids may be None, in which case ids are generated automatically
status, ids = milvus.add_vectors(table_name="mytable",records=vectors_list,ids=None) # returns the ids of this batch of vectors

# The official docs recommend manually creating the same index again after the insert finishes
milvus.create_index('mytable', index_param)

# Print some statistics
status, tables = milvus.show_tables()
print("All tables:",tables)
print("Number of rows in the table: {}".format((milvus.count_table('mytable')[1])))
print("Does table mytable exist:",milvus.has_table("mytable")[1])

# Load the table into memory
milvus.preload_table('mytable')

# Create the query vector
query_vec_array = np.random.rand(1,vec_dim)
query_vec_list = query_vec_array.tolist()
# Run the query. Depending on the index type, nprobe (set here) and nlist (set when building the index) affect query speed and accuracy
# For a FLAT index, neither parameter affects the results or the speed
status, results = milvus.search(table_name='mytable', query_records=query_vec_list, top_k=4, nprobe=16)
print(status)
print(results)

# Drop the index and the table; if they are not dropped, they can be reused next time
milvus.drop_index(table_name="mytable")
milvus.delete_table(table_name="mytable")

# Disconnect
milvus.disconnect()
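
The comments above note that nlist and nprobe matter for IVF-style indexes but not for FLAT. A toy inverted-file index makes the trade-off concrete. This is a deliberately simplified sketch, not the Milvus or Faiss implementation: it uses the first nlist vectors as cluster centroids instead of running k-means, and all function names are hypothetical.

```python
import math
import random


def build_ivf(vectors, nlist):
    """Assign every vector to its nearest centroid (the first nlist vectors)."""
    centroids = vectors[:nlist]
    buckets = [[] for _ in centroids]
    for idx, vec in enumerate(vectors):
        nearest = min(range(len(centroids)), key=lambda c: math.dist(vec, centroids[c]))
        buckets[nearest].append(idx)
    return centroids, buckets


def ivf_search(vectors, index, query, top_k, nprobe):
    """Search only the nprobe clusters whose centroids are closest to the query."""
    centroids, buckets = index
    probed = sorted(range(len(centroids)), key=lambda c: math.dist(query, centroids[c]))[:nprobe]
    candidates = [i for c in probed for i in buckets[c]]
    candidates.sort(key=lambda i: math.dist(query, vectors[i]))
    return candidates[:top_k]


def brute_force(vectors, query, top_k):
    """Exact search over all vectors, comparable to a FLAT index."""
    return sorted(range(len(vectors)), key=lambda i: math.dist(query, vectors[i]))[:top_k]


if __name__ == "__main__":
    rng = random.Random(0)
    data = [[rng.random() for _ in range(8)] for _ in range(1000)]
    q = [rng.random() for _ in range(8)]
    ivf = build_ivf(data, nlist=16)
    exact = set(brute_force(data, q, 10))
    for nprobe in (1, 4, 16):
        found = set(ivf_search(data, ivf, q, 10, nprobe))
        print(nprobe, len(found & exact) / 10)  # recall against the exact result
```

With nprobe equal to nlist every cluster is scanned, so the result matches the exact search; smaller nprobe values scan fewer candidates, which is faster but may miss true neighbours.
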
posted @ 2020-12-17 09:16 by 悲惨痛苦太刀