Examples of three index types: IndexFlatL2, IndexIVFFlat, and IndexIVFPQ

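The previous post gave a brief introduction to installing Faiss and to some of its principles. This post verifies, in code, the three index types named in the title.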

First, generate the dataset: 1,000,000 vectors of 50 dimensions each, saved to a local file. The code is as follows:

import numpy as np

# Build the dataset
import time
d = 50                           # dimension
nb = 1000000                     # database size
# nq = 1000000                       # nb of queries
np.random.seed(1234)             # make reproducible
xb = np.random.random((nb, d)).astype('float32')
xb[:, 0] += np.arange(nb) / 1000.
# xq = np.random.random((nq, d)).astype('float32')
# xq[:, 0] += np.arange(nq) / 1000.

print(xb[:1])

# Write the data to a text file
np.savetxt('data.txt', xb)
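
A text file holding 1,000,000 rows of 50 floats is large and slow to parse. As an alternative to the `np.savetxt` call above, the following is a minimal sketch of storing the same kind of array in NumPy's binary `.npy` format. The file name `data.npy`, the temporary directory, and the reduced row count are assumptions made only to keep the example small; they are not part of the original script.

```python
import os
import tempfile

import numpy as np

d = 50        # dimension, as in the script above
nb = 1000     # assumption: far fewer rows than the 1,000,000 above, to keep the sketch quick
np.random.seed(1234)
xb = np.random.random((nb, d)).astype('float32')
xb[:, 0] += np.arange(nb) / 1000.

# Save in NumPy's binary .npy format; dtype and shape are stored in the file.
path = os.path.join(tempfile.mkdtemp(), 'data.npy')   # hypothetical file name
np.save(path, xb)

# Load it back; no text parsing or astype conversion is needed.
loaded = np.load(path)
print(loaded.dtype, loaded.shape)
print(np.array_equal(loaded, xb))
```

With this format the reading step reduces to a single `np.load` call, and the array comes back as `float32`, which is the dtype Faiss expects.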

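Using the dataset above, the index is queried against itself: the same data serves as both the Faiss training set and the query set. The code for all three index types lives in a single file, as shown below: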

import numpy as np
import faiss

# Read the file into a NumPy matrix
data = []
with open('data.txt', 'rb') as f:
    for line in f:
        temp = line.split()
        data.append(temp)
print(data[0])
# Data used both for training and as queries
dataArray = np.array(data).astype('float32')

# print(dataArray[0])
# print(dataArray.shape[1])
# Get the dimensionality of the data
d = dataArray.shape[1]

# IndexFlatL2 index
# # Build an IndexFlatL2 index over the vectors; it is the simplest index type and only performs brute-force L2 distance search
# index = faiss.IndexFlatL2(d)   # build the index
# index.add(dataArray)                  # add vectors to the index
#
# # we want to see the 11 nearest neighbors
# k = 11
# # search
# D, I = index.search(dataArray, k)
#
# # neighbors of the 5 first queries
# print(I[:5])

# IndexIVFFlat index
# nlist = 100 # number of cells (clusters)
# k = 11
# quantizer = faiss.IndexFlatL2(d)  # the coarse quantizer (another index); d is the vector dimension
# index = faiss.IndexIVFFlat(quantizer, d, nlist, faiss.METRIC_L2)
# # METRIC_L2 is specified explicitly here
#
# assert not index.is_trained
# index.train(dataArray)
# assert index.is_trained
# index.add(dataArray)                  # add may be a bit slower as well
# index.nprobe = 10        # number of cells visited per search (out of nlist); default nprobe is 1, try a few more
# D, I = index.search(dataArray, k)     # actual search
#
# print(I[:5]) # neighbors of the 5 first queries

# IndexIVFPQ index
nlist = 100
m = 5
k = 11
quantizer = faiss.IndexFlatL2(d)  # this remains the same
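# To scale to very large datasets, Faiss provides variants that compress the stored vectors with lossy compression based on a product quantizer.
# The cost is some loss of accuracy; because the compression is lossy, the distance from a vector to itself is also not 0.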
index = faiss.IndexIVFPQ(quantizer, d, nlist, m, 8)
# 8 specifies that each sub-vector is encoded as 8 bits
index.train(dataArray)
index.add(dataArray)
# D, I = index.search(dataArray[:5], k) # sanity check
# print(I)
# print(D)
index.nprobe = 10              # make comparable with experiment above
D, I = index.search(dataArray, k)     # search
print(I[:5])
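
To make concrete what "brute-force L2 search" means for `IndexFlatL2`, the following is a minimal NumPy sketch of the same idea: compute the squared L2 distance from every query to every database vector and keep the `k` smallest. The function name `brute_force_l2` and the small random dataset are assumptions for illustration; this is not how Faiss implements the search internally.

```python
import numpy as np

def brute_force_l2(xb, xq, k):
    """Return (D, I): squared L2 distances and indices of the k nearest rows of xb for each row of xq."""
    xb = xb.astype('float64')
    xq = xq.astype('float64')
    # ||q - b||^2 = ||q||^2 - 2 q.b + ||b||^2, evaluated for all pairs at once
    sq = (xq ** 2).sum(axis=1)[:, None] - 2.0 * (xq @ xb.T) + (xb ** 2).sum(axis=1)[None, :]
    sq = np.maximum(sq, 0.0)   # clip tiny negative values caused by rounding
    I = np.argsort(sq, axis=1, kind='stable')[:, :k]
    D = np.take_along_axis(sq, I, axis=1)
    return D, I

rng = np.random.default_rng(1234)
xb = rng.random((2000, 50)).astype('float32')   # assumption: small dataset so the sketch runs quickly
D, I = brute_force_l2(xb, xb[:5], k=11)
print(I[:, 0])   # each query is itself a database row, so its nearest neighbor should be itself
```

Because every query here is also a database vector, the first column of `I` is the query's own index. That is also why the scripts above use `k = 11`: the first hit is the vector itself, leaving 10 genuine neighbors.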

The results and run times of the three indexes are shown in the figure below:
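
Run times like these can be collected with a small wrapper around `search`. The following is a minimal sketch, not the author's measurement code: `timed_search` is a hypothetical helper that accepts any object with a `search(queries, k)` method returning `(D, I)`, so a Faiss index could be passed in directly. The `NumpyFlatL2` class is a made-up stand-in used only so the example runs without Faiss installed.

```python
import time

import numpy as np

def timed_search(index, queries, k):
    """Run index.search(queries, k) and return (D, I, elapsed_seconds)."""
    start = time.perf_counter()
    D, I = index.search(queries, k)
    return D, I, time.perf_counter() - start

class NumpyFlatL2:
    """Hypothetical stand-in with a Faiss-like add/search interface (not part of Faiss)."""
    def __init__(self, d):
        self.xb = np.empty((0, d), dtype='float32')

    def add(self, x):
        self.xb = np.vstack([self.xb, x])

    def search(self, xq, k):
        # Exact squared L2 distances between every query and every stored vector
        diff = xq[:, None, :] - self.xb[None, :, :]
        sq = (diff ** 2).sum(axis=2)
        I = np.argsort(sq, axis=1, kind='stable')[:, :k]
        return np.take_along_axis(sq, I, axis=1), I

rng = np.random.default_rng(1234)
data = rng.random((500, 50)).astype('float32')   # assumption: tiny dataset for a quick run
index = NumpyFlatL2(50)
index.add(data)
D, I, elapsed = timed_search(index, data[:20], k=11)
print(f'search took {elapsed:.4f} s')
```

Timing only the `search` call keeps training and adding out of the measurement, which makes the three index types easier to compare.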

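The results above show that adding clustering (IndexIVFFlat) runs much faster than brute-force search while returning essentially the same results. Adding both clustering and quantization (IndexIVFPQ) is faster still, but its results differ considerably from those of brute-force search. When the dataset is not very large and the dimensionality is not high, the clustering-based index (IndexIVFFlat) is sufficient and is the recommended choice.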
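
One way to put a number on "essentially the same results" is to measure how many of the exact neighbors an approximate index also returns. The following is a minimal sketch under that assumption; `overlap_at_k` is a hypothetical helper, and the two small index arrays are made-up values standing in for the `I` matrices returned by `IndexFlatL2` and by an approximate index.

```python
import numpy as np

def overlap_at_k(I_exact, I_approx):
    """Average fraction of the exact top-k neighbors that also appear in the approximate top-k."""
    hits = [len(set(e) & set(a)) for e, a in zip(I_exact, I_approx)]
    return sum(hits) / (len(hits) * I_exact.shape[1])

# Made-up neighbor indices for two queries with k = 3 (not real Faiss output)
I_exact = np.array([[0, 3, 7], [1, 4, 9]])
I_approx = np.array([[0, 7, 5], [1, 4, 9]])
score = overlap_at_k(I_exact, I_approx)
print(score)
```

A score close to 1 means the approximate index returns nearly the same neighbor sets as brute-force search; a noticeably lower score corresponds to the larger gap described for IndexIVFPQ.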

posted @ 2019-03-21 08:31 yhzhou