Hnswlib 介绍与入门使用

Hnswlib是一个强大的近邻搜索(ANN)库, 官方介绍 Header-only C++ HNSW implementation with python bindings, insertions and updates. 热门的向量数据库Milvus底层的ANN库之一就是Hnswlib, 为milvus提供HNSW检索。

HNSW 原理#

HNSW 原理

将节点划分成不同层级,贪婪地遍历来自上层的元素,直到达到局部最小值,然后切换到下一层,以上一层中的局部最小值作为新元素重新开始遍历,直到遍历完最低一层。

安装使用#

从源码安装:

Copy
apt-get install -y python-setuptools python-pip git clone https://github.com/nmslib/hnswlib.git cd hnswlib pip install .

或者直接pip安装 pip install hnswlib

python 使用#

Copy
import hnswlib import numpy as np dim = 16 num_elements = 10000 # Generating sample data data = np.float32(np.random.random((num_elements, dim))) # We split the data in two batches: data1 = data[:num_elements // 2] data2 = data[num_elements // 2:] # Declaring index p = hnswlib.Index(space='l2', dim=dim) # possible options are l2, cosine or ip # Initializing index # max_elements - the maximum number of elements (capacity). Will throw an exception if exceeded # during insertion of an element. # The capacity can be increased by saving/loading the index, see below. # # ef_construction - controls index search speed/build speed tradeoff # # M - is tightly connected with internal dimensionality of the data. Strongly affects memory consumption (~M) # Higher M leads to higher accuracy/run_time at fixed ef/efConstruction p.init_index(max_elements=num_elements//2, ef_construction=100, M=16) # Controlling the recall by setting ef: # higher ef leads to better accuracy, but slower search p.set_ef(10) # Set number of threads used during batch search/construction # By default using all available cores p.set_num_threads(4) print("Adding first batch of %d elements" % (len(data1))) p.add_items(data1) # Query the elements for themselves and measure recall: labels, distances = p.knn_query(data1, k=1) print("Recall for the first batch:", np.mean(labels.reshape(-1) == np.arange(len(data1))), "\n") # Serializing and deleting the index: index_path='first_half.bin' print("Saving index to '%s'" % index_path) p.save_index("first_half.bin") del p # Re-initializing, loading the index p = hnswlib.Index(space='l2', dim=dim) # the space can be changed - keeps the data, alters the distance function. print("\nLoading index from 'first_half.bin'\n") # Increase the total capacity (max_elements), so that it will handle the new data p.load_index("first_half.bin", max_elements = num_elements) print("Adding the second batch of %d elements" % (len(data2))) p.add_items(data2) # Query the elements for themselves and measure recall: labels, distances = p.knn_query(data, k=1) print("Recall for two batches:", np.mean(labels.reshape(-1) == np.arange(len(data))), "\n")

依次介绍:

distances#

支持三种距离算法, l2, ip内积,以及cos。

Distance parameter Equation
Squared L2 'l2' d = sum((Ai-Bi)^2)
Inner product 'ip' d = 1.0 - sum(Ai*Bi)
Cosine similarity 'cosine' d = 1.0 - sum(AiBi) / sqrt(sum(AiAi) * sum(Bi*Bi))

API#

定义 index#

Copy
p = hnswlib.Index(space='l2', dim=dim) # possible options are l2, cosine or ip

space 指定Distance算法,dim是向量的维度。

初始化索引#

Copy
p.init_index(max_elements=num_elements//2, ef_construction=100, M=16)
  • max_elements - 最大容量 (capacity),如果插入数据超过容量会报异常,可以动态扩容
  • ef_construction - 平衡索引构建速度和搜索准确率,ef_construction越大,准确率越高但是构建速度越慢。 ef_construction 提高并不能无限增加索引的质量,常见的 ef_constructio n 参数为 128。
  • M - 表示在建表期间每个向量的边数目量,M会影响内存消耗,M越高,内存占用越大,准确率越高,同时构建速度越慢。通常建议设置在 8-32 之间。

添加数据与查询数据#

Copy
# Controlling the recall by setting ef: # higher ef leads to better accuracy, but slower search p.set_ef(10) # Set number of threads used during batch search/construction # By default using all available cores p.set_num_threads(4) print("Adding first batch of %d elements" % (len(data1))) p.add_items(data1) # Query the elements for themselves and measure recall: labels, distances = p.knn_query(data1, k=1) print("Recall for the first batch:", np.mean(labels.reshape(-1) == np.arange(len(data1))), "\n")
  • p.set_ef(10):设置搜索时的最大近邻数量(ef),即在构建索引时最多保留多少个近邻。较高的ef值会导致更好的准确率,但搜索速度会变慢。
  • p.set_num_threads(4):设置在批量搜索和构建索引过程中使用的线程数。默认情况下,使用所有可用的核心。
  • p.add_items(data1):将数据添加到索引中。
  • labels, distances = p.knn_query(data1, k=1):对数据中的每个元素进行查询,找到与其最近的邻居,返回邻居的标签和距离。

保持与加载索引#

Copy
# Serializing and deleting the index: index_path='first_half.bin' print("Saving index to '%s'" % index_path) p.save_index("first_half.bin") del p # Re-initializing, loading the index p = hnswlib.Index(space='l2', dim=dim) # the space can be changed - keeps the data, alters the distance function. print("\nLoading index from 'first_half.bin'\n") # Increase the total capacity (max_elements), so that it will handle the new data p.load_index("first_half.bin", max_elements = num_elements) print("Adding the second batch of %d elements" % (len(data2))) p.add_items(data2) # Query the elements for themselves and measure recall: labels, distances = p.knn_query(data, k=1) print("Recall for two batches:", np.mean(labels.reshape(-1) == np.arange(len(data))), "\n")
  • 通过save_index保存索引
  • 然后load_index重新加载索引,只要未超过max_elements,可以再次add_items

C++使用#

官方提供了C++ 例子,创建索引、插入元素、搜索和序列化

Copy
#include "../../hnswlib/hnswlib.h" int main() { int dim = 16; // Dimension of the elements int max_elements = 10000; // Maximum number of elements, should be known beforehand int M = 16; // Tightly connected with internal dimensionality of the data // strongly affects the memory consumption int ef_construction = 200; // Controls index search speed/build speed tradeoff // Initing index hnswlib::L2Space space(dim); hnswlib::HierarchicalNSW<float>* alg_hnsw = new hnswlib::HierarchicalNSW<float>(&space, max_elements, M, ef_construction); // Generate random data std::mt19937 rng; rng.seed(47); std::uniform_real_distribution<> distrib_real; float* data = new float[dim * max_elements]; for (int i = 0; i < dim * max_elements; i++) { data[i] = distrib_real(rng); } // Add data to index for (int i = 0; i < max_elements; i++) { alg_hnsw->addPoint(data + i * dim, i); } // Query the elements for themselves and measure recall float correct = 0; for (int i = 0; i < max_elements; i++) { std::priority_queue<std::pair<float, hnswlib::labeltype>> result = alg_hnsw->searchKnn(data + i * dim, 1); hnswlib::labeltype label = result.top().second; if (label == i) correct++; } float recall = correct / max_elements; std::cout << "Recall: " << recall << "\n"; // Serialize index std::string hnsw_path = "hnsw.bin"; alg_hnsw->saveIndex(hnsw_path); delete alg_hnsw; // Deserialize index and check recall alg_hnsw = new hnswlib::HierarchicalNSW<float>(&space, hnsw_path); correct = 0; for (int i = 0; i < max_elements; i++) { std::priority_queue<std::pair<float, hnswlib::labeltype>> result = alg_hnsw->searchKnn(data + i * dim, 1); hnswlib::labeltype label = result.top().second; if (label == i) correct++; } recall = (float)correct / max_elements; std::cout << "Recall of deserialized index: " << recall << "\n"; delete[] data; delete alg_hnsw; return 0; }

Milvus 使用#

milvus 通过cgo调用knowhere,knowhere是一个向量检索的抽象封装,集成了FAISS, HNSW等开源ANN库。

knowhere 是直接将hnswlib代码引入,使用hnswlib的代码在
https://github.com/zilliztech/knowhere/blob/main/src/index/hnsw/hnsw.cc

主要是基于hnswlib的C接口,实现HnswIndexNode

Copy
namespace knowhere { class HnswIndexNode : public IndexNode { public: HnswIndexNode(const int32_t& /*version*/, const Object& object) : index_(nullptr) { search_pool_ = ThreadPool::GetGlobalSearchThreadPool(); } Status Train(const DataSet& dataset, const Config& cfg) override { auto rows = dataset.GetRows(); auto dim = dataset.GetDim(); auto hnsw_cfg = static_cast<const HnswConfig&>(cfg); hnswlib::SpaceInterface<float>* space = nullptr; if (IsMetricType(hnsw_cfg.metric_type.value(), metric::L2)) { space = new (std::nothrow) hnswlib::L2Space(dim); } else if (IsMetricType(hnsw_cfg.metric_type.value(), metric::IP)) { space = new (std::nothrow) hnswlib::InnerProductSpace(dim); } else if (IsMetricType(hnsw_cfg.metric_type.value(), metric::COSINE)) { space = new (std::nothrow) hnswlib::CosineSpace(dim); } else if (IsMetricType(hnsw_cfg.metric_type.value(), metric::HAMMING)) { space = new (std::nothrow) hnswlib::HammingSpace(dim); } else if (IsMetricType(hnsw_cfg.metric_type.value(), metric::JACCARD)) { space = new (std::nothrow) hnswlib::JaccardSpace(dim); } else { LOG_KNOWHERE_WARNING_ << "metric type not support in hnsw: " << hnsw_cfg.metric_type.value(); return Status::invalid_metric_type; } auto index = new (std::nothrow) hnswlib::HierarchicalNSW<float>(space, rows, hnsw_cfg.M.value(), hnsw_cfg.efConstruction.value()); if (index == nullptr) { LOG_KNOWHERE_WARNING_ << "memory malloc error."; return Status::malloc_error; } if (this->index_) { delete this->index_; LOG_KNOWHERE_WARNING_ << "index not empty, deleted old index"; } this->index_ = index; return Status::success; } Status Add(const DataSet& dataset, const Config& cfg) override { // ... std::atomic<uint64_t> counter{0}; uint64_t one_tenth_row = rows / 10; for (int i = 1; i < rows; ++i) { futures.emplace_back(build_pool->push([&, idx = i]() { index_->addPoint(((const char*)tensor + index_->data_size_ * idx), idx); uint64_t added = counter.fetch_add(1); if (added % one_tenth_row == 0) { LOG_KNOWHERE_INFO_ << "HNSW build progress: " << (added / one_tenth_row) << "0%"; } })); } // ... }

其他实现#

关注作者

欢迎关注作者微信公众号, 一起交流软件开发:欢迎关注作者微信公众号

posted @   JadePeng  阅读(3099)  评论(0编辑  收藏  举报
相关博文:
阅读排行:
· 全程不用写代码,我用AI程序员写了一个飞机大战
· DeepSeek 开源周回顾「GitHub 热点速览」
· MongoDB 8.0这个新功能碉堡了,比商业数据库还牛
· 记一次.NET内存居高不下排查解决与启示
· 白话解读 Dapr 1.15:你的「微服务管家」又秀新绝活了
点击右上角即可分享
微信分享提示
CONTENTS