Hnswlib - fast approximate nearest neighbor search

Header-only C++ HNSW implementation with python bindings.

NEWS:

  • Hnswlib is now 0.5.2. Bugfixes - thanks @marekhanus for fixing the missing arguments, adding support for Python 3.8 and 3.9 in Travis, improving the python wrapper, and fixing typos/code style; @apoorv-sharma for fixing the bug in the insertion/deletion logic; @shengjun1985 for simplifying the memory reallocation logic; @TakaakiFuruse for an improved description of add_items; @psobot for improving error handling; @ShuAiii for reporting the bug in the python interface.

  • Hnswlib is now 0.5.0. Added support for pickling indices, support for PEP-517 and PEP-518 building, small speedups, bug and documentation fixes. Many thanks to @dbespalov, @dyashuni, @groodt, @uestc-lfs, @vinnitu, @fabiencastan, @JinHai-CN, @js1010!

  • Thanks to Apoorv Sharma @apoorv-sharma, hnswlib now supports true element updates (the interface remains the same, but performance/memory should not degrade as you update the element embeddings).

  • Thanks to Dmitry @2ooom, hnswlib got a boost in performance for vector dimensions that are not a multiple of 4.

  • Thanks to Louis Abraham (@louisabraham) hnswlib can now be installed via pip!

Highlights:

  1. Lightweight, header-only, no dependencies other than C++11.
  2. Interfaces for C++, python and R (https://github.com/jlmelville/rcpphnsw).
  3. Full support for incremental index construction. Support for element deletions (currently, without actual freeing of the memory).
  4. Can work with custom user-defined distances (C++).
  5. Significantly smaller memory footprint and faster build time compared to the current nmslib implementation.

Description of the algorithm parameters can be found in ALGO_PARAMS.md.

Python bindings

Supported distances:

Distance            Parameter   Equation
Squared L2          'l2'        d = sum((Ai-Bi)^2)
Inner product       'ip'        d = 1.0 - sum(Ai*Bi)
Cosine similarity   'cosine'    d = 1.0 - sum(Ai*Bi) / sqrt(sum(Ai*Ai) * sum(Bi*Bi))

Note that inner product is not an actual metric. An element can be closer to some other element than to itself. That allows some speedup if you remove all elements that are not the closest to themselves from the index.
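
As a quick sanity check, here is a minimal sketch (not part of the library; all sizes and names are illustrative) that compares the distances reported by knn_query against brute-force numpy evaluations of the formulas in the table above. Since HNSW search is approximate, the values should match on this small dataset but are not guaranteed to in general.

  import hnswlib
  import numpy as np

  dim = 8
  data = np.float32(np.random.random((100, dim)))
  query = data[:1]              # query with the first stored vector
  ids = np.arange(len(data))

  for space in ('l2', 'ip', 'cosine'):
      p = hnswlib.Index(space=space, dim=dim)
      p.init_index(max_elements=len(data))
      p.add_items(data, ids)
      p.set_ef(50)              # higher ef -> more exhaustive search for this check
      labels, distances = p.knn_query(query, k=1)

      # Brute-force reference distances from the table above
      if space == 'l2':
          ref = np.sum((data - query) ** 2, axis=1)                       # d = sum((Ai-Bi)^2)
      elif space == 'ip':
          ref = 1.0 - data @ query[0]                                     # d = 1.0 - sum(Ai*Bi)
      else:  # 'cosine'
          norms = np.linalg.norm(data, axis=1) * np.linalg.norm(query[0])
          ref = 1.0 - (data @ query[0]) / norms                           # d = 1.0 - cosine similarity
      print(space, float(distances[0][0]), float(ref.min()))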

For other spaces use the nmslib library https://github.com/nmslib/nmslib.

Short API description

  • hnswlib.Index(space, dim) creates a non-initialized HNSW index in space space with integer dimension dim.

hnswlib.Index methods:

  • init_index(max_elements, M = 16, ef_construction = 200, random_seed = 100) initializes the index with no elements.

    • max_elements defines the maximum number of elements that can be stored in the structure (can be increased/shrunk).
    • ef_construction defines a construction time/accuracy trade-off (see ALGO_PARAMS.md).
    • M defines the maximum number of outgoing connections in the graph (see ALGO_PARAMS.md).
  • add_items(data, ids, num_threads = -1) - inserts the data (numpy array of vectors, shape: N*dim) into the structure.

    • num_threads sets the number of cpu threads to use (-1 means use default).
    • ids are optional N-size numpy array of integer labels for all elements in data.
      • If the index already has elements with the same labels, their features will be updated. Note that the update procedure is slower than insertion of a new element, but more memory- and query-efficient.
    • Thread-safe with other add_items calls, but not with knn_query.
  • mark_deleted(label) - marks the element as deleted, so it will be omitted from search results (see the deletion/resize sketch after this list).

  • resize_index(new_size) - changes the maximum capacity of the index. Not thread safe with add_items and knn_query.

  • set_ef(ef) - sets the query time accuracy/speed trade-off, defined by the ef parameter (see ALGO_PARAMS.md). Note that the parameter is currently not saved along with the index, so you need to set it manually after loading.

  • knn_query(data, k = 1, num_threads = -1) - makes a batch query for the k closest elements for each element of data (shape: N*dim). Returns two numpy arrays, labels and distances, each of shape N*k.

    • num_threads sets the number of cpu threads to use (-1 means use default).
    • Thread-safe with other knn_query calls, but not with add_items.
  • load_index(path_to_index, max_elements = 0) loads the index from persistence to the uninitialized index.

    • max_elements(optional) resets the maximum number of elements in the structure.
  • save_index(path_to_index) saves the index to persistence.

  • set_num_threads(num_threads) sets the default number of cpu threads used during data insertion/querying.

  • get_items(ids) - returns a numpy array (shape:N*dim) of vectors that have integer identifiers specified in ids numpy vector (shape:N). Note that for cosine similarity it currently returns normalized vectors.

  • get_ids_list() - returns a list of all elements' ids.

  • get_max_elements() - returns the current capacity of the index.

  • get_current_count() - returns the current number of elements stored in the index.
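
The deletion and resize methods above are not exercised in the examples below, so here is a minimal sketch (sizes and labels are illustrative assumptions) showing mark_deleted, resize_index, get_items, and get_ids_list together:

  import hnswlib
  import numpy as np

  dim = 16
  data = np.float32(np.random.random((100, dim)))
  labels = np.arange(100)

  p = hnswlib.Index(space='l2', dim=dim)
  p.init_index(max_elements=100, ef_construction=100, M=16)
  p.add_items(data, labels)

  # Mark element 0 as deleted: it is omitted from search results,
  # but its memory is not freed and it still occupies a slot in the index.
  p.mark_deleted(0)
  found, _ = p.knn_query(data[:1], k=1)
  assert found[0][0] != 0

  # Grow the capacity before adding more elements
  # (not thread-safe with concurrent add_items/knn_query calls).
  p.resize_index(200)
  print(p.get_max_elements())    # 200
  print(p.get_current_count())   # 100 (mark_deleted does not remove elements)

  # Retrieve stored vectors by their integer ids
  vectors = p.get_items(labels[:5])
  print(vectors.shape)           # (5, 16)
  print(len(p.get_ids_list()))   # 100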

Read-only properties of hnswlib.Index class:

  • space - name of the space (can be one of "l2", "ip", or "cosine").

  • dim - dimensionality of the space.

  • M - parameter that defines the maximum number of outgoing connections in the graph.

  • ef_construction - parameter that controls speed/accuracy trade-off during the index construction.

  • max_elements - current capacity of the index. Equivalent to p.get_max_elements().

  • element_count - number of items in the index. Equivalent to p.get_current_count().

Properties of hnswlib.Index that support reading and writing:

  • ef - parameter controlling query time/accuracy trade-off.

  • num_threads - default number of threads to use in add_items or knn_query. Note that calling p.set_num_threads(3) is equivalent to p.num_threads = 3 (see the sketch below).
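
A minimal sketch of the setter/property equivalence (the tiny index here is only for illustration):

  import hnswlib

  p = hnswlib.Index(space='l2', dim=16)
  p.init_index(max_elements=10)

  p.num_threads = 2   # same effect as p.set_num_threads(2)
  p.ef = 50           # same effect as p.set_ef(50)
  print(p.ef, p.num_threads)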

Python bindings examples

  import hnswlib
  import numpy as np
  import pickle

  dim = 128
  num_elements = 10000

  # Generating sample data
  data = np.float32(np.random.random((num_elements, dim)))
  ids = np.arange(num_elements)

  # Declaring index
  p = hnswlib.Index(space = 'l2', dim = dim)  # possible options are l2, cosine or ip

  # Initializing index - the maximum number of elements should be known beforehand
  p.init_index(max_elements = num_elements, ef_construction = 200, M = 16)

  # Element insertion (can be called several times):
  p.add_items(data, ids)

  # Controlling the recall by setting ef:
  p.set_ef(50)  # ef should always be > k

  # Query dataset, k - number of closest elements (returns 2 numpy arrays)
  labels, distances = p.knn_query(data, k = 1)

  # Index objects support pickling
  # WARNING: serialization via pickle.dumps(p) or p.__getstate__() is NOT thread-safe with p.add_items method!
  # Note: ef parameter is included in serialization; random number generator is initialized with random_seed on Index load
  p_copy = pickle.loads(pickle.dumps(p))  # creates a copy of index p using pickle round-trip

  ### Index parameters are exposed as class properties:
  print(f"Parameters passed to constructor: space={p_copy.space}, dim={p_copy.dim}")
  print(f"Index construction: M={p_copy.M}, ef_construction={p_copy.ef_construction}")
  print(f"Index size is {p_copy.element_count} and index capacity is {p_copy.max_elements}")
  print(f"Search speed/quality trade-off parameter: ef={p_copy.ef}")

An example with updates after serialization/deserialization:

  import hnswlib
  import numpy as np

  dim = 16
  num_elements = 10000

  # Generating sample data
  data = np.float32(np.random.random((num_elements, dim)))

  # We split the data in two batches:
  data1 = data[:num_elements // 2]
  data2 = data[num_elements // 2:]

  # Declaring index
  p = hnswlib.Index(space='l2', dim=dim)  # possible options are l2, cosine or ip

  # Initializing index
  # max_elements - the maximum number of elements (capacity). Will throw an exception if exceeded
  # during insertion of an element.
  # The capacity can be increased by saving/loading the index, see below.
  #
  # ef_construction - controls index search speed/build speed tradeoff
  #
  # M - is tightly connected with internal dimensionality of the data. Strongly affects memory consumption (~M)
  # Higher M leads to higher accuracy/run_time at fixed ef/efConstruction

  p.init_index(max_elements=num_elements//2, ef_construction=100, M=16)

  # Controlling the recall by setting ef:
  # higher ef leads to better accuracy, but slower search
  p.set_ef(10)

  # Set number of threads used during batch search/construction
  # By default using all available cores
  p.set_num_threads(4)

  print("Adding first batch of %d elements" % (len(data1)))
  p.add_items(data1)

  # Query the elements for themselves and measure recall:
  labels, distances = p.knn_query(data1, k=1)
  print("Recall for the first batch:", np.mean(labels.reshape(-1) == np.arange(len(data1))), "\n")

  # Serializing and deleting the index:
  index_path = 'first_half.bin'
  print("Saving index to '%s'" % index_path)
  p.save_index(index_path)
  del p

  # Re-initializing, loading the index
  p = hnswlib.Index(space='l2', dim=dim)  # the space can be changed - keeps the data, alters the distance function.

  print("\nLoading index from '%s'\n" % index_path)

  # Increase the total capacity (max_elements), so that it will handle the new data
  p.load_index(index_path, max_elements=num_elements)

  print("Adding the second batch of %d elements" % (len(data2)))
  p.add_items(data2)

  # Query the elements for themselves and measure recall:
  labels, distances = p.knn_query(data, k=1)
  print("Recall for two batches:", np.mean(labels.reshape(-1) == np.arange(len(data))), "\n")

Bindings installation

You can install from sources:

  apt-get install -y python-setuptools python-pip
  git clone https://github.com/nmslib/hnswlib.git
  cd hnswlib
  pip install .

or you can install via pip: pip install hnswlib

Other implementations

Contributing to the repository

Contributions are highly welcome!

Please make pull requests against the develop branch.

200M SIFT test reproduction

To download and extract the bigann dataset (from root directory):

  python3 download_bigann.py

To compile:

  mkdir build
  cd build
  cmake ..
  make all

To run the test on 200M SIFT subset:

  ./main

The size of the BigANN subset (in millions) is controlled by the variable subset_size_millions hardcoded in sift_1b.cpp.
