antgroup/vsag

What is VSAG

VSAG is a vector indexing library used for similarity search. Its indexing algorithms let users search vector sets of various sizes, especially those that cannot fit in memory. The library can also generate index parameters from the vector dimension and data scale, so developers can use it without understanding the underlying algorithms. VSAG is written in C++ and ships with a Python wrapper package called pyvsag.
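
For example, assuming the parameter-generation helper exposed in the public header (vsag::generate_build_parameters; verify the exact signature against the current vsag.h), obtaining ready-to-use build parameters looks roughly like this sketch:

// parameter-generation sketch; helper name and signature are assumptions
#include <vsag/vsag.h>

#include <iostream>

int main() {
    // Ask VSAG to suggest build parameters for an L2 index holding
    // 10,000 vectors of dimension 128; the helper returns a JSON
    // string that can be passed to vsag::Factory::CreateIndex.
    auto parameters = vsag::generate_build_parameters("l2",
                                                      /*num_elements=*/10000,
                                                      /*dim=*/128);
    if (parameters.has_value()) {
        std::cout << parameters.value() << std::endl;
    }
    return 0;
}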

Performance

The VSAG algorithm delivers a significant efficiency boost and outperforms the previous state-of-the-art (SOTA) by a clear margin. Specifically, VSAG's QPS exceeds that of the previous SOTA algorithm, Glass, by over 100%, and that of the baseline algorithm, HNSWLIB, by over 300%, according to the ann-benchmarks result on the GIST dataset at 90% recall. The benchmark was run on an AWS r6i.16xlarge instance with --parallelism 31, a single CPU per algorithm, and hyperthreading disabled. The result is as follows:

[Figure: ann-benchmarks QPS vs. recall on gist-960-euclidean]

Getting Started

Integrate with CMake

# CMakeLists.txt
cmake_minimum_required(VERSION 3.11)

project (myproject)

set (CMAKE_CXX_STANDARD 11)

# download and compile vsag
include (FetchContent)
FetchContent_Declare (
  vsag
  GIT_REPOSITORY https://github.com/antgroup/vsag
  GIT_TAG main
)
FetchContent_MakeAvailable (vsag)
# compile executable and link to vsag
add_executable (vsag-cmake-example src/main.cpp)
target_include_directories (vsag-cmake-example PRIVATE ${vsag_SOURCE_DIR}/include)
target_link_libraries (vsag-cmake-example PRIVATE vsag)

# add dependency
add_dependencies (vsag-cmake-example vsag)
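
The CMake example above expects a src/main.cpp. A minimal smoke test to confirm that the project compiles and links against VSAG could look like the sketch below (the index name and JSON parameters mirror the repository's HNSW example and are illustrative):

// src/main.cpp
#include <vsag/vsag.h>

#include <iostream>

int main() {
    // Create an empty HNSW index just to confirm that the library
    // links correctly; the parameters here are illustrative.
    auto index = vsag::Factory::CreateIndex("hnsw", R"({
        "dtype": "float32",
        "metric_type": "l2",
        "dim": 128,
        "hnsw": {"max_degree": 16, "ef_construction": 100}
    })");
    std::cout << (index.has_value() ? "vsag index created"
                                    : "failed to create index")
              << std::endl;
    return 0;
}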

Examples

Currently, Python and C++ examples are provided; please explore the examples directory for details.

We suggest you start with 101_index_hnsw.cpp and example_hnsw.py.
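
To give a flavor of the C++ API, the HNSW example boils down to roughly the following condensed sketch (consult 101_index_hnsw.cpp for the authoritative version; API details may differ across releases):

#include <vsag/vsag.h>

#include <iostream>
#include <random>
#include <vector>

int main() {
    int64_t num_vectors = 1000, dim = 128;

    // Generate random base vectors with sequential ids.
    std::vector<int64_t> ids(num_vectors);
    std::vector<float> vectors(num_vectors * dim);
    std::mt19937 rng(47);
    std::uniform_real_distribution<float> dist(-1.0f, 1.0f);
    for (int64_t i = 0; i < num_vectors; ++i) ids[i] = i;
    for (auto& v : vectors) v = dist(rng);

    // Wrap the raw arrays in a Dataset; Owner(false) means the
    // caller keeps ownership of the memory.
    auto base = vsag::Dataset::Make();
    base->NumElements(num_vectors)
        ->Dim(dim)
        ->Ids(ids.data())
        ->Float32Vectors(vectors.data())
        ->Owner(false);

    // Create and build an HNSW index (parameters are illustrative).
    auto index = vsag::Factory::CreateIndex("hnsw", R"({
        "dtype": "float32",
        "metric_type": "l2",
        "dim": 128,
        "hnsw": {"max_degree": 16, "ef_construction": 100}
    })").value();
    index->Build(base);

    // Search: use the first base vector as the query.
    auto query = vsag::Dataset::Make();
    query->NumElements(1)
         ->Dim(dim)
         ->Float32Vectors(vectors.data())
         ->Owner(false);
    auto result = index->KnnSearch(query, /*k=*/10,
                                   R"({"hnsw": {"ef_search": 100}})");

    // Print the ids of the nearest neighbors.
    if (result.has_value()) {
        for (int64_t i = 0; i < result.value()->GetDim(); ++i) {
            std::cout << result.value()->GetIds()[i] << " ";
        }
        std::cout << std::endl;
    }
    return 0;
}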

Building from Source

Please read the DEVELOPMENT guide for instructions on how to build.

Who's Using VSAG

If your system uses VSAG, feel free to open a pull request to add it to the list.

How to Contribute

Although VSAG was initially developed by the Vector Database Team at Ant Group, it is the work of the community, and contributions are always welcome! See CONTRIBUTING for ways to get started.

Community

Discord

Thrive together in the VSAG community with users and developers from all around the world.

Roadmap

  • v0.13 (ETA: Jan. 2025)
    • introduce a new index, AllScanIndex, that supports brute-force search and reading raw vectors
    • support in-place updates on HNSW
    • support automatic optimization of graph indexes
  • v0.14 (ETA: Mar. 2025)
    • support an inverted index (similar to IVFFlat) based on the datacell abstraction
    • support storing extra info alongside each vector
    • implement a new MultiIndex that supports efficient pre-filtering on enumerable tags
  • v0.15 (ETA: Apr. 2025)
    • support sparse vector search
    • introduce pluggable product quantization (PQ) in the datacell

Reference

Please cite the following reference when you use VSAG in a research paper:

@article{Yang2024EffectiveAG,
  title={Effective and General Distance Computation for Approximate Nearest Neighbor Search},
  author={Mingyu Yang and Wentao Li and Jiabao Jin and Xiaoyao Zhong and Xiangyu Wang and Zhitao Shen and Wei Jia and Wei Wang},
  year={2024},
  url={https://arxiv.org/abs/2404.16322}
}

For the implementation of RaBitQ quantization in VSAG, please cite:

@article{10.1145/3654970,
  author = {Gao, Jianyang and Long, Cheng},
  title = {RaBitQ: Quantizing High-Dimensional Vectors with a Theoretical Error Bound for Approximate Nearest Neighbor Search},
  year = {2024},
  issue_date = {June 2024},
  publisher = {Association for Computing Machinery},
  address = {New York, NY, USA},
  volume = {2},
  number = {3},
  url = {https://doi.org/10.1145/3654970},
  doi = {10.1145/3654970},
  journal = {Proc. ACM Manag. Data},
  month = may,
  articleno = {167},
  numpages = {27},
  keywords = {Johnson-Lindenstrauss transformation, approximate nearest neighbor search, quantization}
}

Star History

[Star history chart]

License

Apache License 2.0