Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

introduce optimized nndescent to build graph index #272

Open
wxyucs opened this issue Dec 27, 2024 · 1 comment · May be fixed by #287
Open

introduce optimized nndescent to build graph index #272

wxyucs opened this issue Dec 27, 2024 · 1 comment · May be fixed by #287
Assignees
Labels
kind/feature New feature or request version/0.13

Comments

@wxyucs
Copy link
Collaborator

wxyucs commented Dec 27, 2024

No description provided.

@wxyucs wxyucs added kind/feature New feature or request version/0.13 labels Dec 27, 2024
@wxyucs wxyucs changed the title introduce nndecent to build graph index introduce nndescent to build graph index Dec 27, 2024
@inabao
Copy link
Collaborator

inabao commented Dec 31, 2024

Algorithm Design and Related Issues

Building a proximal graph for search involves addressing two key issues: finding approximate K-nearest neighbors (KNN) and ensuring graph pruning while maintaining connectivity. For the former, NSG and NNDescent primarily obtain an approximate KNN graph through iterative methods, while HNSW and Vamana derive an optimal KNN graph mainly through search methods. For the latter, HNSW and Vamana ensure the reachability of each point by inserting reverse edges during the search process, whereas NSG ensures reachability through searches by constructing a tree and connecting each search node to the tree.

Despite this, the aforementioned methods still have some issues:

There are no constraints on the in-degree of nodes, which can result in some points having an in-degree of zero.
The edge pruning during the addition of reverse edges involves redundant calculations.
Concurrency performance is suboptimal.
Therefore, we aim to propose a new graph construction algorithm called Optimized NNDescent (ODescent) to explicitly address the above issues.

Algorithm Implementation

The algorithm mainly consists of three parts:

Finding the nearest neighbors through iterative methods.
The method for finding nearest neighbors relies on NNDescent, which iteratively refines the results.
In each iteration, checking the in-degree to ensure that no point has an in-degree below a specified threshold.
If a point A's in-degree falls below the threshold, point A replaces the furthest neighbor of its nearest neighbor B.
Pruning and adding reverse edges.
Edges are pruned based on the given parameter α, and reverse edges are connected to further ensure connectivity.
Experimental Results
GIST

#./build-release/tools/test_performance     '/tmp/dataset/gist-1m-960-euclidean.hdf5'     'build' 'diskann' '{"dim": 960, "dtype": "float32", "metric_type": "l2", "hnsw": {"max_degree": 12, "ef_construction": 100}, "diskann": {"max_degree": 24, "ef_construction": 100, "graph_type":"vamana", "alpha":1.2, "turn": 20, "pq_dims": 64, "pq_sample_rate": 0.1}}' '{"hnsw":{"ef_search":100},"diskann":{"ef_search":100,"beam_search":4,"io_limit":200,"use_reorder":true}}'
[2024-12-31 13:18:02.090] [info] diskann build full (graph) cost 2141.208s
[2024-12-31 13:20:05.831] [info] diskann build full (pq) cost 123.741s
[2024-12-31 13:20:11.570] [info] diskann build full (disk layout) cost 5.740s
[2024-12-31 13:20:19.164] [info] diskann serialize cost 2.466s
{
    "build_parameters": "{\"dim\": 960, \"dtype\": \"float32\", \"metric_type\": \"l2\", \"hnsw\": {\"max_degree\": 12, \"ef_construction\": 100}, \"diskann\": {\"max_degree\": 24, \"ef_construction\": 100, \"graph_type\":\"vamana\", \"alpha\":1.2, \"turn\": 20, \"pq_dims\": 64, \"pq_sample_rate\": 0.1}}",
    "build_time_in_second": 2275.817019665,
    "dataset": "/tmp/dataset/gist-1m-960-euclidean.hdf5",
    "index_name": "diskann",
    "num_base": 1000000,
    "tps": 439.40263710093876
}





#./build-release/tools/test_performance     '/tmp/dataset/gist-1m-960-euclidean.hdf5'     'build' 'diskann' '{"dim": 960, "dtype": "float32", "metric_type": "l2", "hnsw": {"max_degree": 12, "ef_construction": 100}, "diskann": {"max_degree": 24, "ef_construction": 100, "graph_type":"odescent", "alpha":1.2, "turn": 20, "pq_dims": 64, "pq_sample_rate": 0.1}}' '{"hnsw":{"ef_search":100},"diskann":{"ef_search":100,"beam_search":4,"io_limit":200,"use_reorder":true}}'
[2024-12-31 14:45:13.505] [info] odescent build full (graph) cost 2170.382s
[2024-12-31 14:46:57.184] [info] diskann build full (pq) cost 103.679s
[2024-12-31 14:47:00.827] [info] diskann build full (disk layout) cost 3.643s
[2024-12-31 14:47:04.318] [info] diskann serialize cost 921.343ms
{
    "build_parameters": "{\"dim\": 960, \"dtype\": \"float32\", \"metric_type\": \"l2\", \"hnsw\": {\"max_degree\": 12, \"ef_construction\": 100}, \"diskann\": {\"max_degree\": 24, \"ef_construction\": 100, \"graph_type\":\"odescent\", \"alpha\":1.2, \"turn\": 20, \"pq_dims\": 64, \"pq_sample_rate\": 0.1}}",
    "build_time_in_second": 2280.273783652,
    "dataset": "/tmp/dataset/gist-1m-960-euclidean.hdf5",
    "index_name": "diskann",
    "num_base": 1000000,
    "tps": 438.543830644072
}

#./build-release/tools/test_performance     '/tmp/dataset/gist-1m-960-euclidean.hdf5'     'search' 'diskann' '{"dim": 960, "dtype": "float32", "metric_type": "l2", "hnsw": {"max_degree": 12, "ef_construction": 100}, "diskann": {"max_degree": 24, "ef_construction": 100, "graph_type":"vamana", "alpha":1.2, "turn": 20, "pq_dims": 64, "pq_sample_rate": 0.1}}' '{"hnsw":{"ef_search":100},"diskann":{"ef_search":100,"beam_search":4,"io_limit":200,"use_reorder":true}}'
[2024-12-31 14:48:47.351] [info] diskann deserialize cost 174.976ms
{
    "correct": 903,
    "dataset": "/tmp/dataset/gist-1m-960-euclidean.hdf5",
    "estimate_used_memory": 74952960,
    "index_name": "diskann",
    "memory": 125366272,
    "num_query": 1000,
    "qps": 722.1118474664544,
    "recall": 0.902999997138977,
    "search_parameters": "{\"hnsw\":{\"ef_search\":100},\"diskann\":{\"ef_search\":100,\"beam_search\":4,\"io_limit\":200,\"use_reorder\":true}}",
    "search_time_in_second": 1.384827023,
    "top_k": 1
}

SIFT

#./build-release/tools/test_performance     '/tmp/dataset/sift-1m-128-euclidean.hdf5'     'build' 'diskann' '{"dim": 128, "dtype": "float32", "metric_type": "l2", "hnsw": {"max_degree": 12, "ef_construction": 100}, "diskann": {"max_degree": 24, "ef_construction": 200, "graph_type":"vamana", "alpha":1.2, "turn": 20, "pq_dims": 64, "pq_sample_rate": 0.1}}' '{"hnsw":{"ef_search":100},"diskann":{"ef_search":100,"beam_search":4,"io_limit":200,"use_reorder":true}}'
[2024-12-31 15:22:07.361] [info] diskann build full (graph) cost 821.301s
[2024-12-31 15:24:25.050] [info] diskann build full (pq) cost 137.689s
[2024-12-31 15:24:26.217] [info] diskann build full (disk layout) cost 1.167s
[2024-12-31 15:24:27.774] [info] diskann serialize cost 444.411ms
{
    "build_parameters": "{\"dim\": 128, \"dtype\": \"float32\", \"metric_type\": \"l2\", \"hnsw\": {\"max_degree\": 12, \"ef_construction\": 100}, \"diskann\": {\"max_degree\": 24, \"ef_construction\": 200, \"graph_type\":\"vamana\", \"alpha\":1.2, \"turn\": 20, \"pq_dims\": 64, \"pq_sample_rate\": 0.1}}",
    "build_time_in_second": 961.269516893,
    "dataset": "/tmp/dataset/sift-1m-128-euclidean.hdf5",
    "index_name": "diskann",
    "num_base": 1000000,
    "tps": 1040.290971914083
}


#./build-release/tools/test_performance     '/tmp/dataset/sift-1m-128-euclidean.hdf5'     'search' 'diskann' '{"dim": 128, "dtype": "float32", "metric_type": "l2", "hnsw": {"max_degree": 12, "ef_construction": 100}, "diskann": {"max_degree": 24, "ef_construction": 200, "graph_type":"vamana", "alpha":1.2, "turn": 20, "pq_dims": 64, "pq_sample_rate": 0.1}}' '{"hnsw":{"ef_search":100},"diskann":{"ef_search":100,"beam_search":4,"io_limit":200,"use_reorder":true}}'
[2024-12-31 15:25:05.247] [info] diskann deserialize cost 173.148ms
{
    "correct": 9851,
    "dataset": "/tmp/dataset/sift-1m-128-euclidean.hdf5",
    "estimate_used_memory": 72393728,
    "index_name": "diskann",
    "memory": 117366784,
    "num_query": 10000,
    "qps": 897.6470347211786,
    "recall": 0.9850999712944031,
    "search_parameters": "{\"hnsw\":{\"ef_search\":100},\"diskann\":{\"ef_search\":100,\"beam_search\":4,\"io_limit\":200,\"use_reorder\":true}}",
    "search_time_in_second": 11.14023621,
    "top_k": 1
}

@inabao inabao changed the title introduce nndescent to build graph index introduce optimized nndescent to build graph index Dec 31, 2024
@inabao inabao linked a pull request Dec 31, 2024 that will close this issue
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/feature New feature or request version/0.13
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants