Partitioned HNSW Deeplake Side Changes. #2847
Open
🚀 🚀 Pull Request
This PR is the deeplake side implementation of partitioned HNSW. With partitioned HNSW, the HNSW index is divided into a number of partitions. This is done when the data is large and the index has to scale. A single HNSW index does not scale well, so partitioning is a way to accommodate large amounts of data.
Partitions are defined in the index params. For example, if we create 5 partitions and the dataset has 1,000,000 rows, then each partition will have 200,000 rows.
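The row arithmetic above can be sketched in a few lines. This is only an illustration: `partition_ranges` is a hypothetical helper, not part of deeplake, and both the contiguous row assignment and the handling of a remainder (spread over the first partitions) are assumptions. The PR only states the evenly divisible case.

```python
def partition_ranges(num_rows: int, partitions: int) -> list[range]:
    """Split row indices 0..num_rows-1 into contiguous, near-equal ranges.

    Hypothetical helper for illustration; the real partitioning scheme
    used by the index may differ.
    """
    if partitions < 1:
        raise ValueError("partitions must be >= 1")
    base, extra = divmod(num_rows, partitions)
    ranges = []
    start = 0
    for i in range(partitions):
        # Assumption: the first `extra` partitions take one additional row.
        size = base + (1 if i < extra else 0)
        ranges.append(range(start, start + size))
        start += size
    return ranges


sizes = [len(r) for r in partition_ranges(1_000_000, 5)]
print(sizes)  # [200000, 200000, 200000, 200000, 200000]
```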
Through the VectorStore API:
vs = VectorStore(
    path=dest,
    exec_option="compute_engine",
    index_params={
        "threshold": 1,
        "distance_metric": "COS",
        "additional_params": {
            "efConstruction": 200,
            "M": 16,
            "partitions": 5,
        },
    },
    token=TOKEN,
    verbose=True,
    overwrite=True,
)
Through the Deeplake API:
ds = vs.dataset
params = {
    "efConstruction": 200,
    "M": 16,
    "partitions": 32,
}
ds.embedding.create_vdb_index("hnsw_1", distance="cosine_similarity", additional_params=params)
Querying is unchanged: the TQL query is sent to all partitions simultaneously, and the best matches across partitions are returned.
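The fan-out and merge described above can be sketched as follows. This is a minimal illustration and not the deeplake implementation: each partition is searched with a brute-force cosine scan standing in for an HNSW search, the thread pool only mimics querying partitions at the same time, and the names `search_partition` and `search_all` are hypothetical.

```python
import heapq
import math
from concurrent.futures import ThreadPoolExecutor


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0


def search_partition(partition, query, k):
    """Brute-force stand-in for one partition's HNSW search.

    Returns the top-k (score, row_id) pairs for this partition.
    """
    scored = ((cosine(vec, query), row_id) for row_id, vec in partition)
    return heapq.nlargest(k, scored)


def search_all(partitions, query, k):
    """Query every partition concurrently, then keep the global top-k."""
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        per_partition = pool.map(lambda p: search_partition(p, query, k), partitions)
        candidates = [hit for hits in per_partition for hit in hits]
    return heapq.nlargest(k, candidates)


# Two toy partitions of (row_id, embedding) pairs.
partitions = [
    [(0, [1.0, 0.0]), (1, [0.0, 1.0])],
    [(2, [0.9, 0.1]), (3, [-1.0, 0.0])],
]
best = search_all(partitions, query=[1.0, 0.0], k=2)
print([row_id for _, row_id in best])  # [0, 2]
```

Each partition returns its own top-k, so the merge step only has to compare at most k candidates per partition.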
Incremental index maintenance is enabled for partitioned HNSW. When new rows are added, or the topmost rows are updated or removed, the partitioned HNSW index is maintained automatically.
To delete the partitioned HNSW index:
ds.embedding.delete_vdb_index("hnsw_1")
Impact
Partitioned indexes are much faster to create and have a high impact on recall. This feature is helpful whenever indexing has to be done at scale.