Is your feature request related to a problem?
Disk-based vector search was added in OpenSearch 2.17. This feature uses binary quantization to decrease the memory footprint of vector search while keeping recall high. While disk-based vector search is successful at keeping recall high for some datasets, other data sets have recall less than 0.9 out of the box. We can use the two techniques discussed in this RFC, Asymmetric Distance Computation (ADC) and Random Rotation (RR), to increase the recall on some of these datasets.
In order to maintain high recall, disk-based search uses two phases. The first phase searches a binary quantized index for an oversampled number of neighbors. The second phase rescores and reduces these oversampled neighbors using exact search.
Currently, the first phase quantizes each 4-byte floating point entry in a vector into a single bit, then computes its distance to neighbors using hamming distance.
While this approach achieves high recall on some datasets, it does not perform well on others.
We can see that for some datasets, the recall for on-disk (32x compression) is above 0.9 out of the box:
Data set | Recall@k |
---|---|
minillm-msmarco | 0.90 |
mpnet-msmarco | 0.92 |
mxbai-msmarco | 0.95 |
snowflake-msmarco | 0.92 |
However, the success of this approach is dataset dependent. There are datasets that are well below 0.9 out of the box:
Data set | Recall@k |
---|---|
sift | 0.39 |
glove | 0.42 |
gist | 0.12 |
cohere-v2-wiki | 0.57 |
cohere-v3-bioasq | 0.71 |
tasb-msmarco | 0.87 |
e5small-msmarco | 0.80 |
clip-flickr-image-queries | 0.81 |
cohere-v2-dbpedia | 0.76 |
In general, recall can be improved by increasing the number of documents we rescore (by increasing `oversample_factor`). However, because rescoring reads from disk with a random access pattern, more rescoring leads to higher latencies. So recall improves, but the recall/search-latency tradeoff does not.
What solution would you like?
Reduce the impact of quantization errors by combining two separate optimizations:
- Asymmetric Distance Computations (ADC)
- Random Rotation (RR)
I am opening this as a single RFC instead of several because the recall benefit of ADC+RR combined is much higher than the benefit of ADC or RR separately. Both techniques also carry a search latency tradeoff.
Asymmetric Distance Computation (ADC)
Binary quantization introduces error at two different stages: 1) when the document vectors are quantized during ingestion, and 2) when the query vectors are quantized during search.
While stage 1) is necessary to achieve memory savings across the many document vectors, stage 2) contributes little to reducing memory requirements. For instance, a full-precision 768-dimension float32 vector uses around 3 KB of memory, while the same binary-quantized vector uses about 100 B. Although this is a 32x difference, it is small compared to the ~3 GB required to index 1M binary-quantized documents (the added overhead of 3 KB vs. 100 B per query is not significant).
We can use ADC to preserve the full-precision vectors during search. This cuts down the error introduced in stage 2).
At a very high level, ADC rescales each query vector entry toward the range between 0 and 1. Each entry `q_d` in the query vector is modified via `q_d = (q_d - x_d) / (y_d - x_d)`, where `x_d` is the mean of all document entries quantized to 0 (the below-threshold mean) and `y_d` is the mean of all document entries quantized to 1 (the above-threshold mean). Intuitively, the query is transformed through a linear interpolation between the below- and above-threshold means: an entry that is very close to the below-threshold mean is mapped to a value near 0, and an entry close to the above-threshold mean is mapped to a value near 1. However, this scheme allows values other than just 0 and 1, which is where the improved accuracy comes from.
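For concreteness, here is a minimal Java sketch of this 1-bit query transformation. It is illustrative only; the method and array names are assumptions rather than the plugin's actual API.

```java
/**
 * Illustrative sketch of the 1-bit ADC query transformation.
 * belowMean[d] is the mean of all document entries in dimension d that quantized to 0,
 * and aboveMean[d] is the mean of those that quantized to 1.
 */
public static float[] transformQueryForAdc(float[] query, float[] belowMean, float[] aboveMean) {
    float[] transformed = new float[query.length];
    for (int d = 0; d < query.length; d++) {
        // q_d = (q_d - x_d) / (y_d - x_d): entries near the below-threshold mean map toward 0,
        // entries near the above-threshold mean map toward 1, and intermediate values are preserved.
        transformed[d] = (query[d] - belowMean[d]) / (aboveMean[d] - belowMean[d]);
        // A real implementation would guard against aboveMean[d] == belowMean[d].
    }
    return transformed;
}
```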
Random Rotation (RR)
One fact about binary quantization (a type of scalar quantization) is that each entry is given equal weight during the quantization process. This leads to undesirable effects when the variance differs across dimensions. For illustration, imagine a 2-dimensional data distribution where the first dimension (x axis) has small variance and the second dimension (y axis) has high variance. The position of a vector in the second dimension carries more "information" than its position in the first dimension, but binary quantization treats each dimension as equally important. If we rotate the distribution, however, some of the variance is "smoothed" from the second dimension into the first dimension. See below for an illustration.
Random rotation can help smooth uneven variance across dimensions, which improves the effectiveness of k-NN search under binary quantization, since binary quantization treats all dimensions equally.
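As a concrete illustration, here is a hedged Java sketch of one way to generate and apply such a random rotation (sampling a Gaussian matrix and orthonormalizing it with Gram-Schmidt, as described in the Random Rotation Design section below). The class and method names are illustrative, not the plugin's actual API.

```java
import java.util.Random;

/** Illustrative sketch: build a random orthonormal matrix and apply it to a vector. */
public final class RandomRotationSketch {

    /** Sample a dim x dim matrix of standard normals, then orthonormalize its rows with Gram-Schmidt. */
    public static float[][] randomRotationMatrix(int dim, long seed) {
        Random rng = new Random(seed);
        float[][] m = new float[dim][dim];
        for (int i = 0; i < dim; i++) {
            for (int j = 0; j < dim; j++) {
                m[i][j] = (float) rng.nextGaussian();
            }
        }
        for (int i = 0; i < dim; i++) {
            // Remove the components of row i that lie along the previous (already unit-norm) rows.
            for (int k = 0; k < i; k++) {
                float dot = 0f;
                for (int j = 0; j < dim; j++) dot += m[i][j] * m[k][j];
                for (int j = 0; j < dim; j++) m[i][j] -= dot * m[k][j];
            }
            // Normalize row i so the final matrix is orthonormal (and therefore norm-preserving).
            float norm = 0f;
            for (int j = 0; j < dim; j++) norm += m[i][j] * m[i][j];
            norm = (float) Math.sqrt(norm);
            for (int j = 0; j < dim; j++) m[i][j] /= norm;
        }
        return m;
    }

    /** Apply the rotation y = Mx; norms are preserved because M is orthonormal. */
    public static float[] rotate(float[][] m, float[] x) {
        float[] y = new float[x.length];
        for (int i = 0; i < x.length; i++) {
            float sum = 0f;
            for (int j = 0; j < x.length; j++) sum += m[i][j] * x[j];
            y[i] = sum;
        }
        return y;
    }
}
```

The same matrix would be applied to document vectors before quantization and to query vectors at search time.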
What alternatives have you considered?
- Increase the oversampling factor to achieve 0.9 recall out of the box for the above datasets.
The issue with this alternative is that it is naive and does not address the fundamental sources of error introduced by binary quantization (it does not account for uneven information across dimensions and still introduces quantization error in two stages). It will also increase the amount of random disk IO for fetching the full-precision vectors. Given that we achieve >0.9 recall on some datasets and <0.9 on others out of the box, I believe that solely increasing the oversampling factor is not the correct next step for improving the recall of disk-based vector search.
Benchmark Results
Benchmark Setup
Three sets of experiments were performed on the datasets we've used to benchmark disk-based vector search in the past:
- Random rotation (RR).
- Asymmetric distance computation (ADC).
- Asymmetric distance computation and random rotation (ADC+RR).

All experiments used the following configuration:
- m=16
- ef_construction=256
- ef_search=256
- oversample_factor = 5.0 if dimensions < 1000 else 3.0
Note that these experiments were conducted on a single-machine cluster with a single shard and a single segment. We expect recall to be higher for multi-segment and multi-shard setups, since more hnsw graphs are constructed in those cases (as opposed to a single hnsw graph).
Recall Changes
The following plots show recall differences between baseline k-NN and enhanced k-NN. From left to right: random rotation vs. baseline, ADC vs. baseline, ADC+RR vs. baseline (click to enlarge).
ADC+RR leads to good recall gains for some datasets, especially those with low baseline recall (where many of the returned neighbors are not true nearest neighbors). However, ADC+RR does not show improvements across the board, and some datasets see small regressions.
Here is an enlarged recall plot for ADC+RR:
Latency Impact
There is variation in search latency numbers from run to run, but the general trend is meaningful: ADC increases search latency. The average search latency impact across the above datasets is 17.6%.
Indexing Impact
The random rotation technique adds some additional overhead during indexing: when we quantize the vectors, we first rotate them. The impact on p90 indexing latencies is mostly negligible. The main performance hit comes when we force merge segments into a single segment, because all vectors must be rotated before being sent to the faiss index. We do not store an extra copy of the rotated vectors due to the memory overhead and engineering complexity. One optimization we could look into is storing rotated vectors in the source and undoing the rotation before returning the vectors. Alternatively, we could store only the rotated vectors (if the user only cares about the document ids and not the vectors themselves). k-NN already does something similar to support cosine similarity in faiss (it stores normalized vectors so that inner product and cosine similarity are equivalent).
Dataset | Mode | Model | Model type | dimension | Space type | Recall@100 (Enhanced) | Recall@100 (Baseline) | p90 query latency (Enhanced) | p90 query latency (Baseline) | P90 Latency Changes % | p90 indexing % change | Force merge time % change | p90 indexing (Enhanced) | p90 indexing (Baseline) | Force merge time (Enhanced) | Force merge time (Baseline) |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
coherev2-10m | on_disk | Cohere v2 | Text Embeddings | 768.0 | inner product | 0.52 | 0.57 | 54.7 | 60.2 | -9.1362 | -13.4746 | 177.7311 | 102.1 | 118.0 | 198.3 | 71.4 |
e5_embeddings | on_disk | e5.small | Text Embeddings | 384.0 | l2 | 0.88 | 0.8 | 25.8 | 19.1 | 35.0785 | -12.6792 | 45.0467 | 115.7 | 132.5 | 77.6 | 53.5 |
flickr-image-queries | on_disk | Clip | Image Embeddings | 512.0 | l2 | 0.93 | 0.81 | 57.3 | 51.7 | 10.8317 | -6.5632 | 93.9394 | 78.3 | 83.8 | 64.0 | 33.0 |
gist-960-euclidean | on_disk | Gist descriptor | Image Embeddings | 960.0 | l2 | 0.36 | 0.12 | 30.0 | 40.1 | -25.187 | -8.9844 | 412.7273 | 69.9 | 76.8 | 28.2 | 5.5 |
glove-200-angular | on_disk | Glove word embedding model | Text Embeddings | 200.0 | cosine | 0.57 | 0.42 | 4.5 | 4.8 | -6.25 | -16.1695 | 2.7397 | 53.4 | 63.7 | 7.5 | 7.3 |
marco_tasb | on_disk | tasb | Text Embeddings | 768.0 | inner product | 0.86 | 0.87 | 5.6 | 4.5 | 24.4444 | -31.063 | 300.0 | 66.8 | 96.9 | 23.2 | 5.8 |
mbread_msmarco | on_disk | minillm | Text Embeddings | 384.0 | l2 | 0.95 | 0.9 | 6.7 | 3.36 | 99.4048 | 263.8232 | 89.0 | 35.4 | 9.73 | ||
mini_msmarco | on_disk | mxbai | Text Embeddings | 1024.0 | cosine | 0.96 | 0.9 | 4.0 | 3.9 | 2.5641 | -3.2872 | 45.614 | 55.9 | 57.8 | 8.3 | 5.7 |
mpnet_marco | on_disk | mpnet | Text Embeddings | 768.0 | l2 | 0.98 | 0.95 | 6.8 | 4.0 | 70.0 | -18.7943 | 145.8824 | 68.7 | 84.6 | 20.9 | 8.5 |
sift-128-euclidean | on_disk | Sift descriptor | Image Embeddings | 128.0 | l2 | 0.7 | 0.39 | 3.8 | 3.7 | 2.7027 | -29.927 | 35.7143 | 9.6 | 13.7 | 3.8 | 2.8 |
ADC Design
High Level Design
If ADC is enabled, the following design allows finding the closest neighbors of a full-precision query vector against quantized document vectors. The main component of the design is a new `ADCFlatCodesDistanceComputer` that enables asymmetric searching. Our strategy is to create an altered binary index whose distance calculations are handled by the custom computer. The creation step does not require reindexing data, but it does require loading the preexisting faiss index. Whenever this index is queried from Java code, the hnsw graph search uses the transformed full-precision query vector to determine its closest neighbors via the overridden distance computer.
Indexing Changes
We use the `QuantizerHelper` to calculate the above- and below-threshold means for ADC over the vectors in each segment. These means are stored in the `QuantizationState` so that they can be used to transform the query vector at search time. The faiss index creation and indexing steps are unchanged, as our strategy is to alter the query vectors during searching. The quantized document vectors are not changed at all; we just calculate the `aboveThresholdMeans[]` and `belowThresholdMeans[]` for later searches.
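For illustration, here is a hedged Java sketch of how these per-dimension means could be computed from a segment's vectors. The real computation lives in the `QuantizerHelper`; the method name and the use of explicit per-dimension thresholds are assumptions.

```java
/**
 * Illustrative sketch: compute per-dimension below/above threshold means for 1-bit quantization.
 * thresholds[d] is the per-dimension quantization threshold (entries <= threshold quantize to 0).
 * Returns { belowThresholdMeans, aboveThresholdMeans }.
 */
public static float[][] computeThresholdMeans(float[][] docVectors, float[] thresholds) {
    int dim = thresholds.length;
    float[] belowMean = new float[dim];
    float[] aboveMean = new float[dim];
    int[] belowCount = new int[dim];
    int[] aboveCount = new int[dim];
    for (float[] vector : docVectors) {
        for (int d = 0; d < dim; d++) {
            if (vector[d] <= thresholds[d]) {
                belowMean[d] += vector[d];   // this entry quantizes to 0
                belowCount[d]++;
            } else {
                aboveMean[d] += vector[d];   // this entry quantizes to 1
                aboveCount[d]++;
            }
        }
    }
    for (int d = 0; d < dim; d++) {
        if (belowCount[d] > 0) belowMean[d] /= belowCount[d];
        if (aboveCount[d] > 0) aboveMean[d] /= aboveCount[d];
    }
    return new float[][] { belowMean, aboveMean };
}
```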
Search Changes
We use the `aboveThresholdMeans[]` and `belowThresholdMeans[]` arrays to transform each query vector. The main change to search is that we load an altered faiss index with a new distance computer that handles the asymmetric computation. The red boxes in the diagram below represent the alterations to the faiss index. The main new component is a document storage class, called `FaissIndexBQ`, that overrides the `get_FlatCodesDistanceComputer` method to return the desired `ADCFlatCodesDistanceComputer`. The new computer overrides the distance function used in hnsw graph search. We pass the preexisting loaded id map and hnsw graph structure to the new binary index. We pass the codes vector (the set of quantized documents in our index) from the original storage to our altered storage as a pointer. We do not need to perform a vector copy, since we can assume that the original storage will remain in memory for the lifetime of the altered index.
Whenever this ADC-enabled index is searched, the new distance computer's `set_query` and `distance_to_code` methods are used for asymmetric computation within the hnsw search code.
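To make the asymmetric computation concrete, here is a standalone Java sketch of the distance logic the custom computer performs for an L2 space. The real `ADCFlatCodesDistanceComputer` is C++ code inside the altered faiss index; the method name and the bit ordering within each byte are assumptions.

```java
/**
 * Illustrative sketch: asymmetric L2 distance between a transformed full-precision query
 * and a bit-packed 1-bit document code (8 dimensions packed per byte).
 */
public static float adcL2Distance(float[] transformedQuery, byte[] code) {
    float distance = 0f;
    for (int d = 0; d < transformedQuery.length; d++) {
        // Extract bit d of the packed code (0 or 1); the bit ordering here is an assumption.
        int bit = (code[d / 8] >> (d % 8)) & 1;
        // Compare the real-valued query entry against the quantized document entry.
        float diff = transformedQuery[d] - bit;
        distance += diff * diff;
    }
    return distance;
}
```

By contrast, the symmetric baseline quantizes the query as well and compares the two bit vectors with hamming distance.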
After the search concludes, the path is the same as before, with rescoring handled by preexisting code. Any rescoring is performed with the original non-transformed vector against the non-quantized document vectors. The filtering code path is not affected.
See the LLD section for a sequence diagram covering searching and caching as well as more details about the alteration strategy.
Out of Scope
The following are out of scope for this particular design:
- Using ADC during indexing to build a better hnsw graph. We would need to explore this direction further to see if the recall gains are large enough to offset what would be a large increase in indexing time.
- Automatically deducing when ADC should be enabled/disabled. We elaborate on this option in the alternatives considered section, but this document does not include the `auto` design.
LLD
Faiss Index Alteration Strategy
In faiss, a binary index is implemented in the `faiss::IndexBinary` class. The `faiss::IndexBinaryIDMap`, `faiss::IndexBinaryHNSW`, and `faiss::IndexBinaryFlat` classes encapsulate the vector-to-document-id map, the hnsw structure, and the document storage, respectively. When a faiss index is searched, a specific distance computer is used. The distance computer is attached to the document storage class, so we create a new document storage class, `FaissIndexBQ`, that overrides the `get_FlatCodesDistanceComputer` method to return the desired `ADCFlatCodesDistanceComputer`. As mentioned in the high level design, we pass the preexisting loaded id map and hnsw graph structure to the new binary index. We pass the codes vector (representing the set of quantized documents in our index) from the original storage to our altered storage as a pointer. We do not need to perform a vector copy, since we can assume that the original storage will remain in memory for the lifetime of the altered index. Below is a more detailed diagram illustrating the index alteration strategy.
Search sequence diagram
Transformation Formulae
1 bit: `q_d = (q_d - x_d) / (y_d - x_d)`, where `x_d` is the mean of all document entries quantized to 0 (the below-threshold mean) and `y_d` is the mean of all document entries quantized to 1 (the above-threshold mean).
The multi-bit transformation is slightly more involved. Essentially, instead of splitting each dimension at a single below/above threshold, it splits each dimension at multiple thresholds (for instance 00, 01, 11 thresholds) and scales the query vector according to these thresholds. See the appendix for a full explanation.
Additional Memory Requirements
For 1 bit, per query:
- Need to store (d / 8) * 256 additional floats for the batched distance computation.
- Total cost: (d / 8) * 256 * 4 = 128 * d bytes.

For 2 bits, per query:
- Need to store (d / 8) * 256 * 2 additional floats for the batched distance computation.
- Need to store a copy of the above- and below-threshold means used to transform the query vector (2 * d floats).
- Finally, need the partitions (3 * d floats). This is not strictly necessary, but it makes the implementation easier.
- Total cost: 4 * ((d * 64) + (d * 2) + (d * 3)) = 276 * d bytes.
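As a hedged worked example using the totals above: for d = 768, the 1-bit case adds 128 * 768 = 98,304 bytes (~96 KB) of per-query memory, and the 2-bit case adds 276 * 768 = 211,968 bytes (~207 KB) per query.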
Batching Optimization
We optimize ADC by creating a codebook, exploiting the fact that the document vectors ("codes") are represented as contiguous bits. Due to the bit-packing logic, each byte represents 8 entries of a document code. We can therefore precompute the 2^8 = 256 possible distance contributions (for byte values 0000 0000, 0000 0001, 0000 0011, etc.) for each batch of 8 code entries. This allows us to find the distance to a batch of vector dimensions with a single array lookup.
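Here is a hedged Java sketch of this batching idea for L2 distance. The table has (d / 8) * 256 entries, matching the per-query memory cost listed above; the method names and bit ordering are assumptions, and the actual implementation lives in the C++ distance computer.

```java
/**
 * Illustrative sketch: precompute, for each byte of the packed code, the distance contribution
 * of all 256 possible byte values, so scoring a document takes one table lookup per byte.
 */
public static float[][] buildDistanceLookupTable(float[] transformedQuery) {
    int numBytes = transformedQuery.length / 8;      // 8 dimensions are packed into each byte
    float[][] table = new float[numBytes][256];
    for (int b = 0; b < numBytes; b++) {
        for (int value = 0; value < 256; value++) {
            float contribution = 0f;
            for (int j = 0; j < 8; j++) {
                int bit = (value >> j) & 1;          // bit ordering within the byte is an assumption
                float diff = transformedQuery[b * 8 + j] - bit;
                contribution += diff * diff;         // L2 contribution of this dimension
            }
            table[b][value] = contribution;
        }
    }
    return table;
}

/** Score a bit-packed document code with one lookup per byte. */
public static float batchedDistance(float[][] table, byte[] code) {
    float distance = 0f;
    for (int b = 0; b < table.length; b++) {
        distance += table[b][code[b] & 0xFF];        // mask so the byte indexes 0..255
    }
    return distance;
}
```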
Random Rotation Design
High Level Design
Random rotation is implemented within the quantization code flow in k-NN. The rotation matrix is generated by the following process:
- Sample each entry of the matrix from a standard normal random variable.
- Use Gram-Schmidt to orthonormalize the matrix.
The resulting orthonormal matrix `M` gives us `norm(Mx) = norm(x)` with probability 1. We apply the rotation matrix `M` to our full-precision document vectors before quantizing them, and also to our query vectors.
Low Level Design
Train Flow
Indexing Flow
Search Flow
Open Questions
- Due to merge time regressions, should we store the rotated vectors in memory?
- We do something similar in the cosine space for faiss, where we store the normalized vectors in source.
- Rotation, like normalization, is an invertible operation. We can preserve the option of undoing this rotation by logging the entire random rotation matrix and its transpose (the transpose of the rotation matrix is its inverse, since the matrix is orthogonal by construction). However, this approach may break if data is downloaded from OpenSearch and then re-indexed, since it would be rotated multiple times. This is in contrast to normalization, which only affects the vectors during the first indexing step.
User Experience - Recommended Changes to k-NN Interface
We recommend giving users control over ADC+RR via index mapping parameters. The P0 goal for 3.1 is to introduce low-level configuration changes under the encoder configuration: `enable_adc` and `random_rotation` can be set to `true`/`false` to enable/disable each behavior. They are off by default to avoid unexpected latency increases.
```json
PUT vector-index
{
  "settings": {
    "index": {
      "knn": true
    }
  },
  "mappings": {
    "properties": {
      "vector_field": {
        "type": "knn_vector",
        "dimension": 8,
        "method": {
          "name": "hnsw",
          "engine": "faiss",
          "space_type": "l2",
          "parameters": {
            "encoder": {
              "name": "binary",
              "parameters": {
                "bits": 1,
                "random_rotation": true,
                "enable_adc": true
              }
            }
          }
        }
      }
    }
  }
}
```
In the future we can add a higher-level index mapping parameter named `enhance_search`, `boost_recall`, or `boost_search_recall` with boolean `true`/`false` options. `enhance_search`/`boost_recall` should be `false` by default to prevent unexpected latency increases.
```json
PUT /adc-index
{
  "settings": {
    "index": {
      "knn": true
    }
  },
  "mappings": {
    "properties": {
      "vector_field": {
        "type": "knn_vector",
        "dimension": 768,
        "space_type": "l2",
        "data_type": "float",
        "mode": "on_disk",
        "compression_level": "32x",
        // new enable adc+rr mapping parameter
        "enhance_search": true
      }
    }
  }
}
```
Alternatives Considered
Enable by Default
One alternative is to enable ADC+RR by default. Though this would boost recall on many quantized datasets, it comes at the cost of higher latency. If we enable search accuracy enhancements by default, users will be confused by latency degradation in their preexisting clusters. The advantage of this approach is that recall will increase. However, the desirable recall/latency tradeoff is specific to each use case, so it doesn't make sense to turn this setting on for all users: some will be happy (if they primarily want higher recall) and others will be upset (if they primarily want high performance). Additionally, the impact on recall and performance may be opaque.
Enable by default but decrease oversampling
A large factor in on-disk search performance is the degree of oversampling/rescoring in an index. We may be able to tune the oversampling factor when `enhance_search` is enabled. Right now the `oversample_factor` is set to 5.0 for dimensions < 1000 and to 3.0 for dimensions >= 1000. We could investigate how much we can decrease the `oversample_factor` when `enhance_search` is on. This approach would require a series of experiments, essentially a grid search over oversample factors, to determine whether there is a configuration that boosts recall for all test datasets without a latency penalty. If the community believes this is a promising direction, it can be pursued as a P1 task.
Enable based on characteristics of the index (auto mode)
We found from benchmarking that text embedding data holds up very well to binary quantization but image data (sift, gist) does not. We could further investigate the datasets that perform poorly and try to find characteristics of the index that we can use to toggle `enhance_search` automatically. We could track metrics based on the data distribution of the index, perhaps using the new profiler API; unfortunately, the current profiler metrics of mean and variance are not sufficient for ADC. This investigation will take time and may not reveal any new insights into when ADC should be turned on/off. Fundamentally, a user's decision about whether to use ADC depends on whether they are willing to trade search latency for search accuracy. The exact hit to search time and the boost to accuracy depend on the user's workload; we saw in the experiments that the accuracy boost and performance cost of ADC vary significantly by dataset.
Coauthors
Jack Mazanec (@jmazanec15), Yonatan Naamad (@ynaamad), Vikash Tiwari (@Vikasht34), Tal Wagner (@talwagner)