[FEATURE][RFC] Asymmetric Distance Calculation and Random Rotation to Boost Recall for On-Disk Vector Search. #2714

@finnroblin

Description

Is your feature request related to a problem?
Disk-based vector search was added in OpenSearch 2.17. This feature uses binary quantization to decrease the memory footprint of vector search while keeping recall high. While disk-based vector search is successful at keeping recall high for some datasets, other data sets have recall less than 0.9 out of the box. We can use the two techniques discussed in this RFC, Asymmetric Distance Computation (ADC) and Random Rotation (RR), to increase the recall on some of these datasets.

In order to maintain high recall, disk-based search uses two phases. The first phase searches a binary quantized index for an oversampled number of neighbors. The second phase rescores and reduces these oversampled neighbors using exact search.

[Image: two-phase disk-based search, quantized first phase plus exact rescoring]

Currently, the first phase quantizes each 4-byte floating-point entry of a vector into a single bit and then computes distances to candidate neighbors using Hamming distance.
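For intuition, here is a minimal numpy sketch of what single-bit quantization plus Hamming distance looks like. This is illustrative only; the per-dimension threshold is assumed here to be the dimension mean, and the real faiss/k-NN implementation differs in detail.

```python
import numpy as np

def binary_quantize(vectors: np.ndarray, thresholds: np.ndarray) -> np.ndarray:
    """Quantize each float entry to one bit: 1 if above the per-dimension threshold, else 0."""
    bits = (vectors > thresholds).astype(np.uint8)
    return np.packbits(bits, axis=-1)               # bit-packed codes, d/8 bytes per vector

def hamming_distance(code_a: np.ndarray, code_b: np.ndarray) -> int:
    """Number of differing bits between two packed codes."""
    return int(np.unpackbits(code_a ^ code_b).sum())

# Toy example: 8-dimensional vectors; thresholds taken as the per-dimension mean.
docs = np.random.randn(1000, 8).astype(np.float32)
thresholds = docs.mean(axis=0)
codes = binary_quantize(docs, thresholds)

query = np.random.randn(1, 8).astype(np.float32)
query_code = binary_quantize(query, thresholds)[0]
dists = [hamming_distance(query_code, c) for c in codes]   # phase 1: approximate distances
```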

While this approach achieves high recall on some datasets, it does not perform well on others.

We can see that for some datasets, the recall for on-disk (32x compression) is above 0.9 out of the box:

| Dataset | Recall@k |
|---|---|
| minillm-msmarco | 0.90 |
| mpnet-msmarco | 0.92 |
| mxbai-msmarco | 0.95 |
| snowflake-msmarco | 0.92 |

However, the success of this approach is dataset dependent. There are datasets that are well below 0.9 out of the box:

| Dataset | Recall@k |
|---|---|
| sift | 0.39 |
| glove | 0.42 |
| gist | 0.12 |
| cohere-v2-wiki | 0.57 |
| cohere-v3-bioasq | 0.71 |
| tasb-msmarco | 0.87 |
| e5small-msmarco | 0.80 |
| clip-flickr-image-queries | 0.81 |
| cohere-v2-dbpedia | 0.76 |

In general, recall can be improved by increasing the number of documents we rescore (by raising oversample_factor). However, because rescoring has a random access pattern to disk, more rescoring leads to slower latencies. Recall improves, but the recall/search latency tradeoff does not.

What solution would you like?

Reduce the impact of quantization errors by combining two separate optimizations:

  1. Asymmetric Distance Computations (ADC)
  2. Random Rotation (RR)

I am opening this as a single RFC rather than several because ADC+RR together yields much larger recall gains than either technique alone. Both techniques also carry a search latency tradeoff.

Asymmetric Distance Computation (ADC)

Binary quantization introduces error at two different stages: 1) when the document vectors are quantized during ingestion, and 2) when the query vectors are quantized during search.

While stage 1) is necessary to achieve memory savings because of the number of documents involved, stage 2) matters far less for memory. For instance, a full-precision 768-dimension float32 vector uses around 3 kB of memory, while the same binary-quantized vector uses around 100 B. Although that is a 32x difference, it is tiny compared to the ~3 GB required to index 1M documents of this dimensionality; the added overhead of keeping a single query at 3 kB rather than 100 B is not significant.

We can use ADC to keep the query vectors at full precision during search, which avoids the error introduced in stage 2).

At a very high level, ADC rescales each entry of the query vector toward the interval [0, 1]. Each entry q_d in the query vector is modified via q_d = (q_d - x_d) / (y_d - x_d), where x_d is the mean of all document entries quantized to 0 (the below-threshold mean) and y_d is the mean of all document entries quantized to 1 (the above-threshold mean). Intuitively, the query is transformed through a linear interpolation between the below- and above-threshold means: an entry that is very close to the below-threshold mean maps to a value near 0, and an entry close to the above-threshold mean maps to a value near 1. Crucially, this scheme allows values other than exactly 0 and 1, which is where the improved accuracy comes from.
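As a rough sketch (illustrative only; the array names here are hypothetical, not the actual k-NN implementation), the per-query transformation looks like this:

```python
import numpy as np

def adc_transform(query: np.ndarray,
                  below_threshold_means: np.ndarray,    # x_d: mean of doc entries quantized to 0
                  above_threshold_means: np.ndarray):   # y_d: mean of doc entries quantized to 1
    """Linearly interpolate each query entry between the below/above threshold means."""
    return (query - below_threshold_means) / (above_threshold_means - below_threshold_means)

# An entry equal to the below-threshold mean maps to 0.0, an entry equal to the
# above-threshold mean maps to 1.0, and other entries land anywhere in between
# (or even outside [0, 1]), which is what preserves the extra precision.
```

The transformed query can then be compared directly against the 0/1 entries of each binary code (for example, with a squared-L2 style sum over dimensions), which is what motivates the batching optimization described later in this RFC.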

Random Rotation (RR)

One property of binary quantization (a type of scalar quantization) is that each entry is given equal weight during the quantization process. This leads to undesirable effects when the variance differs substantially across dimensions. For illustration, imagine a 2-dimensional data distribution where the first dimension (x axis) has small variance and the second dimension (y axis) has high variance. The position of a vector in the second dimension carries more "information" than its position in the first dimension, but binary quantization treats each dimension as equally important. If we rotate the distribution, however, some of the variance is "smoothed" from the second dimension into the first. See below for an illustration.

[Image: rotating a 2-D distribution with uneven variance across dimensions]

Random rotation can help smooth uneven variance across dimensions, which improves the effectiveness of k-NN search under binary quantization, since binary quantization treats all dimensions equally.

What alternatives have you considered?

  • Increase the oversampling factor to achieve 0.9 recall out of the box for the above datasets.
    The issue with this alternative is that it is naive and does not address the fundamental sources of error in binary quantization: it does not account for uneven information across dimensions, and it still introduces quantization error in both stages. It also increases the amount of random disk IO needed to fetch the full-precision vectors for rescoring. Given that we already achieve >0.9 recall on some datasets and <0.9 on others out of the box, I believe that increasing the oversampling factor alone is not the right next step for improving disk-based vector search recall.

Benchmark Results

Benchmark Setup

Three sets of experiments were performed on the several datasets we've used to benchmark disk-based vector search in the past:
  1. Random rotation (RR).
  2. Asymmetric distance calculation (ADC).
  3. Asymmetric distance calculation and random rotation (ADC+RR).
The hyperparameters used for these benchmarks are the k-NN defaults. The relevant defaults are:
  • m=16
  • ef_construction=256
  • ef_search=256
  • oversample_factor = 5.0 if dimensions < 1000 else 3.0

Note that these experiments were conducted on a single shard/single segment one-machine cluster. We expect recall to be higher for multiple segments and multiple shard setups since there will be more hnsw graphs constructed in those cases (as opposed to a single hnsw graph).

Recall Changes

The following plots show recall differences between baseline k-NN and enhanced k-NN.

From left to right: Random Rotation vs. baseline, ADC vs. baseline, ADC+RR vs. baseline.

[Image: recall comparison plots]


ADC+RR leads to good recall gains for some datasets, especially those where baseline recall is low (i.e., where many of the neighbors returned by the quantized first phase are not true nearest neighbors). However, ADC+RR does not show improvements across the board, and there are some datasets with small regressions.

Here is an enlarged recall plot for ADC+RR:

[Image: enlarged recall plot for ADC+RR]

Latency impact

There is run-to-run variation in the search latency numbers, but the general trend is meaningful: ADC increases search latency.

The average search latency impact across the above datasets is 17.6%.

[Image: search latency comparison]

Indexing Impact

The random rotation technique adds some overhead during indexing: when we quantize the vectors, we first rotate them. The impact on p90 indexing latencies is mostly negligible.

[Image: p90 indexing latency comparison]

The main performance hit comes when we force merge segments into a single segment, because all vectors must be rotated before being sent to the faiss index. We do not store an extra copy of the rotated vectors due to the memory overhead and engineering complexity. One optimization we could investigate is storing the rotated vectors in the source and undoing the rotation before returning the vectors. Alternatively, we could store only the rotated vectors (if the user only cares about the document ids and not the vectors themselves). k-NN already does something similar to support cosine similarity in faiss (it stores normalized vectors so that inner product and cosine similarity are equivalent).

[Image: force merge time comparison]

| Dataset | Mode | Model | Model type | Dimension | Space type | Recall@100 (Enhanced) | Recall@100 (Baseline) | p90 query latency (Enhanced) | p90 query latency (Baseline) | p90 query latency change % | p90 indexing change % | Force merge time change % | p90 indexing (Enhanced) | p90 indexing (Baseline) | Force merge time (Enhanced) | Force merge time (Baseline) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| coherev2-10m | on_disk | Cohere v2 | Text Embeddings | 768 | inner product | 0.52 | 0.57 | 54.7 | 60.2 | -9.1362 | -13.4746 | 177.7311 | 102.1 | 118.0 | 198.3 | 71.4 |
| e5_embeddings | on_disk | e5.small | Text Embeddings | 384 | l2 | 0.88 | 0.80 | 25.8 | 19.1 | 35.0785 | -12.6792 | 45.0467 | 115.7 | 132.5 | 77.6 | 53.5 |
| flickr-image-queries | on_disk | Clip | Image Embeddings | 512 | l2 | 0.93 | 0.81 | 57.3 | 51.7 | 10.8317 | -6.5632 | 93.9394 | 78.3 | 83.8 | 64.0 | 33.0 |
| gist-960-euclidean | on_disk | Gist descriptor | Image Embeddings | 960 | l2 | 0.36 | 0.12 | 30.0 | 40.1 | -25.187 | -8.9844 | 412.7273 | 69.9 | 76.8 | 28.2 | 5.5 |
| glove-200-angular | on_disk | Glove word embedding model | Text Embeddings | 200 | cosine | 0.57 | 0.42 | 4.5 | 4.8 | -6.25 | -16.1695 | 2.7397 | 53.4 | 63.7 | 7.5 | 7.3 |
| marco_tasb | on_disk | tasb | Text Embeddings | 768 | inner product | 0.86 | 0.87 | 5.6 | 4.5 | 24.4444 | -31.063 | 300.0 | 66.8 | 96.9 | 23.2 | 5.8 |
| mbread_msmarco | on_disk | minillm | Text Embeddings | 384 | l2 | 0.95 | 0.90 | 6.7 | 3.36 | 99.4048 | | 263.8232 | 89.0 | | 35.4 | 9.73 |
| mini_msmarco | on_disk | mxbai | Text Embeddings | 1024 | cosine | 0.96 | 0.90 | 4.0 | 3.9 | 2.5641 | -3.2872 | 45.614 | 55.9 | 57.8 | 8.3 | 5.7 |
| mpnet_marco | on_disk | mpnet | Text Embeddings | 768 | l2 | 0.98 | 0.95 | 6.8 | 4.0 | 70.0 | -18.7943 | 145.8824 | 68.7 | 84.6 | 20.9 | 8.5 |
| sift-128-euclidean | on_disk | Sift descriptor | Image Embeddings | 128 | l2 | 0.70 | 0.39 | 3.8 | 3.7 | 2.7027 | -29.927 | 35.7143 | 9.6 | 13.7 | 3.8 | 2.8 |


ADC Design

High Level Design

If ADC is enabled, the following design allows the closest neighbors of a full-precision query vector to be found against the quantized document vectors.

The main component of the design is a new ADCFlatCodesDistanceComputer that enables asymmetric searching. Our strategy is to create an altered binary index whose distance calculations are handled by this custom computer. Creating the altered index does not require reindexing data, but it does require loading the preexisting faiss index. Whenever this index is queried from Java code, the hnsw graph search uses the transformed full-precision query vector to determine the closest neighbors via the overloaded distance computer.

Indexing Changes

We use the QuantizerHelper to calculate the above- and below-threshold means for the vectors in each segment. These means are stored in the QuantizationState so that they can later be used to transform query vectors at search time. The faiss index creation and indexing steps are unchanged, since our strategy is to alter the query vectors during search. The quantized document vectors are not changed at all; we only calculate the aboveThresholdMeans[] and belowThresholdMeans[] for later searches.

[Image: indexing flow with above/below threshold mean calculation]
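Conceptually, the statistics gathered at indexing time are just two conditional means per dimension. A simplified sketch (not the actual QuantizerHelper code) follows:

```python
import numpy as np

def compute_threshold_means(vectors: np.ndarray, thresholds: np.ndarray):
    """Per dimension: mean of entries quantized to 0 (below) and to 1 (above).
    Assumes each dimension has at least one entry on each side of its threshold."""
    dim = vectors.shape[1]
    above = vectors > thresholds
    below_means = np.array([vectors[~above[:, d], d].mean() for d in range(dim)])
    above_means = np.array([vectors[above[:, d], d].mean() for d in range(dim)])
    return below_means, above_means   # belowThresholdMeans[], aboveThresholdMeans[]
```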

Search Changes

We use the aboveThresholdMeans[] and belowThresholdMeans[] arrays to transform each query vector.
The main change to search is that we load an altered faiss index with a new distance computer that handles the asymmetric computation. The red boxes in the below diagram represent the alterations to the faiss index. The main new component is a document storage class, called FaissIndexBQ, that overrides the get_FlatCodesDistanceComputer method to return the desired ADCFlatCodesDistanceComputer. The new computer overrides the distance function in hnsw graph search. We pass the preexisting loaded id map and hnsw graph structure to the new binary index. We pass the codes vector (the set of quantized documents in our index) from the original storage to our altered storage as a pointer. We do not need to perform a vector copy, since we can assume that the original storage will remain in memory for the lifetime of the altered index.

[Image: altered faiss index with ADCFlatCodesDistanceComputer (search changes)]

Whenever this ADC-enabled index is searched, the new distance computer’s set_query and distance_to_code methods are used for asymmetric computation within the hnsw search code.

After the search concludes, the path is the same as before, with rescoring handled by preexisting code. Any rescoring is performed with the original non-transformed vector against the non-quantized document vectors. The filtering code path is not affected.

See the LLD section for a sequence diagram covering searching and caching as well as more details about the alteration strategy.

Out of Scope

The following are out of scope for this particular design:
  • Using ADC during indexing to build a better hnsw graph. We would need to explore this direction further to see if the recall gains are large enough to offset what will be a large increase in indexing time.
  • Automatically deducing when ADC should be enabled/disabled. We elaborate on this option in the alternatives considered section, but this document does not include the auto design.

LLD

Faiss Index Alteration Strategy

In faiss, a binary index is implemented in the faiss::IndexBinary class. The faiss::IndexBinaryIDMap, faiss::IndexBinaryHNSW, and faiss::IndexBinaryFlat classes encapsulate the vector-to-document-id map, the hnsw structure, and the document storage, respectively. When a faiss index is searched, a specific distance computer is used. The distance computer is attached to the document storage class, so we create a new document storage class, FaissIndexBQ, that overrides the get_FlatCodesDistanceComputer method to return the desired ADCFlatCodesDistanceComputer. As mentioned in the high level design, we pass the preexisting loaded id map and hnsw graph structure to the new binary index, and we pass the codes vector (representing the set of quantized documents in our index) from the original storage to our altered storage as a pointer. We do not need to perform a vector copy, since we can assume that the original storage will remain in memory for the lifetime of the altered index. Below is a more detailed diagram illustrating the index alteration strategy.

[Image: faiss index alteration strategy diagram]



Search sequence diagram

[Image: search sequence diagram]

Transformation Formulae

1 bit: q_d = (q_d - x_d) / (y_d - x_d) where x_d is the mean of all document entries quantized to 0 (the below threshold mean) and y_d is the mean of all document entries quantized to 1 (the above threshold mean).

The multi-bit transformation is slightly more involved. Instead of splitting each dimension around a single threshold, it splits each dimension across multiple thresholds (for instance, 00, 01, 11 thresholds) and scales the query vector according to those thresholds. See the appendix for a full explanation.

Additional Memory Requirements

For 1 bit, per query:
  • Need to store (d / 8) * 256 additional floats for the batched distance table.
  • Total cost: (d / 8) * 256 * 4 = 128 * d bytes.
For 2 bits, per query:
  • Need to store (d / 8) * 256 * 2 additional floats for the batched distance table.
  • Need to store a copy of the above- and below-threshold means used to transform the query vector (2 * d floats).
  • Finally, need partitions (3 * d floats).
    • Not strictly necessary, but makes the implementation easier.
  • Total cost: 4 * ((d * 64) + (d * 2) + (d * 3)) = 276 * d bytes.
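As a concrete example, for d = 768 this works out to 128 × 768 = 98,304 bytes (about 96 KB) of extra per-query memory in the 1-bit case and 276 × 768 = 211,968 bytes (about 207 KB) in the 2-bit case.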

Batching Optimization

We optimize ADC by creating a codebook, exploiting the fact that the document vectors ("codes") are represented as contiguous bits. Due to the bit-packing logic, each byte represents 8 entries of a document code. We can therefore precompute the 2^8 = 256 possible partial distances (for the bit patterns 0000 0000, 0000 0001, 0000 0011, etc.) for each 8-entry block of the query. This lets us compute the distance contribution of a whole block of dimensions with a single array lookup.
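A minimal sketch of the idea (illustrative only; it assumes a squared-L2 style distance against 0/1 code values and numpy's default MSB-first bit order, whereas the real implementation must match faiss's bit-packing order):

```python
import numpy as np

def build_lookup_table(transformed_query: np.ndarray) -> np.ndarray:
    """Precompute, for every 8-dimension block and every possible byte value (0..255),
    the partial squared distance between the transformed query and that bit pattern."""
    d = transformed_query.shape[0]
    n_blocks = d // 8
    table = np.zeros((n_blocks, 256), dtype=np.float32)   # (d / 8) * 256 floats per query
    for block in range(n_blocks):
        q_block = transformed_query[block * 8:(block + 1) * 8]
        for byte_value in range(256):
            bits = np.unpackbits(np.array([byte_value], dtype=np.uint8))  # 8 bits, MSB first
            table[block, byte_value] = np.sum((q_block - bits) ** 2)
    return table

def distance_to_code(table: np.ndarray, packed_code: np.ndarray) -> float:
    """Distance to one document code: one table lookup per byte of the packed code."""
    return float(sum(table[block, byte] for block, byte in enumerate(packed_code)))
```

In this sketch, set_query corresponds to building the table once per query, and distance_to_code reduces each document to d/8 table lookups; the table itself is the (d / 8) * 256 floats of per-query memory listed above.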

Random Rotation Design

High Level Design

Random rotation is implemented within the quantization code flow in k-NN. The matrix is generated by the following process:
  1. Sample each entry in the matrix from a standard normal random variable.
  2. Use Gram-Schmidt to orthonormalize the matrix.
The resultant matrix M gives us norm(Mx) = norm(x) with probability 1.

We apply our rotation matrix M to our full-precision document vectors before quantizing them, and also to our query vectors.
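A small numpy sketch of this generation process (illustrative only; the actual implementation lives in the k-NN quantization code path) might look like:

```python
import numpy as np

def random_rotation_matrix(dim: int, rng: np.random.Generator) -> np.ndarray:
    """Sample a Gaussian matrix, then orthonormalize its rows with Gram-Schmidt."""
    gaussian = rng.standard_normal((dim, dim))
    rotation = np.zeros_like(gaussian)
    for i in range(dim):
        v = gaussian[i].copy()
        for j in range(i):                      # remove components along previous rows
            v -= np.dot(rotation[j], gaussian[i]) * rotation[j]
        rotation[i] = v / np.linalg.norm(v)     # normalize; rows end up orthonormal
    return rotation

rng = np.random.default_rng(42)
M = random_rotation_matrix(128, rng)

x = rng.standard_normal(128)
assert np.isclose(np.linalg.norm(M @ x), np.linalg.norm(x))  # norm(Mx) == norm(x)
# M is orthogonal, so M.T is its inverse and the rotation can be undone with M.T @ (M @ x).
```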

Low Level Design

Train Flow

[Image: train flow diagram]

Indexing Flow

[Image: indexing flow diagram]

Search Flow

[Image: search flow diagram]

Open Questions

  • Due to merge time regressions, should we store the rotated vectors in memory?
    • We do something similar in the cosine space for faiss, where we store the normalized vectors in source.
    • Rotation, like normalization, is an invertible operation. We can preserve the option of undoing it by logging the entire random rotation matrix (its transpose is its inverse, since the matrix is orthogonal by construction). However, this approach may break if data is downloaded from OpenSearch and then re-indexed, since the vectors would then be rotated multiple times. This is in contrast to normalization, which only affects the vectors during the first indexing step (normalizing an already normalized vector is a no-op).

User Experience - Recommended Changes to k-NN Interface

We recommend giving users control over ADC+RR via an index mapping parameter. The P0 goal for 3.1 is to introduce low-level configuration changes under the encoder configuration: enable_adc and random_rotation can each be set to true/false to enable or disable the corresponding behavior. Both are off by default to avoid unexpected latency increases.

PUT vector-index
{
  "settings" : {
    "index": {
      "knn": true
    }
  },
  "mappings": {
    "properties": {
      "vector_field": {
        "type": "knn_vector",
        "dimension": 8,
        "method": {
            "name": "hnsw",
            "engine": "faiss",
            "space_type": "l2",
            "parameters": {
              "encoder": {
                "name": "binary",
                "parameters": {
                  "bits": 1,
		  "random_rotation": true,
		  "enable_adc": true
                }
              }
            }
        }
      }
    }
  }
}

In the future we can add a higher-level index mapping parameter named enhance_search, boost_recall, or boost_search_recall that takes a boolean. It should be false by default to prevent unexpected latency increases.

PUT /adc-index
{
  "settings" : {
    "index": {
      "knn": true
    }
  },
  "mappings": {
    "properties": {
      "vector_field": {
        "type": "knn_vector",
        "dimension": 768,
        "space_type": "l2",
        "data_type": "float",
        "mode": "on_disk",
        "compression_level": "32x",
        // new enable adc+rr mapping parameter
        "enhance_search": true
      }
    }
  }
}

Alternatives Considered

Enable by Default

One alternative is to enable ADC+RR by default. Although this would boost recall on many quantized datasets, it comes at the cost of higher latency, and users would be confused by latency degradation in their preexisting clusters. The desirable recall/latency tradeoff is specific to each use case, so it does not make sense to turn the setting on for everyone: some users would be happy (if they primarily want higher recall) and others upset (if they primarily want high performance). Additionally, the impact on recall and performance may be opaque to users.

Enable by default but decrease oversampling

A large factor in on-disk search performance is the degree of oversampling/rescoring in an index. We may be able to tune the oversampling factor when enhance_search is enabled. Right now the oversample_factor is set to 5.0 for dimensions < 1000 and to 3.0 otherwise. We could investigate how far the oversample_factor can be decreased when enhance_search is on. This approach would require a series of experiments, essentially a grid search over oversample factors, to determine whether there is a configuration that boosts recall for all test datasets without a latency penalty. If the community believes this is a promising direction, it can be pursued as a P1 task.

Enable based on characteristics of index (auto mode)

We found from benchmarking that text embedding data holds up very well to binary quantization, but image data (sift, gist) does not. We could investigate the datasets that perform poorly and try to find characteristics of the index that could be used to toggle enhance_search automatically. We could track metrics based on the data distribution of the index, perhaps using the new profiler API; unfortunately, the current profiler metrics of mean and variance are not sufficient for ADC. This investigation will take time and may not reveal new insights into when ADC should be turned on or off. Fundamentally, a user's decision about whether to use ADC depends on whether they are willing to trade search latency for search accuracy, and the exact latency hit and accuracy boost depend on the user's workload. We saw in the experiments that both vary significantly across datasets.

Coauthors

Jack Mazanec (@jmazanec15), Yonatan Naamad (@ynaamad), Vikash Tiwari (@Vikasht34), Tal Wagner (@talwagner)
