Description
Is your feature request related to a problem?
Multi-vector retrieval models like ColBERT and ColPali produce per-token embeddings that require MaxSim scoring across all token pairs. This is expensive at scale because there is no way to run an ANN prefetch over variable-length multi-vector representations: you are forced to either brute-force score every document or rely on external tooling to pre-encode vectors client-side.
Currently, the k-NN plugin supports [lateInteractionScore](https://docs.opensearch.org/latest/query-dsl/specialized/script-score/#late-interaction-score) for MaxSim reranking, but the inner query is typically match_all or a text filter, meaning every matching document gets scored. There's no native way to narrow candidates using the multi-vector embeddings themselves.
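To make the cost concrete: MaxSim scores a document by taking, for each query token embedding, its best inner-product match among the document's token embeddings, then summing over query tokens. A toy NumPy sketch (illustrative only, not the plugin's implementation):

```python
import numpy as np

def maxsim(query_vecs: np.ndarray, doc_vecs: np.ndarray) -> float:
    """Late-interaction MaxSim: best doc-token match per query token, summed."""
    # (num_query_tokens, num_doc_tokens) similarity matrix
    sims = query_vecs @ doc_vecs.T
    return float(sims.max(axis=1).sum())

# Toy example: 2 query tokens, 3 doc tokens, dim 4
q = np.array([[1.0, 0.0, 0.0, 0.0],
              [0.0, 1.0, 0.0, 0.0]])
d = np.array([[0.9, 0.1, 0.0, 0.0],
              [0.0, 0.8, 0.2, 0.0],
              [0.1, 0.1, 0.8, 0.0]])
print(round(maxsim(q, d), 6))  # 0.9 + 0.8 = 1.7
```

Doing this against every matching document is what the FDE prefetch is meant to avoid.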
What solution would you like?
Add two new processors implementing the MUVERA algorithm (Multi-Vector Retrieval via Fixed Dimensional Encodings, [paper](https://arxiv.org/abs/2405.19504)):
- `muvera` ingest processor — Converts variable-length multi-vector embeddings into a single fixed-dimensional encoding (FDE) vector using SimHash clustering and random projections. The FDE is stored in a `knn_vector` field for ANN indexing. The original multi-vectors remain in `_source` for reranking.
- `muvera_query` search request processor — Intercepts `script_score` queries containing `query_vectors` in script params, MUVERA-encodes them into an FDE, and replaces the inner `match_all` with a `knn` query on the FDE field. The `lateInteractionScore` script wrapper stays intact for MaxSim reranking on the prefetched candidates.
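As a rough illustration of the encoding step, here is a simplified NumPy sketch of FDE construction along the lines of the MUVERA paper: SimHash bucketing of token vectors, per-bucket aggregation, and a random ±1 projection, repeated and concatenated. The `fde_encode` name is hypothetical, and the sketch omits details such as empty-bucket filling, so it is not byte-compatible with the proposed processor:

```python
import numpy as np

def fde_encode(vecs, k_sim=4, dim_proj=8, r_reps=20, seed=42, is_query=False):
    """Simplified MUVERA FDE sketch. Output dim = r_reps * 2**k_sim * dim_proj."""
    vecs = np.asarray(vecs, dtype=np.float64)
    dim = vecs.shape[1]
    rng = np.random.default_rng(seed)
    chunks = []
    for _ in range(r_reps):
        # k_sim random hyperplanes -> 2**k_sim SimHash buckets
        planes = rng.standard_normal((k_sim, dim))
        bits = (vecs @ planes.T > 0).astype(int)      # (n_tokens, k_sim) sign bits
        buckets = bits @ (1 << np.arange(k_sim))      # bucket id per token
        # random +/-1 projection down to dim_proj
        proj = rng.choice([-1.0, 1.0], size=(dim_proj, dim))
        for b in range(2 ** k_sim):
            members = vecs[buckets == b]
            if len(members) == 0:
                # real MUVERA fills empty buckets from the nearest non-empty one
                chunks.append(np.zeros(dim_proj))
                continue
            # queries sum per bucket; documents average (centroid)
            agg = members.sum(0) if is_query else members.mean(0)
            chunks.append(proj @ agg / np.sqrt(dim_proj))
    return np.concatenate(chunks)

# 30 ColBERT-style token vectors of dim 128 -> one 2560-dim FDE
fde = fde_encode(np.random.default_rng(0).standard_normal((30, 128)))
print(fde.shape)  # (2560,)
```

The key property is that the inner product of a query FDE and a document FDE approximates the MaxSim score, which is what makes the `knn` prefetch meaningful.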
User flow
Step 1: Create ingest pipeline
```json
PUT _ingest/pipeline/muvera-ingest
{
  "description": "MUVERA FDE encoding for ColBERT vectors",
  "processors": [
    {
      "muvera": {
        "source_field": "colbert_vectors",
        "target_field": "muvera_fde",
        "dim": 128,
        "fde_dimension": 2560
      }
    }
  ]
}
```

Defaults: `k_sim=4`, `dim_proj=8`, `r_reps=20`, `seed=42`. FDE dimension = `r_reps * 2^k_sim * dim_proj` = 2560. The `fde_dimension` parameter validates the computed value so the user explicitly acknowledges the output size.
Step 2: Create index
```json
PUT muvera-index
{
  "settings": {
    "index.knn": true,
    "default_pipeline": "muvera-ingest"
  },
  "mappings": {
    "dynamic": false,
    "properties": {
      "muvera_fde": {
        "type": "knn_vector",
        "dimension": 2560,
        "method": {
          "name": "hnsw",
          "space_type": "innerproduct",
          "engine": "faiss"
        }
      },
      "title": { "type": "text" }
    }
  }
}
```

Note: `colbert_vectors` is intentionally left unmapped; it stays in `_source` for reranking but doesn't need its own field mapping.
Step 3: Index documents
```json
POST muvera-index/_doc/1
{
  "title": "example document",
  "colbert_vectors": [
    [0.1, 0.2, ...],
    [0.3, 0.4, ...],
    [0.5, 0.6, ...]
  ]
}
```

The ingest processor reads `colbert_vectors`, produces the FDE, and writes it to `muvera_fde`. Both fields end up in the stored document.
Step 4: Create search pipeline
```json
PUT _search/pipeline/muvera-search
{
  "request_processors": [
    {
      "muvera_query": {
        "target_field": "muvera_fde",
        "dim": 128,
        "fde_dimension": 2560,
        "oversample_factor": 4
      }
    }
  ]
}
```

Same MUVERA hyperparameters as the ingest pipeline (they must match). `oversample_factor` controls how many candidates the `knn` prefetch retrieves relative to the requested result size.
Step 5: Search
```json
POST muvera-index/_search?search_pipeline=muvera-search
{
  "size": 10,
  "query": {
    "script_score": {
      "query": { "match_all": {} },
      "script": {
        "source": "lateInteractionScore(params.query_vectors, 'colbert_vectors', params._source, params.space_type)",
        "params": {
          "query_vectors": [[0.1, 0.2, ...], [0.3, 0.4, ...]],
          "space_type": "innerproduct"
        }
      }
    }
  }
}
```

What happens:
- The search processor extracts `query_vectors` from the script params
- MUVERA-encodes them into a query FDE
- Replaces `match_all` with a `knn` query on `muvera_fde` (k = size × oversample_factor = 40)
- `lateInteractionScore` reranks the 40 candidates using exact MaxSim on the original multi-vectors
- The top 10 are returned to the user
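The request rewrite can be pictured as a simple transformation of the search body. The sketch below is illustrative Python; `rewrite_query` is a hypothetical helper and the FDE encoder is injected as a callable, whereas the real processor would do this in Java inside the search pipeline:

```python
def rewrite_query(search_body, encode_fde, target_field="muvera_fde",
                  oversample_factor=4):
    """Sketch of the muvera_query rewrite: pull query_vectors out of the
    script_score params, FDE-encode them, and swap the inner match_all
    for a knn prefetch. The lateInteractionScore script is untouched."""
    script_score = search_body["query"]["script_score"]
    query_vectors = script_score["script"]["params"]["query_vectors"]
    k = search_body.get("size", 10) * oversample_factor
    script_score["query"] = {
        "knn": {
            target_field: {
                "vector": list(encode_fde(query_vectors)),
                "k": k,
            }
        }
    }
    return search_body

body = {
    "size": 10,
    "query": {"script_score": {
        "query": {"match_all": {}},
        "script": {
            "source": "lateInteractionScore(...)",
            "params": {"query_vectors": [[0.1, 0.2]],
                       "space_type": "innerproduct"},
        },
    }},
}
# Stub encoder standing in for the MUVERA FDE encoding
rewritten = rewrite_query(body, encode_fde=lambda qv: [0.0] * 2560)
print(rewritten["query"]["script_score"]["query"]["knn"]["muvera_fde"]["k"])  # 40
```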
What alternatives have you considered?
- Client-side MUVERA encoding (works but requires users to maintain encoding logic outside OpenSearch)
- Binary quantization of multi-vectors (lossy, doesn't preserve MaxSim structure)
- Text-based prefetch with BM25 (misses semantic signal from embeddings)
Do you have any additional context?
- MUVERA is already implemented in [fastembed](https://github.com/qdrant/fastembed) (Python) and used in production with Qdrant
- We have a working implementation with unit tests, stable across multiple iterations with random seeds
- The implementation uses only public APIs — no reflection or core OpenSearch modifications required
- Tested end-to-end on a live cluster: the ingest pipeline creates FDE vectors, the search pipeline rewrites queries, and `lateInteractionScore` reranking produces correct MaxSim scores