[FEATURE] Add MUVERA ingest and search processors for multi-vector ANN prefetch #3163

@praveenMprasad

Description

Is your feature request related to a problem?

Multi-vector retrieval models like ColBERT and ColPali produce per-token embeddings that require MaxSim scoring across all token pairs. This is expensive at scale because there's no way to do ANN prefetch on variable-length multi-vector representations — you're forced to either brute-force score every document or rely on external tooling to pre-encode vectors client-side.

Currently, the k-NN plugin supports [lateInteractionScore](https://docs.opensearch.org/latest/query-dsl/specialized/script-score/#late-interaction-score) for MaxSim reranking, but the inner query is typically match_all or a text filter, meaning every matching document gets scored. There's no native way to narrow candidates using the multi-vector embeddings themselves.
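For context, MaxSim scoring (as used by lateInteractionScore) matches each query token against its best document token and sums the results. A minimal numpy sketch, not the plugin's implementation:

```python
import numpy as np

def max_sim(query_vecs: np.ndarray, doc_vecs: np.ndarray) -> float:
    """MaxSim: for each query token embedding, take the best-matching
    document token (by inner product), then sum across query tokens."""
    # (num_q, dim) @ (dim, num_d) -> (num_q, num_d) similarity matrix
    sims = query_vecs @ doc_vecs.T
    return float(sims.max(axis=1).sum())

q = np.array([[1.0, 0.0], [0.0, 1.0]])
d = np.array([[0.5, 0.5], [1.0, 0.0], [0.0, 0.2]])
# per-query-token maxima: max(0.5, 1.0, 0.0) = 1.0 and max(0.5, 0.0, 0.2) = 0.5
print(max_sim(q, d))  # 1.5
```

Every candidate document pays this cost over all token pairs, which is why narrowing the candidate set before MaxSim matters.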

What solution would you like?

Add two new processors implementing the MUVERA algorithm (Multi-Vector Retrieval via Fixed Dimensional Encodings, [paper](https://arxiv.org/abs/2405.19504)):

  1. muvera ingest processor — Converts variable-length multi-vector embeddings into a single fixed-dimensional encoding (FDE) vector using SimHash clustering and random projections. The FDE is stored in a knn_vector field for ANN indexing. The original multi-vectors remain in _source for reranking.

  2. muvera_query search request processor — Intercepts script_score queries containing query_vectors in script params, MUVERA-encodes them into an FDE, and replaces the inner match_all with a knn query on the FDE field. The lateInteractionScore script wrapper stays intact for MaxSim reranking on the prefetched candidates.

User flow

Step 1: Create ingest pipeline

PUT _ingest/pipeline/muvera-ingest
{
  "description": "MUVERA FDE encoding for ColBERT vectors",
  "processors": [
    {
      "muvera": {
        "source_field": "colbert_vectors",
        "target_field": "muvera_fde",
        "dim": 128,
        "fde_dimension": 2560
      }
    }
  ]
}

Defaults: k_sim=4, dim_proj=8, r_reps=20, seed=42, giving an FDE dimension of r_reps × 2^k_sim × dim_proj = 20 × 16 × 8 = 2560. The fde_dimension parameter is validated against this computed value so the user explicitly acknowledges the output size.
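To make the FDE construction concrete, here is an illustrative numpy sketch of the SimHash-bucket-plus-random-projection scheme. Parameter names mirror the defaults above, but the plugin's exact construction may differ (e.g. how empty buckets are filled, or document-side averaging vs. query-side summation per the MUVERA paper):

```python
import numpy as np

K_SIM, DIM_PROJ, R_REPS, DIM, SEED = 4, 8, 20, 128, 42

rng = np.random.default_rng(SEED)
hyperplanes = rng.standard_normal((R_REPS, K_SIM, DIM))     # SimHash planes
projections = rng.standard_normal((R_REPS, DIM_PROJ, DIM))  # random projections

def encode_fde(multi_vectors) -> np.ndarray:
    """Collapse a (num_tokens, DIM) multi-vector array into one
    fixed-dimensional encoding of length R_REPS * 2**K_SIM * DIM_PROJ."""
    vecs = np.asarray(multi_vectors, dtype=float)
    fde = np.zeros(R_REPS * (2 ** K_SIM) * DIM_PROJ)
    for rep in range(R_REPS):
        # SimHash bucket id: sign pattern against K_SIM hyperplanes
        bits = (vecs @ hyperplanes[rep].T) > 0               # (tokens, K_SIM)
        buckets = bits @ (1 << np.arange(K_SIM))             # (tokens,)
        reduced = vecs @ projections[rep].T                  # (tokens, DIM_PROJ)
        for b, v in zip(buckets, reduced):
            start = (rep * (2 ** K_SIM) + int(b)) * DIM_PROJ
            fde[start:start + DIM_PROJ] += v                 # aggregate per bucket
    return fde

tokens = np.random.default_rng(0).standard_normal((3, DIM))
print(encode_fde(tokens).shape)  # (2560,)
```

Because the hyperplanes and projections are seeded, the encoding is deterministic — which is why the ingest and search processors must share the same hyperparameters and seed.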

Step 2: Create index

PUT muvera-index
{
  "settings": {
    "index.knn": true,
    "default_pipeline": "muvera-ingest"
  },
  "mappings": {
    "dynamic": false,
    "properties": {
      "muvera_fde": {
        "type": "knn_vector",
        "dimension": 2560,
        "method": {
          "name": "hnsw",
          "space_type": "innerproduct",
          "engine": "faiss"
        }
      },
      "title": { "type": "text" }
    }
  }
}

Note: colbert_vectors is intentionally left unmapped; it stays in _source for reranking but doesn't need its own field mapping.

Step 3: Index documents

POST muvera-index/_doc/1
{
  "title": "example document",
  "colbert_vectors": [
    [0.1, 0.2, ...],
    [0.3, 0.4, ...],
    [0.5, 0.6, ...]
  ]
}

The ingest processor reads colbert_vectors, produces the FDE, and writes it to muvera_fde. Both fields end up in the stored document.

Step 4: Create search pipeline

PUT _search/pipeline/muvera-search
{
  "request_processors": [
    {
      "muvera_query": {
        "target_field": "muvera_fde",
        "dim": 128,
        "fde_dimension": 2560,
        "oversample_factor": 4
      }
    }
  ]
}

The MUVERA hyperparameters must match the ones used in the ingest pipeline. oversample_factor controls how many candidates the knn prefetch retrieves relative to the requested result size.

Step 5: Search

POST muvera-index/_search?search_pipeline=muvera-search
{
  "size": 10,
  "query": {
    "script_score": {
      "query": { "match_all": {} },
      "script": {
        "source": "lateInteractionScore(params.query_vectors, 'colbert_vectors', params._source, params.space_type)",
        "params": {
          "query_vectors": [[0.1, 0.2, ...], [0.3, 0.4, ...]],
          "space_type": "innerproduct"
        }
      }
    }
  }
}

What happens:

  1. Search processor extracts query_vectors from script params
  2. MUVERA-encodes them into a query FDE
  3. Replaces match_all with knn on muvera_fde (k = size × oversample_factor = 40)
  4. lateInteractionScore reranks the 40 candidates using exact MaxSim on original multi-vectors
  5. Top 10 returned to user
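The rewrite in steps 1–3 can be sketched as a plain dict transformation. Function and parameter names here are illustrative, not the plugin's actual API:

```python
def rewrite_script_score(request, target_field, oversample_factor, encode_fde):
    """Sketch of the muvera_query rewrite: pull query_vectors out of the
    script params, encode them into a query FDE, and swap the inner
    match_all for a knn prefetch, keeping the script_score wrapper intact."""
    script_score = request["query"]["script_score"]
    vectors = script_score["script"]["params"]["query_vectors"]
    script_score["query"] = {
        "knn": {
            target_field: {
                "vector": encode_fde(vectors),
                "k": request["size"] * oversample_factor,  # e.g. 10 * 4 = 40
            }
        }
    }
    return request

request = {
    "size": 10,
    "query": {
        "script_score": {
            "query": {"match_all": {}},
            "script": {"params": {"query_vectors": [[0.1, 0.2], [0.3, 0.4]]}},
        }
    },
}
# Stand-in encoder for illustration; the real processor runs MUVERA encoding.
rewritten = rewrite_script_score(request, "muvera_fde", 4, lambda v: [0.0] * 2560)
print(rewritten["query"]["script_score"]["query"]["knn"]["muvera_fde"]["k"])  # 40
```

The script wrapper and its params are untouched, so lateInteractionScore still sees the original query_vectors for the exact MaxSim rerank.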

What alternatives have you considered?

  • Client-side MUVERA encoding (works but requires users to maintain encoding logic outside OpenSearch)
  • Binary quantization of multi-vectors (lossy, doesn't preserve MaxSim structure)
  • Text-based prefetch with BM25 (misses semantic signal from embeddings)

Do you have any additional context?

  • MUVERA is already implemented in [fastembed](https://github.com/qdrant/fastembed) (Python) and used in production with Qdrant
  • We have a working implementation with unit tests, stable across multiple iterations with random seeds
  • The implementation uses only public APIs — no reflection or core OpenSearch modifications required
  • Tested end-to-end on a live cluster: ingest pipeline creates FDE vectors, search pipeline rewrites queries, lateInteractionScore reranking produces correct MaxSim scores
