Added hybrid document pooling to LLM judgment API, improving judgment coverage for Hybrid Optimizer experiments #400
Conversation
```java
String promptTemplate,
LLMJudgmentRatingType ratingType,
boolean overwriteCache,
boolean expandCoverage,
```
Can we add `public static final String EXPAND_COVERAGE = "expandCoverage";` to MLConstants (next to OVERWRITE_CACHE) and use it in all three files?
The other files I saw are RestPutJudgmentAction.java and PutJudgmentTransportAction.java.
Every other parameter (OVERWRITE_CACHE, PROMPT_TEMPLATE, ...) has a named constant.
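A minimal sketch of the suggested change, assuming an `MLConstants` utility class that already holds the other parameter names (the neighboring constants and their values here are illustrative, not copied from the PR):

```java
// Illustrative sketch of MLConstants with the proposed EXPAND_COVERAGE constant.
// PROMPT_TEMPLATE and OVERWRITE_CACHE mirror the existing parameter names;
// exact contents of the real class may differ.
public final class MLConstants {
    public static final String PROMPT_TEMPLATE = "promptTemplate";
    public static final String OVERWRITE_CACHE = "overwriteCache";
    // Proposed addition, kept next to OVERWRITE_CACHE:
    public static final String EXPAND_COVERAGE = "expandCoverage";

    private MLConstants() {
        // utility class, not instantiable
    }
}
```

Callers in RestPutJudgmentAction.java and PutJudgmentTransportAction.java would then reference `MLConstants.EXPAND_COVERAGE` instead of repeating the literal string.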
```java
log.info("Starting LLM judgment generation for {} total queries", totalQueries);

// Fire-and-forget cleanup of stale cache entries (older than 90 days)
```
The comment says "older than 90 days", but the actual TTL is configurable.
Also, even though the cleanup is fire-and-forget and the deleteByQuery is async, doesn't this generate an unnecessary OpenSearch deleteByQuery request on every API call?
Can we track the last cleanup time and skip if less than, say, an hour has passed, or run cleanup on a periodic schedule (maybe daily) rather than on every request?
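One way the "track the last cleanup time" idea could look, as a hedged sketch (the class and method names here are hypothetical, not from the PR):

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical throttle: allow the stale-cache cleanup to run at most once
// per interval. CacheCleanupThrottle and shouldRunCleanup are illustrative
// names, not code from this PR.
public class CacheCleanupThrottle {
    private static final long MIN_INTERVAL_MILLIS = TimeUnit.HOURS.toMillis(1);
    private final AtomicLong lastCleanupMillis = new AtomicLong(0L);

    /**
     * Returns true at most once per MIN_INTERVAL_MILLIS; the compare-and-set
     * ensures only one concurrent caller wins the right to trigger cleanup.
     */
    public boolean shouldRunCleanup(long nowMillis) {
        long last = lastCleanupMillis.get();
        return nowMillis - last >= MIN_INTERVAL_MILLIS
            && lastCleanupMillis.compareAndSet(last, nowMillis);
    }
}
```

The request handler would call `shouldRunCleanup(System.currentTimeMillis())` before issuing the deleteByQuery and skip it otherwise; a scheduled daily job would avoid per-request work entirely.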
```java
searchFutures.add(future.thenAccept(response -> {
    if (response.getHits().getTotalHits().value() > 0) {
        for (SearchHit hit : response.getHits().getHits()) {
            allHits.put(hit.getId(), hit);
```
Can we use putIfAbsent here as well?
Description
Added an expanded coverage feature to create judgment ratings with an LLM. The main changes covered in this PR:

Added an `expandCoverage` param to the create judgment API; the flag is disabled by default to keep backward compatibility.

API changes:

Results from all queries in the pool are deduplicated and merged into a single list, and that single list is sent to an LLM for rating.
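The pooling step described above can be sketched as follows; `DocumentPooler` and `pool` are hypothetical names for illustration, not the PR's actual implementation:

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;

// Hypothetical sketch of hybrid document pooling: results from several
// sub-queries are merged into one deduplicated, order-preserving list
// before being sent to the LLM for rating.
public final class DocumentPooler {
    public static List<String> pool(List<List<String>> perQueryDocIds) {
        // LinkedHashSet drops duplicates while preserving first-seen order.
        LinkedHashSet<String> merged = new LinkedHashSet<>();
        for (List<String> docIds : perQueryDocIds) {
            merged.addAll(docIds);
        }
        return new ArrayList<>(merged);
    }
}
```

Deduplicating before the LLM call is what makes caching effective: a document shared by several sub-queries is rated once and reused.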
I tested manually using a slightly tweaked demo script and a subset of the ESCI dataset bundled with the repo.
I ran 4 variants to test coverage impact and whether variation in ranking is caused by anything other than LLM indeterminism:

- `expandCoverage` (46 total ratings, +84% coverage)
- `expandCoverage` and `overwriteCache` (46 total ratings, fresh LLM calls)

Based on this we can conclude that `expandCoverage` is a pure superset: all 25 baseline docs are present in the 46-doc expanded set with 100% identical ratings (via cache). Rating variation (85-89% exact) is consistent across all fresh-vs-fresh and cached-vs-fresh pairs, confirming it's LLM temperature/indeterminism, not the larger document set.
Issues Resolved
#401
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.