
Added hybrid document pooling to LLM judgment API, improving judgment coverage for Hybrid Optimizer experiments #400

Open
martin-gaievski wants to merge 2 commits into
opensearch-project:feature/improved_judgment_coverage_for_hybrid_optimizer from
martin-gaievski:feature/improved_judgment_coverage_hs_opt

Conversation


@martin-gaievski martin-gaievski commented Feb 20, 2026

Description

Added an expanded-coverage feature for creating judgment ratings with an LLM. The main changes covered in this PR are:

  • added an expandCoverage param to the create judgment API; the flag is disabled by default to keep backward compatibility
  • the new feature works only with hybrid queries and dynamically adjusts the number of queries and their weights depending on the number of sub-queries
  • added a cleanup feature for the judgment cache, controlled by a new dynamic setting that defines the TTL for cache entries. The default is -1, meaning all entries are preserved indefinitely, making it an opt-in feature

API changes:

PUT _plugins/_search_relevance/judgments
{
  "type": "LLM_JUDGMENT",
  "querySetId": "e02b8da2-...",
  "searchConfigurationList": ["hybrid-search-config-id"],
  "modelId": "qSX4BpwBxkQx0oQo1B4g",
  "expandCoverage": true,    // ← new: opt-in, pools 3 hybrid variants
  "size": 10
}
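
The cache cleanup TTL described above is exposed as a dynamic setting; a minimal sketch of how enabling it might look, assuming a hypothetical setting name (the actual setting key is defined in the PR code and not shown in this description):

```json
PUT _cluster/settings
{
  "persistent": {
    "plugins.search_relevance.judgment_cache_ttl": "90d"
  }
}
```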

Results from all queries in the pool are deduplicated and merged into a single list, and that single list is sent to the LLM for rating.

Search 1 (equal weights):     [A, B, C, D, E]
Search 2 (keyword-only [1,0]): [A, B, F, G, H]    
Search 3 (neural-only [0,1]):  [A, C, I, J, K]
                                
allHits (putIfAbsent):         {A, B, C, D, E, F, G, H, I, J, K}   11 unique docs
                                
Cache dedup:                   remove any previously rated
                                
LLM call:                     single batch with all unique docs
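
The pooling step above can be sketched in Java. This is a simplified illustration using plain lists of doc ids rather than the plugin's actual SearchHit types; putIfAbsent keeps the first occurrence of each id and drops duplicates:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class HybridPooling {
    // Merge hits from several sub-query variants into one deduplicated list.
    public static List<String> poolUnique(List<List<String>> searchResults) {
        // LinkedHashMap preserves first-seen order; putIfAbsent skips duplicates
        Map<String, String> allHits = new LinkedHashMap<>();
        for (List<String> hits : searchResults) {
            for (String docId : hits) {
                allHits.putIfAbsent(docId, docId);
            }
        }
        return new ArrayList<>(allHits.keySet());
    }

    public static void main(String[] args) {
        List<String> pooled = poolUnique(List.of(
            List.of("A", "B", "C", "D", "E"),   // equal weights
            List.of("A", "B", "F", "G", "H"),   // keyword-only [1,0]
            List.of("A", "C", "I", "J", "K")    // neural-only [0,1]
        ));
        System.out.println(pooled.size()); // prints 11
    }
}
```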

I tested manually using a slightly tweaked demo script and a subset of the ESCI dataset bundled with the repo.

I ran 4 variants to test coverage impact and whether variation in ranking is caused by anything other than LLM indeterminism:

  • Exp1: generate ratings for 5 queries, all parameters are unset/false (25 total ratings)
  • Exp2: with expandCoverage (46 total ratings, +84% coverage)
  • Exp3: with both expandCoverage and overwriteCache (46 total ratings, fresh LLM calls)
  • Exp4: repeat Exp3 (46 total ratings, fresh LLM calls)
| Comparison | Docs | Exact | Close (≤0.1) | E+C | Avg Diff |
|---|---|---|---|---|---|
| Exp1 (baseline) vs Exp2 (expandCoverage) | 25/25 (100%) | 25/25 (100%) | 0/25 (0%) | 25/25 (100%) | 0.000 |
| Exp2 (cached) vs Exp3 (fresh1) | 46/46 (100%) | 39/46 (85%) | 4/46 (9%) | 43/46 (93%) | 0.026 |
| Exp2 (cached) vs Exp4 (fresh2) | 46/46 (100%) | 37/46 (80%) | 3/46 (7%) | 40/46 (87%) | 0.039 |
| Exp3 (fresh1) vs Exp4 (fresh2) | 46/46 (100%) | 41/46 (89%) | 0/46 (0%) | 41/46 (89%) | 0.026 |

Based on this we can conclude that expandCoverage produces a pure superset: all 25 baseline docs are present in the 46-doc expanded set with 100% identical ratings (via the cache). The rating variation (85-89% exact) is consistent across all fresh-vs-fresh and cached-vs-fresh pairs, confirming it is caused by LLM temperature/indeterminism, not by the larger document set.

Issues Resolved

#401

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@martin-gaievski martin-gaievski added the enhancement New feature or request label Feb 20, 2026
…oving judgment coverage for Hybrid Optimizer experiments

Signed-off-by: Martin Gaievski <gaievski@amazon.com>
@martin-gaievski martin-gaievski force-pushed the feature/improved_judgment_coverage_hs_opt branch from 3033cc5 to 642a4ca Compare February 20, 2026 00:42
@martin-gaievski martin-gaievski changed the title Added parameter to LLM judgment API for hybrid document pooling, improving judgment coverage for Hybrid Optimizer experiments Added hybrid document pooling to LLM judgment API, improving judgment coverage for Hybrid Optimizer experiments Feb 20, 2026
Signed-off-by: Martin Gaievski <gaievski@amazon.com>
@martin-gaievski martin-gaievski marked this pull request as ready for review February 20, 2026 21:19
@martin-gaievski martin-gaievski changed the base branch from main to feature/improved_judgment_coverage_for_hybrid_optimizer February 23, 2026 20:46
String promptTemplate,
LLMJudgmentRatingType ratingType,
boolean overwriteCache,
boolean expandCoverage,
Contributor


Can we use public static final String EXPAND_COVERAGE = "expandCoverage"; in MLConstants (next to OVERWRITE_CACHE) and use it in all three files?
The other files I saw are RestPutJudgmentAction.java and PutJudgmentTransportAction.java.

Every other parameter (OVERWRITE_CACHE, PROMPT_TEMPLATE, ...) has a named constant.
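
A sketch of what the suggested constant could look like (the class layout here is assumed for illustration, not the actual MLConstants file):

```java
public final class MLConstants {
    // Existing parameter-name constants (per the comment, these already exist)
    public static final String OVERWRITE_CACHE = "overwriteCache";
    // Proposed: shared constant for the new parameter, used in all three files
    public static final String EXPAND_COVERAGE = "expandCoverage";

    private MLConstants() {}
}
```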


log.info("Starting LLM judgment generation for {} total queries", totalQueries);

// Fire-and-forget cleanup of stale cache entries (older than 90 days)
Contributor


It's mentioned as "older than 90 days" but the actual TTL is configurable.

Also, even though it's fire-and-forget and the deleteByQuery is async, doesn't this generate an unnecessary OpenSearch deleteByQuery request on every API call?
Can we track the last cleanup time and skip if less than, say, 1 hour has passed, or run cleanup on a periodic schedule (maybe daily) rather than on every request?
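
A minimal sketch of the time-based throttling this comment suggests, assuming an in-memory timestamp guard (names are illustrative, not from the PR). compareAndSet ensures only one concurrent caller per interval actually triggers the deleteByQuery:

```java
import java.util.concurrent.atomic.AtomicLong;

public class CleanupThrottle {
    private static final long INTERVAL_MILLIS = 60 * 60 * 1000L; // 1 hour

    // Epoch millis of the last cleanup; 0 means no cleanup has run yet
    private final AtomicLong lastCleanupMillis = new AtomicLong(0L);

    /** Returns true if this caller should run the cleanup now. */
    public boolean shouldRunCleanup(long nowMillis) {
        long last = lastCleanupMillis.get();
        // Skip if the interval has not elapsed; CAS so only one caller wins
        return nowMillis - last >= INTERVAL_MILLIS
            && lastCleanupMillis.compareAndSet(last, nowMillis);
    }
}
```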

searchFutures.add(future.thenAccept(response -> {
    if (response.getHits().getTotalHits().value() > 0) {
        for (SearchHit hit : response.getHits().getHits()) {
            allHits.put(hit.getId(), hit);
Contributor


Can we use putIfAbsent in here as well?
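
A small illustration of the difference the comment is pointing at (doc ids and values here are made up): put overwrites the existing mapping for a duplicate key, while putIfAbsent keeps the first one seen, which is what the dedup pooling wants.

```java
import java.util.HashMap;
import java.util.Map;

public class PutIfAbsentDemo {
    // Insert the same key twice and return the surviving value.
    public static String resolve(boolean usePutIfAbsent) {
        Map<String, String> hits = new HashMap<>();
        if (usePutIfAbsent) {
            hits.putIfAbsent("doc1", "first");
            hits.putIfAbsent("doc1", "second"); // ignored, "first" is kept
        } else {
            hits.put("doc1", "first");
            hits.put("doc1", "second");         // overwrites to "second"
        }
        return hits.get("doc1");
    }

    public static void main(String[] args) {
        System.out.println(resolve(false) + " " + resolve(true)); // prints: second first
    }
}
```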

