Added hybrid document pooling to LLM judgment API, improving judgment coverage for Hybrid Optimizer experiments #400
Conversation
```java
String promptTemplate,
LLMJudgmentRatingType ratingType,
boolean overwriteCache,
boolean expandCoverage,
```
Can we add `public static final String EXPAND_COVERAGE = "expandCoverage";` to MLConstants (next to OVERWRITE_CACHE) and use it in all three files?
The other files I saw are RestPutJudgmentAction.java and PutJudgmentTransportAction.java.
Every other parameter (OVERWRITE_CACHE, PROMPT_TEMPLATE, ...) has a named constant.
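A minimal sketch of the suggested change, assuming an `MLConstants` utility class that already holds the other parameter names (the neighboring constants and their values here are illustrative, not copied from the PR):

```java
// Illustrative sketch of MLConstants with the proposed EXPAND_COVERAGE constant.
// PROMPT_TEMPLATE and OVERWRITE_CACHE mirror the existing parameter names;
// exact contents of the real class may differ.
public final class MLConstants {
    public static final String PROMPT_TEMPLATE = "promptTemplate";
    public static final String OVERWRITE_CACHE = "overwriteCache";
    // Proposed addition, kept next to OVERWRITE_CACHE:
    public static final String EXPAND_COVERAGE = "expandCoverage";

    private MLConstants() {
        // utility class, not instantiable
    }
}
```

Callers in RestPutJudgmentAction.java and PutJudgmentTransportAction.java would then reference `MLConstants.EXPAND_COVERAGE` instead of repeating the literal string.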
```java
log.info("Starting LLM judgment generation for {} total queries", totalQueries);

// Fire-and-forget cleanup of stale cache entries (older than 90 days)
```
The comment says "older than 90 days", but the actual TTL is configurable.
Also, even though the cleanup is fire-and-forget and the deleteByQuery is async, doesn't this generate an unnecessary OpenSearch deleteByQuery request on every API call?
Can we track the last cleanup time and skip if less than, say, an hour has passed, or run cleanup on a periodic schedule (maybe daily) rather than on every request?
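One way the "track the last cleanup time" idea could look, as a hedged sketch (the class and method names here are hypothetical, not from the PR):

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical throttle: allow the stale-cache cleanup to run at most once
// per interval. CacheCleanupThrottle and shouldRunCleanup are illustrative
// names, not code from this PR.
public class CacheCleanupThrottle {
    private static final long MIN_INTERVAL_MILLIS = TimeUnit.HOURS.toMillis(1);
    private final AtomicLong lastCleanupMillis = new AtomicLong(0L);

    /**
     * Returns true at most once per MIN_INTERVAL_MILLIS; the compare-and-set
     * ensures only one concurrent caller wins the right to trigger cleanup.
     */
    public boolean shouldRunCleanup(long nowMillis) {
        long last = lastCleanupMillis.get();
        return nowMillis - last >= MIN_INTERVAL_MILLIS
            && lastCleanupMillis.compareAndSet(last, nowMillis);
    }
}
```

The request handler would call `shouldRunCleanup(System.currentTimeMillis())` before issuing the deleteByQuery and skip it otherwise; a scheduled daily job would avoid per-request work entirely.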
```java
searchFutures.add(future.thenAccept(response -> {
    if (response.getHits().getTotalHits().value() > 0) {
        for (SearchHit hit : response.getHits().getHits()) {
            allHits.put(hit.getId(), hit);
```
Can we use putIfAbsent here as well?
Description
Added an expanded coverage feature to create judgment ratings with an LLM. The main changes covered in this PR:

Added an `expandCoverage` param to the create judgment API; the flag is disabled by default to keep backward compatibility.

API changes:

Results from all queries in the pool are deduplicated and merged into a single list, and that single list is sent to an LLM for rating.
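The pooling step described above can be sketched as follows; `DocumentPooler` and `pool` are hypothetical names for illustration, not the PR's actual implementation:

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;

// Hypothetical sketch of hybrid document pooling: results from several
// sub-queries are merged into one deduplicated, order-preserving list
// before being sent to the LLM for rating.
public final class DocumentPooler {
    public static List<String> pool(List<List<String>> perQueryDocIds) {
        // LinkedHashSet drops duplicates while preserving first-seen order.
        LinkedHashSet<String> merged = new LinkedHashSet<>();
        for (List<String> docIds : perQueryDocIds) {
            merged.addAll(docIds);
        }
        return new ArrayList<>(merged);
    }
}
```

Deduplicating before the LLM call is what makes caching effective: a document shared by several sub-queries is rated once and reused.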
I tested manually using a slightly tweaked demo script and a subset of the ESCI dataset bundled with the repo.
I ran 4 variants to test coverage impact and whether variation in ranking is caused by anything other than LLM indeterminism:

- `expandCoverage` (46 total ratings, +84% coverage)
- `expandCoverage` and `overwriteCache` (46 total ratings, fresh LLM calls)

Based on this we can conclude that `expandCoverage` is a pure superset: all 25 baseline docs are present in the 46-doc expanded set with 100% identical ratings (via cache). Rating variation (85-89% exact) is consistent across all fresh-vs-fresh and cached-vs-fresh pairs, confirming it's LLM temperature/indeterminism, not the larger document set.
Issues Resolved
#401
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.