-
Notifications
You must be signed in to change notification settings - Fork 700
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Introduce support for Reciprocal Rank Fusion (combining queries) #2489
base: main
Are you sure you want to change the base?
Conversation
As of today the solution is clean enough to be reviewed and tests are green.
|
Map<Integer, Integer[]> docIdToRanks; | ||
docIdToRanks = new HashMap<>(); | ||
for (int docID : combinedResultsDocIDs) { | ||
docIdToRanks.put(docID, new Integer[docLists.size()]); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To do what? Isn't Integer.valueOf(int a) used to wrap an integer?
Can you elaborate?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I haven't looked at this in detail but I'm wondering could we instead have a single 'q' query parser, that is itself a "query combiner" of its children clauses? Ultimately this would mean not needing all this special code in QueryComponent. I'm trying to make recommendations that "fit in" well with Solr's architecture; less bolted / hacked on.
CC @hossman I think you'd have good inputs on this one too
"count(//lst[@name='explain']/*)=3", | ||
"count(//lst[@name='explain']/*)=6", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How does this PR affect this assertion of an existing test?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added more docs in the setup of the test class, for the new debug tests I added.
I thought that changing the few lines of debug that were counting the total docs was not much of a big deal.
I can obviously change that if you think it is worth.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suppose I'd need to see a before & after to truly "see" the change.
solr/core/src/java/org/apache/solr/request/json/RequestUtil.java
Outdated
Show resolved
Hide resolved
In principle I agree, correct me if I'm wrong, your proposed solution means:
I would need to think and design this possible solution, but I have no idea when this will happen as I am afraid it would take a significant amount of time. (of course anyone is happy to help). Do you think it would make sense to start refining this contribution, bake it in so that users have this new possibility decently quickly and then iteratively replace it? It's not because I have a strong opinion on this, I am just realistically think what could bring value from my side, in accordance with my future availabilty. |
Want to quickly respond to this:
No! The Lucene project is not the only place where subclasses of Query classes exist ;-) I'm leary of blessing committing something architecturally unclean (in my mind; no offense) that will supposedly be fixed later. I confess I'm not going to veto this as it's not all that bad really; it reads clearly, I understand this part of it so I shouldn't complain too loudly. But personally I'm not going to approve this PR and this wont' block your merge. There are trade-offs admittedly... a special/hacky Query as I proposed (assuming it needs to be hacky?) is its own abuse of sorts; we have another -- PostFilter that isn't really a Query. But at least those are contained to a whole abstraction implementation. Meaning, ignore it if you don't care about it. But QueryComponent is doing a lot; it's already complex... more conditions and special cases here is something we should watch out for. QueryComponent is on the critical path; a place to fight against complexity. Long ago QC was relatively clean and then query grouping came along and well, what a mess that made! Overall I wish we made proposals on the dev list about our ideas before showing up with an implementation. I want my colleagues to do the same so peer review can influence the approach early. I love reviewing other people's ideas before time is invested in an implementation! I had an ad-hoc call with another committer this week to do this sort of thing. I'll see if I can look closer at your PR to think more about a specific recommendation. Sigh; I'm busy though |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I should have spent just another 10 minutes reading this to understand what's going on; I understand it better now. I had overlooked the query loop. I don't think a new Query would be that helpul for this use-case after-all. Even if one were to exist, somewhere there would be code to detect it and loop it so where would that go... either QueryComponent or SolrIndexSearcher which is another complexity minefield of greater magnitude. It's tempting to think of a QueryComponent subclass for this use-case (and for that matter grouping).
BTW I like the query syntax, especially the JSON variant.
Sorry for any distress; +1 overall.
No distress David! I'm not the kind of developer that get attached to its own solutions. I also agree in principle with proposing ideas first, before spending days inplementing them, the only reason that I don't do that often is because I am not full time on Solr contributions right now, actually I end up at the moment with limited time to do them, so I prefer to produce something that could be valuable and iterate. I remember at the beginning I tried query parsing and other approaches but the cleanest I got was the loop in the query component. And I absolutely agree that component is messy, so after Berlin I'll spend a bit more time polishing the changes, and if in the meantime we come up with better ideas, the better! No rush to merge, but at the same time if we get a decent consensus I would love to have this in as this feature will open many doors! Thanks for the time you already spent on this, it's always very much appreciated! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I could still imagine a query parser being used to express the inputs, and might map to users understanding of things. Users know query parsers, know they can be rich and form a tree, know it matches documents. To the user, maybe this is a special query even though from an implementation perspective, it wouldn't really be implemented as such because the clauses needed to each be evaluated independently. The approach would get more logic out of QueryComponent (the feature parsing aspect). It would add more net code, however, albeit not much. It might better encapsulate the feature, even though the QueryComponent is an integral piece of what implements it. Heck, we could think of this as giving the QueryComponent to a special Query if we want, thus enabling interesting use cases of queries that can be in charge of producing their document list and other stuff.
solr/core/src/java/org/apache/solr/search/combining/QueriesCombiner.java
Outdated
Show resolved
Hide resolved
solr/core/src/java/org/apache/solr/search/combining/QueriesCombiner.java
Show resolved
Hide resolved
solr/core/src/java/org/apache/solr/handler/component/QueryComponent.java
Outdated
Show resolved
Hide resolved
|
||
String COMBINER_ALGORITHM = COMBINER + ".algorithm"; | ||
|
||
String COMBINER_KEYS = COMBINER + ".keys"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
WDYT of instead do combine.q
multi-valued, so just supply each query as another value? One less indirection and thing to name.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
mmmmm, I do believe it can be useful to have a "query key", especially because the debug component explains both the query parsing and combination:
0.032522473 =
1/(60+1) + 1/(60+2) because its ranks were: 1 for query(lexical), 2 for query(lexical2)
So the keys help in my opinion.
If you were referring to this specific parameter, this is not intended for the user, it gets used automatically by Solr.
Happy to elaborate!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
...
combine.q
multi-valued ...
In the case of multi-valued parameters, is ordering 'guaranteed' or could &combine.q=Foo&combine.q=Bar
turn into &combine.q=Bar&combine.q=Foo
and if it did would it matter?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
AFAIK it's guaranteed; we use a List not a Set intentionally
Co-authored-by: David Smiley <[email protected]>
I did the work of migrating to a QParser but don't have permissions to push to your PR branch. Beware, I'm reverting most/all of the files that you modified of existing Solr source files. If you would prefer I submit a separate PR, I'll do that. Instead of Wouldn't the docSet (assuming it's needed, like for faceting) be computed on each subQuery, and if so how is that rolled up to the final QueryResult? I see you tested that faceting works but I suspect there may be a very sad performance bug here. |
I wasn't entirely sure a Query Parser is the right place for this, given we are not really building a "new query" but then I thought : "hey we have the boolean query parser, that more or less does the same", so I have nothing against it! You can go ahead and add your commit, I don't care much if you overwrote many of my changes, I just want a nice and clean solution, however we come up to it! Worst case scenario we go back in commits and re-use what's necessary!
In regards to the docSet It was rolled up in: combinedResultsDocIds is the merged array of doc Ids. |
Revert DebugComponent, ResponseBuilder, SolrPluginUtils
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My commit here is itself a draft but significantly reduces the impact to existing Solr source files. Combiner tests pass.
I didn't tend to the debug-ability aspect very much. Just a draft :-)
if (cmd.getQuery() instanceof SelfExecutingQuery) { | ||
// TODO QueryResult ought not to be created and passed in methods of QueryComponent | ||
result = ((SelfExecutingQuery) cmd.getQuery()).search(searcher, cmd); | ||
} else { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here's where the minimal essential bit to alter QueryComponent by adding a new abstraction (SelfExecutingQuery) that might be useful for other features; not just "combining".
arr = true; | ||
isQuery = true; | ||
convertJsonPropertyToLocalParams( | ||
newMap, jsonQueryConverter, queryJsonProperty, out, isQuery, arr); | ||
} | ||
newMap.put(COMBINER_KEYS, queryKeys.toArray(new String[0])); // nocommit a hack |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Making RequestUtils add a param specific to a niche feature is a hack. If we really need this here (?), I think we should use a generic name like "json.queries".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 for a generic name
json.queries looks all right to me
public void search(QueryResult qr, QueryCommand cmd) throws IOException { | ||
try { | ||
getDocListC(qr, cmd); | ||
} catch (FuzzyTermsEnum.FuzzyTermsException e) { // unsure where best to catch this; shrug |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I moved the FuzzyTermsException catch block from QueryComponent to here. I wanted it in a deeper place that would be caught by multiple inward code paths instead of just QC. It still feels it ought to be deeper but I didn't look for a better spot. CombineQuery calls into this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems all right to me, but I don't have a strong opinion
import org.apache.solr.search.SolrIndexSearcher; | ||
import org.apache.solr.search.SyntaxError; | ||
|
||
public class CombineQParserPlugin extends QParserPlugin { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This source file captures the essential aspect of my proposal. Needs another iteration of refinement at least (e.g. javadocs, and param naming / defaulting). Maybe more.
} | ||
} | ||
|
||
static final class CombineQuery extends ExtendedQueryBase |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A "SelfExecutingQuery" isn't normal; Lucene isn't going to work with this thing since it doesn't implement getWeight nor can it. But users don't need to know/care.
results[i] = qr; | ||
} | ||
|
||
// nocommit but how is the docSet (e.g. for faceting) or maybe other things supported? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems like a fundamental limitation in the whole feature honestly. What Hossman said about re-rank query is interesting... not sure if it addresses this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
mmm probably I have lost in the comments why the way I created the docSet was incorrect. Can you elaborate? (a nice occasion to learn how that docSet part works, as I didn't have time yet)
"count(//lst[@name='explain']/*)=3", | ||
"count(//lst[@name='explain']/*)=6", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suppose I'd need to see a before & after to truly "see" the change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like your changes David, they are less intrusive than my original version!
let me know once you do an additional refinement and I'll review them again
@@ -403,8 +402,9 @@ public void process(ResponseBuilder rb) throws IOException { | |||
|
|||
req.getContext().put(SolrIndexSearcher.STATS_SOURCE, statsCache.get(req)); | |||
|
|||
// TODO QueryResult ought not to be created and passed in methods of QueryComponent. | |||
// Should be created ~exclusively in SolrIndexSearcher and returned by it. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tend to not be a huge fan of comments in the code, what does this mean? it's just for your convenience in the draft?
If that's the case, ignore my comment
arr = true; | ||
isQuery = true; | ||
convertJsonPropertyToLocalParams( | ||
newMap, jsonQueryConverter, queryJsonProperty, out, isQuery, arr); | ||
} | ||
newMap.put(COMBINER_KEYS, queryKeys.toArray(new String[0])); // nocommit a hack |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 for a generic name
json.queries looks all right to me
public void search(QueryResult qr, QueryCommand cmd) throws IOException { | ||
try { | ||
getDocListC(qr, cmd); | ||
} catch (FuzzyTermsEnum.FuzzyTermsException e) { // unsure where best to catch this; shrug |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems all right to me, but I don't have a strong opinion
@Override | ||
public Query parse() throws SyntaxError { | ||
|
||
var queriesToCombineKeys = localParams.getParams("keys"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not a huge fan of var as in my opinion it reduces readability, but if it's ok as stadnard in Solr, no strong opinion
} | ||
|
||
@Override | ||
public void addDebugInfo(NamedList<Object> dbg) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe 'addQueryDebugInfo' ?
results[i] = qr; | ||
} | ||
|
||
// nocommit but how is the docSet (e.g. for faceting) or maybe other things supported? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
mmm probably I have lost in the comments why the way I created the docSet was incorrect. Can you elaborate? (a nice occasion to learn how that docSet part works, as I didn't have time yet)
"inStock_b1", | ||
"false")); | ||
assertU( | ||
adoc("id", "6", "title", "Gohan is the first saiyan-human hybrid.", "inStock_b1", "false")); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
as you can see @dsmiley here I added more docs, that's the reason they are now 6
|
||
String COMBINER_KEYS = COMBINER + ".keys"; | ||
|
||
String RECIPROCAl_RANK_FUSION = "rrf"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
String RECIPROCAl_RANK_FUSION = "rrf"; | |
String RECIPROCAL_RANK_FUSION = "rrf"; |
var blendParams = SolrParams.wrapDefaults(localParams, params); | ||
queriesCombiningStrategy = QueriesCombiner.getImplementation(blendParams); | ||
|
||
String defType = blendParams.get(QueryParsing.DEFTYPE, DEFAULT_QTYPE); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe could support different defType
for different keys, though bit of a niche thing that perhaps.
} | ||
combinedRankedList.setSegmentTerminatedEarly(segmentTerminatedEarly); | ||
|
||
combinedRankedList.setNextCursorMark(rankedLists[0].getNextCursorMark()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure about picking the first list here and/or what does use of a cursor mark mean when combining queries. Alternative might be to just disallow cursor marks when combining queries.
@cpoerschke see the parent JIRA issue for the real state of this controvertial PR |
Thanks for the pointer! It inspired me to do some #2597 scribbling. |
This PR has had no activity for 60 days and is now labeled as stale. Any new activity or converting it to draft will remove the stale label. To attract more reviewers, please tag people who might be familiar with the code area and/or notify the [email protected] mailing list. Thank you for your contribution! |
This PR has had no activity for 60 days and is now labeled as stale. Any new activity will remove the stale label. To attract more reviewers, please tag people who might be familiar with the code area and/or notify the [email protected] mailing list. To exempt this PR from being marked as stale, make it a draft PR or add the label "exempt-stale". If left unattended, this PR will be closed after another 60 days of inactivity. Thank you for your contribution! |
https://issues.apache.org/jira/browse/SOLR-17319
Description
Reciprocal Rank Fusion (RRF) is an algorithm that takes in input multiple ranked lists to produce a unified result set.
Examples of use cases where RRF can be used include hybrid search and multiple Knn vector queries executed concurrently.
RRF is based on the concept of reciprocal rank, which is the inverse of the rank of a document in a ranked list of search results.
The combination of search results happens taking into account the position of
the items in the original rankings, and giving higher score to items that are ranked higher in multiple lists. RRF was introduced the first time by Cormack et al. in [1].
The syntax proposed:
JSON Request
[1] Cormack, Gordon V. et al. “Reciprocal rank fusion outperforms condorcet and individual rank learning methods.” Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval (2009)
Solution
The support has been introduced leveraging the "queries" support in the JSON request syntax.
The combination of results happen at QueryComponent, processing level.
The debug component has the responsibility of managing the explainability part.
Limitations:
Tests
Tests have been added to cover the main use cases
Checklist
Please review the following and check all that apply:
main
branch../gradlew check
.