Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Introduce support for Reciprocal Rank Fusion (combining queries) #2489

Open
wants to merge 14 commits into
base: main
Choose a base branch
from

Conversation

alessandrobenedetti
Copy link
Contributor

@alessandrobenedetti alessandrobenedetti commented May 31, 2024

https://issues.apache.org/jira/browse/SOLR-17319

Description

Reciprocal Rank Fusion (RRF) is an algorithm that takes in input multiple ranked lists to produce a unified result set.
Examples of use cases where RRF can be used include hybrid search and multiple Knn vector queries executed concurrently.
RRF is based on the concept of reciprocal rank, which is the inverse of the rank of a document in a ranked list of search results.
The combination of search results happens taking into account the position of
the items in the original rankings, and giving higher score to items that are ranked higher in multiple lists. RRF was introduced the first time by Cormack et al. in [1].
The syntax proposed:
JSON Request

{
    "queries": {
        "lexical1": {
            "lucene": {
                "query": "id:(10^=2 OR 2^=1 OR 4^=0.5)"
            }
        },
        "lexical2": {
            "lucene": {
                "query": "id:(2^=2 OR 4^=1 OR 3^=0.5)"
            }
        }
    },
    "limit": 10,
    "fields": "[id,score]",
    "params": {
        "combiner": true,
        "combiner.upTo": 5,
        "facet": true,
        "facet.field": "id",
        "facet.mincount": 1
    }
}

[1] Cormack, Gordon V. et al. “Reciprocal rank fusion outperforms condorcet and individual rank learning methods.” Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval (2009)

Solution

The support has been introduced leveraging the "queries" support in the JSON request syntax.
The combination of results happen at QueryComponent, processing level.
The debug component has the responsibility of managing the explainability part.

Limitations:

  • grouping/field collapsing won't be supported in a first version
  • the distributed support is present but not super advanced: query combination happens at node level and then results are combined at shard level.

Tests

Tests have been added to cover the main use cases

Checklist

Please review the following and check all that apply:

  • I have reviewed the guidelines for How to Contribute and my code conforms to the standards described there to the best of my ability.
  • I have created a Jira issue and added the issue ID to my pull request title.
  • I have given Solr maintainers access to contribute to my PR branch. (optional but recommended)
  • I have developed this patch against the main branch.
  • I have run ./gradlew check.
  • I have added tests for my changes.
  • I have added documentation for the Reference Guide

@alessandrobenedetti
Copy link
Contributor Author

As of today the solution is clean enough to be reviewed and tests are green.
The usage is intuitive and it should cover many use cases.
Pending activities:

  • add a new documentation page
  • add tests for the distributed scenario

@alessandrobenedetti alessandrobenedetti requested a review from epugh May 31, 2024 14:28
Map<Integer, Integer[]> docIdToRanks;
docIdToRanks = new HashMap<>();
for (int docID : combinedResultsDocIDs) {
docIdToRanks.put(docID, new Integer[docLists.size()]);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To do what? Isn't Integer.valueOf(int a) used to wrap an integer?
Can you elaborate?

Copy link
Contributor

@dsmiley dsmiley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't looked at this in detail but I'm wondering could we instead have a single 'q' query parser, that is itself a "query combiner" of its children clauses? Ultimately this would mean not needing all this special code in QueryComponent. I'm trying to make recommendations that "fit in" well with Solr's architecture; less bolted / hacked on.

CC @hossman I think you'd have good inputs on this one too

Comment on lines -55 to +67
"count(//lst[@name='explain']/*)=3",
"count(//lst[@name='explain']/*)=6",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How does this PR affect this assertion of an existing test?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added more docs in the setup of the test class, for the new debug tests I added.
I thought that changing the few lines of debug that were counting the total docs was not much of a big deal.
I can obviously change that if you think it is worth.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suppose I'd need to see a before & after to truly "see" the change.

@alessandrobenedetti
Copy link
Contributor Author

I haven't looked at this in detail but I'm wondering could we instead have a single 'q' query parser, that is itself a "query combiner" of its children clauses? Ultimately this would mean not needing all this special code in QueryComponent. I'm trying to make recommendations that "fit in" well with Solr's architecture; less bolted / hacked on.

CC @hossman I think you'd have good inputs on this one too

In principle I agree, correct me if I'm wrong, your proposed solution means:

  1. a Lucene contribution to create a new query
  2. a Solr query parser to combine N queries into the new Lucene one?

I would need to think and design this possible solution, but I have no idea when this will happen as I am afraid it would take a significant amount of time. (of course anyone is happy to help).

Do you think it would make sense to start refining this contribution, bake it in so that users have this new possibility decently quickly and then iteratively replace it?

It's not because I have a strong opinion on this, I am just realistically think what could bring value from my side, in accordance with my future availabilty.
Happy to discuss!

@dsmiley
Copy link
Contributor

dsmiley commented Jun 6, 2024

Want to quickly respond to this:

a Lucene contribution to create a new query

No! The Lucene project is not the only place where subclasses of Query classes exist ;-)

I'm leary of blessing committing something architecturally unclean (in my mind; no offense) that will supposedly be fixed later. I confess I'm not going to veto this as it's not all that bad really; it reads clearly, I understand this part of it so I shouldn't complain too loudly. But personally I'm not going to approve this PR and this wont' block your merge.

There are trade-offs admittedly... a special/hacky Query as I proposed (assuming it needs to be hacky?) is its own abuse of sorts; we have another -- PostFilter that isn't really a Query. But at least those are contained to a whole abstraction implementation. Meaning, ignore it if you don't care about it. But QueryComponent is doing a lot; it's already complex... more conditions and special cases here is something we should watch out for. QueryComponent is on the critical path; a place to fight against complexity. Long ago QC was relatively clean and then query grouping came along and well, what a mess that made!

Overall I wish we made proposals on the dev list about our ideas before showing up with an implementation. I want my colleagues to do the same so peer review can influence the approach early. I love reviewing other people's ideas before time is invested in an implementation! I had an ad-hoc call with another committer this week to do this sort of thing.

I'll see if I can look closer at your PR to think more about a specific recommendation. Sigh; I'm busy though

Copy link
Contributor

@dsmiley dsmiley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I should have spent just another 10 minutes reading this to understand what's going on; I understand it better now. I had overlooked the query loop. I don't think a new Query would be that helpul for this use-case after-all. Even if one were to exist, somewhere there would be code to detect it and loop it so where would that go... either QueryComponent or SolrIndexSearcher which is another complexity minefield of greater magnitude. It's tempting to think of a QueryComponent subclass for this use-case (and for that matter grouping).

BTW I like the query syntax, especially the JSON variant.

Sorry for any distress; +1 overall.

@alessandrobenedetti
Copy link
Contributor Author

I should have spent just another 10 minutes reading this to understand what's going on; I understand it better now. I had overlooked the query loop. I don't think a new Query would be that helpul for this use-case after-all. Even if one were to exist, somewhere there would be code to detect it and loop it so where would that go... either QueryComponent or SolrIndexSearcher which is another complexity minefield of greater magnitude. It's tempting to think of a QueryComponent subclass for this use-case (and for that matter grouping).

BTW I like the query syntax, especially the JSON variant.

Sorry for any distress; +1 overall.

No distress David! I'm not the kind of developer that get attached to its own solutions.
If there are better ways I'm always open to suggestions.

I also agree in principle with proposing ideas first, before spending days inplementing them, the only reason that I don't do that often is because I am not full time on Solr contributions right now, actually I end up at the moment with limited time to do them, so I prefer to produce something that could be valuable and iterate.
Having to wait for other volunteers time to discuss the idea could end up in exhausting the time I had allocated for the contrib and have to wait for the next availability window between clients.
Worst case scenario we end up with a discussion and we may ignore the Pull Request.

I remember at the beginning I tried query parsing and other approaches but the cleanest I got was the loop in the query component.

And I absolutely agree that component is messy, so after Berlin I'll spend a bit more time polishing the changes, and if in the meantime we come up with better ideas, the better!

No rush to merge, but at the same time if we get a decent consensus I would love to have this in as this feature will open many doors!

Thanks for the time you already spent on this, it's always very much appreciated!

Copy link
Contributor

@dsmiley dsmiley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I could still imagine a query parser being used to express the inputs, and might map to users understanding of things. Users know query parsers, know they can be rich and form a tree, know it matches documents. To the user, maybe this is a special query even though from an implementation perspective, it wouldn't really be implemented as such because the clauses needed to each be evaluated independently. The approach would get more logic out of QueryComponent (the feature parsing aspect). It would add more net code, however, albeit not much. It might better encapsulate the feature, even though the QueryComponent is an integral piece of what implements it. Heck, we could think of this as giving the QueryComponent to a special Query if we want, thus enabling interesting use cases of queries that can be in charge of producing their document list and other stuff.


String COMBINER_ALGORITHM = COMBINER + ".algorithm";

String COMBINER_KEYS = COMBINER + ".keys";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

WDYT of instead do combine.q multi-valued, so just supply each query as another value? One less indirection and thing to name.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mmmmm, I do believe it can be useful to have a "query key", especially because the debug component explains both the query parsing and combination:

0.032522473 =
1/(60+1) + 1/(60+2) because its ranks were: 1 for query(lexical), 2 for query(lexical2)

So the keys help in my opinion.
If you were referring to this specific parameter, this is not intended for the user, it gets used automatically by Solr.

Happy to elaborate!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

... combine.q multi-valued ...

In the case of multi-valued parameters, is ordering 'guaranteed' or could &combine.q=Foo&combine.q=Bar turn into &combine.q=Bar&combine.q=Foo and if it did would it matter?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AFAIK it's guaranteed; we use a List not a Set intentionally

@dsmiley
Copy link
Contributor

dsmiley commented Jun 16, 2024

I did the work of migrating to a QParser but don't have permissions to push to your PR branch.

Beware, I'm reverting most/all of the files that you modified of existing Solr source files. If you would prefer I submit a separate PR, I'll do that. Instead of combiner=true, you'd do q={!combine}.

Wouldn't the docSet (assuming it's needed, like for faceting) be computed on each subQuery, and if so how is that rolled up to the final QueryResult? I see you tested that faceting works but I suspect there may be a very sad performance bug here.

@alessandrobenedetti
Copy link
Contributor Author

I did the work of migrating to a QParser but don't have permissions to push to your PR branch.

Beware, I'm reverting most/all of the files that you modified of existing Solr source files. If you would prefer I submit a separate PR, I'll do that. Instead of combiner=true, you'd do q={!combine}.
I don't remember if this is the best way to invite you as a collaborator but I just added you on our repo, hopefully that works!

I wasn't entirely sure a Query Parser is the right place for this, given we are not really building a "new query" but then I thought : "hey we have the boolean query parser, that more or less does the same", so I have nothing against it!

You can go ahead and add your commit, I don't care much if you overwrote many of my changes, I just want a nice and clean solution, however we come up to it! Worst case scenario we go back in commits and re-use what's necessary!

Wouldn't the docSet (assuming it's needed, like for faceting) be computed on each subQuery, and if so how is that rolled up to the final QueryResult? I see you tested that faceting works but I suspect there may be a very sad performance bug here.

In regards to the docSet It was rolled up in:
SortedIntDocSet docSet = new SortedIntDocSet(combinedResultsDocIds, combinedResultsLength); combinedRankedList.setDocSet(docSet);

combinedResultsDocIds is the merged array of doc Ids.
I admit I haven't spent the required time on this part and it was quickly done as a draft, so feel free to recommend a different way/correct me!

Revert DebugComponent, ResponseBuilder, SolrPluginUtils
Copy link
Contributor

@dsmiley dsmiley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My commit here is itself a draft but significantly reduces the impact to existing Solr source files. Combiner tests pass.

I didn't tend to the debug-ability aspect very much. Just a draft :-)

Comment on lines +1692 to +1695
if (cmd.getQuery() instanceof SelfExecutingQuery) {
// TODO QueryResult ought not to be created and passed in methods of QueryComponent
result = ((SelfExecutingQuery) cmd.getQuery()).search(searcher, cmd);
} else {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here's where the minimal essential bit to alter QueryComponent by adding a new abstraction (SelfExecutingQuery) that might be useful for other features; not just "combining".

arr = true;
isQuery = true;
convertJsonPropertyToLocalParams(
newMap, jsonQueryConverter, queryJsonProperty, out, isQuery, arr);
}
newMap.put(COMBINER_KEYS, queryKeys.toArray(new String[0])); // nocommit a hack
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Making RequestUtils add a param specific to a niche feature is a hack. If we really need this here (?), I think we should use a generic name like "json.queries".

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 for a generic name
json.queries looks all right to me

public void search(QueryResult qr, QueryCommand cmd) throws IOException {
try {
getDocListC(qr, cmd);
} catch (FuzzyTermsEnum.FuzzyTermsException e) { // unsure where best to catch this; shrug
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I moved the FuzzyTermsException catch block from QueryComponent to here. I wanted it in a deeper place that would be caught by multiple inward code paths instead of just QC. It still feels it ought to be deeper but I didn't look for a better spot. CombineQuery calls into this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems all right to me, but I don't have a strong opinion

import org.apache.solr.search.SolrIndexSearcher;
import org.apache.solr.search.SyntaxError;

public class CombineQParserPlugin extends QParserPlugin {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This source file captures the essential aspect of my proposal. Needs another iteration of refinement at least (e.g. javadocs, and param naming / defaulting). Maybe more.

}
}

static final class CombineQuery extends ExtendedQueryBase
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A "SelfExecutingQuery" isn't normal; Lucene isn't going to work with this thing since it doesn't implement getWeight nor can it. But users don't need to know/care.

results[i] = qr;
}

// nocommit but how is the docSet (e.g. for faceting) or maybe other things supported?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like a fundamental limitation in the whole feature honestly. What Hossman said about re-rank query is interesting... not sure if it addresses this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mmm probably I have lost in the comments why the way I created the docSet was incorrect. Can you elaborate? (a nice occasion to learn how that docSet part works, as I didn't have time yet)

Comment on lines -55 to +67
"count(//lst[@name='explain']/*)=3",
"count(//lst[@name='explain']/*)=6",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suppose I'd need to see a before & after to truly "see" the change.

Copy link
Contributor Author

@alessandrobenedetti alessandrobenedetti left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like your changes David, they are less intrusive than my original version!
let me know once you do an additional refinement and I'll review them again

@@ -403,8 +402,9 @@ public void process(ResponseBuilder rb) throws IOException {

req.getContext().put(SolrIndexSearcher.STATS_SOURCE, statsCache.get(req));

// TODO QueryResult ought not to be created and passed in methods of QueryComponent.
// Should be created ~exclusively in SolrIndexSearcher and returned by it.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tend to not be a huge fan of comments in the code, what does this mean? it's just for your convenience in the draft?
If that's the case, ignore my comment

arr = true;
isQuery = true;
convertJsonPropertyToLocalParams(
newMap, jsonQueryConverter, queryJsonProperty, out, isQuery, arr);
}
newMap.put(COMBINER_KEYS, queryKeys.toArray(new String[0])); // nocommit a hack
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 for a generic name
json.queries looks all right to me

public void search(QueryResult qr, QueryCommand cmd) throws IOException {
try {
getDocListC(qr, cmd);
} catch (FuzzyTermsEnum.FuzzyTermsException e) { // unsure where best to catch this; shrug
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems all right to me, but I don't have a strong opinion

@Override
public Query parse() throws SyntaxError {

var queriesToCombineKeys = localParams.getParams("keys");
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not a huge fan of var as in my opinion it reduces readability, but if it's ok as stadnard in Solr, no strong opinion

}

@Override
public void addDebugInfo(NamedList<Object> dbg) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe 'addQueryDebugInfo' ?

results[i] = qr;
}

// nocommit but how is the docSet (e.g. for faceting) or maybe other things supported?
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mmm probably I have lost in the comments why the way I created the docSet was incorrect. Can you elaborate? (a nice occasion to learn how that docSet part works, as I didn't have time yet)

"inStock_b1",
"false"));
assertU(
adoc("id", "6", "title", "Gohan is the first saiyan-human hybrid.", "inStock_b1", "false"));
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as you can see @dsmiley here I added more docs, that's the reason they are now 6


String COMBINER_KEYS = COMBINER + ".keys";

String RECIPROCAl_RANK_FUSION = "rrf";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
String RECIPROCAl_RANK_FUSION = "rrf";
String RECIPROCAL_RANK_FUSION = "rrf";

var blendParams = SolrParams.wrapDefaults(localParams, params);
queriesCombiningStrategy = QueriesCombiner.getImplementation(blendParams);

String defType = blendParams.get(QueryParsing.DEFTYPE, DEFAULT_QTYPE);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe could support different defType for different keys, though bit of a niche thing that perhaps.

}
combinedRankedList.setSegmentTerminatedEarly(segmentTerminatedEarly);

combinedRankedList.setNextCursorMark(rankedLists[0].getNextCursorMark());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure about picking the first list here and/or what does use of a cursor mark mean when combining queries. Alternative might be to just disallow cursor marks when combining queries.

@dsmiley
Copy link
Contributor

dsmiley commented Jul 23, 2024

@cpoerschke see the parent JIRA issue for the real state of this controvertial PR

@cpoerschke
Copy link
Contributor

@cpoerschke see the parent JIRA issue for the real state of this controvertial PR

Thanks for the pointer! It inspired me to do some #2597 scribbling.

Copy link

This PR has had no activity for 60 days and is now labeled as stale. Any new activity or converting it to draft will remove the stale label. To attract more reviewers, please tag people who might be familiar with the code area and/or notify the [email protected] mailing list. Thank you for your contribution!

@github-actions github-actions bot added the stale PR not updated in 60 days label Sep 26, 2024
@github-actions github-actions bot removed the stale PR not updated in 60 days label Nov 22, 2024
Copy link

This PR has had no activity for 60 days and is now labeled as stale. Any new activity will remove the stale label. To attract more reviewers, please tag people who might be familiar with the code area and/or notify the [email protected] mailing list. To exempt this PR from being marked as stale, make it a draft PR or add the label "exempt-stale". If left unattended, this PR will be closed after another 60 days of inactivity. Thank you for your contribution!

@github-actions github-actions bot added the stale PR not updated in 60 days label Jan 22, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants