-
Notifications
You must be signed in to change notification settings - Fork 1.3k
Reciprocal Rank Fusion (RRF) in TopDocs #13470
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from 7 commits
c6da395
9a7b57b
57327aa
ce55148
4a6e10f
ccb940f
8e21bbc
c018e50
9579816
8d8f760
68b041c
99c616c
3164006
67cc61f
2af86e3
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change | ||||
|---|---|---|---|---|---|---|
|
|
@@ -16,7 +16,11 @@ | |||||
| */ | ||||||
| package org.apache.lucene.search; | ||||||
|
|
||||||
| import java.util.ArrayList; | ||||||
| import java.util.Comparator; | ||||||
| import java.util.HashMap; | ||||||
| import java.util.List; | ||||||
| import java.util.Map; | ||||||
| import org.apache.lucene.util.PriorityQueue; | ||||||
|
|
||||||
| /** Represents hits returned by {@link IndexSearcher#search(Query,int)}. */ | ||||||
|
|
@@ -350,4 +354,38 @@ private static TopDocs mergeAux( | |||||
| return new TopFieldDocs(totalHits, hits, sort.getSort()); | ||||||
| } | ||||||
| } | ||||||
|
|
||||||
| /** Reciprocal Rank Fusion method. */ | ||||||
| public static TopDocs rrf(int TopN, int k, TopDocs[] hits) { | ||||||
|
||||||
| public static TopDocs rrf(int TopN, int k, TopDocs[] hits) { | |
| public static TopDocs rrf(int topN, int k, TopDocs[] hits) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oops, my bad.
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note that we should identify documents not only by their doc ID, but by the combination of ScoreDoc.shardIndex and ScoreDoc.doc, as there could be multiple documents that have the same doc ID but come from different shards.
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if it should be a max rather than a min? Presumably, hits from either top hits are considered as hits globally, and we are just using RRF to boost hits that are ranked high in multiple top hits?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The totalHits was a tricky part that we didn't know what value to assign to. IIUC, The totalHits means all the matched Document in a query, and we couldn't really calculate the union of the totalHits for all the TopDocs. So for this min totalHits, I just wanted to assign a min totalHits temporarily, to match the totalHits relation "greater than or equal to". And I want to ask for your opinion on this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with using GREATER_THAN_OR_EQUAL_TO all the time, but I would still take the max, because a document is a match of it is a match to either query. For instance, imagine combining top hits from two queries where one query didn't match any docs, the total hit count should be the hit count of the other query, not zero?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Make sense to me, thank you for the detailed explanation!!
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't seem to be using these scoreMap and scoreList anywhere?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oops! My bad. I think we got something wrong right here. The for loop traversal for (ScoreDoc scoreDoc : topDoc.scoreDocs) is wrong, we should actually traverse the sorted results, i.e., scoreList, to add the ranking result to rrfScore.
int rank = 1;
for (Map.Entry<Integer, Float> entry : scoreList.entrySet()) {
rrfScore.put(entry.getKey(),
rrfScore.getOrDefault(entry.getKey(), 0.0f) + 1 / (rank + k));
rank++;
}
P.S. For this part, however, I think we should determine the implementation of combining ScoreDoc.doc and ScoreDoc.shardIndex together first.
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use Map#compute instead of getOrDefault + put?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good! I don't know Map#compute before tho. It should be like below:
int rank = 1;
for (Map.Entry<Integer, Float> entry : scoreList.entrySet()) {
rrfScore.compute(entry.getKey(), (key, value) ->
(value == null ? 0.0f : value) + 1 / (float)(rank + k)
);
rank++;
}
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: we should preserve the original shardIndex that is configured on the ScoreDoc object that identifies the shard that it is coming from.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So this was also a tricky part for us. For my understanding, the RRF would combine search result based on the different ranks of a documents in different results. We supposed to combine the ranks for all individual doucments, but a document come from different shards should be treated as different documents?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is correct. When working with shards, hits should first be combined on a per-query basis using TopDocs#merge, which will set the shardIndex field. And then global top hits can be merged across queries with RRF.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we do the rrf by setting the unique key as docid and shardIndex, what would be the difference between TopDocs#rrf and TopDocs#merge? I think giving an example could express better. Suppose that we have two Shards, and we want to retrieve top 3 results from each shards and do rrf on top of them. There's three documents A, B and C. In Shard1, the top 3 is A -> B -> C. In Shard2, it's B -> C -> A. The original rrf method would calculate the rank by aggregating the docid, assume the constant k is 1. Top 3 results would be B (1/(k+2) + 1/(k+1)) - A (1/(k+1) + 1/(k+3)) - C (1/(k+3) + 1/(k+2)).
If we are going to consider the shardIndex as a unique key as well, how should the rrf rank to be presented.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When working with shards, shards managed non-overlapping subsets of the data, so you could not have documents A, B and C in both shards.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Users could use more details in these javadocs, e.g. what are
kandtopN?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we have to check if the k value is greater than or equal to 1 in the code? And maybe mention it in the javadocs?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 to validating parameters