Reciprocal Rank Fusion (RRF) in TopDocs#13470
Conversation
jpountz
left a comment
There was a problem hiding this comment.
Thanks for looking into this! This looks like what I'd expect for RRF in Lucene. I left some comments, could you also add some tests?
| } | ||
| } | ||
|
|
||
| /** Reciprocal Rank Fusion method. */ |
There was a problem hiding this comment.
Users could use more details in these javadocs, e.g. what are k and topN?
There was a problem hiding this comment.
Do we have to check if the k value is greater than or equal to 1 in the code? And maybe mention it in the javadocs?
There was a problem hiding this comment.
+1 to validating parameters
|
|
||
| /** Reciprocal Rank Fusion method. */ | ||
| public static TopDocs rrf(int TopN, int k, TopDocs[] hits) { | ||
| Map<Integer, Float> rrfScore = new HashMap<>(); |
There was a problem hiding this comment.
Note that we should identify documents not only by their doc ID, but by the combination of ScoreDoc.shardIndex and ScoreDoc.doc, as there could be multiple documents that have the same doc ID but come from different shards.
| Map<Integer, Float> rrfScore = new HashMap<>(); | ||
| long minHits = Long.MAX_VALUE; | ||
| for (TopDocs topDoc : hits) { | ||
| minHits = Math.min(minHits, topDoc.totalHits.value); |
There was a problem hiding this comment.
I wonder if it should be a max rather than a min? Presumably, hits from either top hits are considered as hits globally, and we are just using RRF to boost hits that are ranked high in multiple top hits?
There was a problem hiding this comment.
The totalHits was a tricky part that we didn't know what value to assign to. IIUC, The totalHits means all the matched Document in a query, and we couldn't really calculate the union of the totalHits for all the TopDocs. So for this min totalHits, I just wanted to assign a min totalHits temporarily, to match the totalHits relation "greater than or equal to". And I want to ask for your opinion on this.
There was a problem hiding this comment.
I agree with using GREATER_THAN_OR_EQUAL_TO all the time, but I would still take the max, because a document is a match of it is a match to either query. For instance, imagine combining top hits from two queries where one query didn't match any docs, the total hit count should be the hit count of the other query, not zero?
There was a problem hiding this comment.
Make sense to me, thank you for the detailed explanation!!
| } | ||
|
|
||
| List<Map.Entry<Integer, Float>> scoreList = new ArrayList<>(scoreMap.entrySet()); | ||
| scoreList.sort(Map.Entry.comparingByValue()); |
There was a problem hiding this comment.
We don't seem to be using these scoreMap and scoreList anywhere?
There was a problem hiding this comment.
Oops! My bad. I think we got something wrong right here. The for loop traversal for (ScoreDoc scoreDoc : topDoc.scoreDocs) is wrong, we should actually traverse the sorted results, i.e., scoreList, to add the ranking result to rrfScore.
int rank = 1;
for (Map.Entry<Integer, Float> entry : scoreList.entrySet()) {
rrfScore.put(entry.getKey(),
rrfScore.getOrDefault(entry.getKey(), 0.0f) + 1 / (rank + k));
rank++;
}
P.S. For this part, however, I think we should determine the implementation of combining ScoreDoc.doc and ScoreDoc.shardIndex together first.
| } | ||
|
|
||
| /** Reciprocal Rank Fusion method. */ | ||
| public static TopDocs rrf(int TopN, int k, TopDocs[] hits) { |
There was a problem hiding this comment.
nit: function arguments should use lower camel case
| public static TopDocs rrf(int TopN, int k, TopDocs[] hits) { | |
| public static TopDocs rrf(int topN, int k, TopDocs[] hits) { |
|
|
||
| int rank = 1; | ||
| for (ScoreDoc scoreDoc : topDoc.scoreDocs) { | ||
| rrfScore.put(scoreDoc.doc, rrfScore.getOrDefault(scoreDoc.doc, 0.0f) + 1.0f / (rank + k)); |
There was a problem hiding this comment.
Use Map#compute instead of getOrDefault + put?
There was a problem hiding this comment.
Sounds good! I don't know Map#compute before tho. It should be like below:
int rank = 1;
for (Map.Entry<Integer, Float> entry : scoreList.entrySet()) {
rrfScore.compute(entry.getKey(), (key, value) ->
(value == null ? 0.0f : value) + 1 / (float)(rank + k)
);
rank++;
}
|
|
||
| ScoreDoc[] rrfScoreDocs = new ScoreDoc[Math.min(TopN, rrfScoreRank.size())]; | ||
| for (int i = 0; i < rrfScoreDocs.length; i++) { | ||
| rrfScoreDocs[i] = new ScoreDoc(rrfScoreRank.get(i).getKey(), rrfScoreRank.get(i).getValue()); |
There was a problem hiding this comment.
Nit: we should preserve the original shardIndex that is configured on the ScoreDoc object that identifies the shard that it is coming from.
There was a problem hiding this comment.
So this was also a tricky part for us. For my understanding, the RRF would combine search result based on the different ranks of a documents in different results. We supposed to combine the ranks for all individual doucments, but a document come from different shards should be treated as different documents?
There was a problem hiding this comment.
This is correct. When working with shards, hits should first be combined on a per-query basis using TopDocs#merge, which will set the shardIndex field. And then global top hits can be merged across queries with RRF.
There was a problem hiding this comment.
If we do the rrf by setting the unique key as docid and shardIndex, what would be the difference between TopDocs#rrf and TopDocs#merge? I think giving an example could express better. Suppose that we have two Shards, and we want to retrieve top 3 results from each shards and do rrf on top of them. There's three documents A, B and C. In Shard1, the top 3 is A -> B -> C. In Shard2, it's B -> C -> A. The original rrf method would calculate the rank by aggregating the docid, assume the constant k is 1. Top 3 results would be B (1/(k+2) + 1/(k+1)) - A (1/(k+1) + 1/(k+3)) - C (1/(k+3) + 1/(k+2)).
If we are going to consider the shardIndex as a unique key as well, how should the rrf rank to be presented.
There was a problem hiding this comment.
When working with shards, shards managed non-overlapping subsets of the data, so you could not have documents A, B and C in both shards.
|
I'm not sure 'rrf' should be a direct method in topDocs: N.B. I am generally in favour of "You are Not Gonna Need It' approach, but in Lucene's instance we have many contributors and future contributors that may get involved, and doing this abstraction work when and if "a second strategy" gets implemented may not happen |
|
I'm not worried about this. If we feel like we should expose it differently in the future, we'll do it, deprecate this function, and remove it in Lucene 11. |
- Parameters are now validated. - The shardIndex is now taken into account to identify hits. - The total hit count is computed is the max total hit count. - Unit tests. - Tie break on doc and shardIndex, consistently with TopDocs#merge.
|
@harenlin I took some freedom to apply my feedback and push it to your branch. Would you like to take a look and check if it makes sense? |
|
I plan on merging this PR soon if there are no objections. |
|
This looks good to me. Perhaps we could mark the new static method experimental, especially if we think we are going to want to support more ways of combining topdocs soon enough. I don't have a strong opinion though, it would also be ok to introduce a more flexible way to do rrf while keeping this one around until the next major. |
|
Thanks for taking a look. I have a bias for the latter, as I was planning on improving the docs of the oal.search package as a follow-up to provide guidance wrt how to do hybrid search by linking to this RRF helper. |
| for (TopDocs topDocs : hits) { | ||
| for (ScoreDoc scoreDoc : topDocs.scoreDocs) { | ||
| shardIndexSet = scoreDoc.shardIndex != -1; | ||
| break outer; |
There was a problem hiding this comment.
Is the purpose here to only check the first scoreDoc of every TopDocs instance provided in the array? Should we try and rewrite this to be more readable and not use goto ?
Co-authored-by: SuperSonicVox <hackchang0715@gmail.com> Co-authored-by: Adrien Grand <jpountz@gmail.com>
I opened #14310. |
Co-authored-by: SuperSonicVox <hackchang0715@gmail.com> Co-authored-by: Adrien Grand <jpountz@gmail.com>
Description
Hello the community,
Hank and I just follow the discussion thread to implement the RRF function that can be used. By the way, we know that the RRF issue is under debate in solr (FYR); however, we think this new feature could still be a good one.