Reciprocal Rank Fusion (RRF) in TopDocs by harenlin · Pull Request #13470 · apache/lucene

harenlin · 2024-06-07T18:45:25Z

Description

Hello community,

Hank and I followed the discussion thread and implemented an RRF function. We know that RRF is still under debate in Solr (FYR); however, we think this new feature could still be a good addition.

jpountz

Thanks for looking into this! This looks like what I'd expect for RRF in Lucene. I left some comments, could you also add some tests?

jpountz · 2024-06-08T09:09:33Z

lucene/core/src/java/org/apache/lucene/search/TopDocs.java

    }
  }
+
+  /** Reciprocal Rank Fusion method. */


Users could use more details in these javadocs, e.g. what are k and topN?

Do we have to check if the k value is greater than or equal to 1 in the code? And maybe mention it in the javadocs?

+1 to validating parameters
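A minimal sketch of the validation being asked for. The k >= 1 bound comes from the comment above; the topN >= 1 bound, the class name `RrfArgs` and the exception messages are assumptions for illustration, not the code that was merged.

```java
final class RrfArgs {
  private RrfArgs() {}

  /** Rejects arguments that make no sense for rank fusion. */
  static void validate(int topN, int k) {
    if (topN < 1) { // assumed bound: asking for zero hits is treated as an error
      throw new IllegalArgumentException("topN must be >= 1, got " + topN);
    }
    if (k < 1) { // bound suggested in the review comment
      throw new IllegalArgumentException("k must be >= 1, got " + k);
    }
  }

  public static void main(String[] args) {
    validate(10, 60); // accepted
    try {
      validate(10, 0);
    } catch (IllegalArgumentException e) {
      System.out.println("rejected: " + e.getMessage());
    }
  }
}
```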

jpountz · 2024-06-08T09:10:38Z

lucene/core/src/java/org/apache/lucene/search/TopDocs.java

+
+  /** Reciprocal Rank Fusion method. */
+  public static TopDocs rrf(int TopN, int k, TopDocs[] hits) {
+    Map<Integer, Float> rrfScore = new HashMap<>();


Note that we should identify documents not only by their doc ID, but by the combination of ScoreDoc.shardIndex and ScoreDoc.doc, as there could be multiple documents that have the same doc ID but come from different shards.
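To illustrate the point, here is a small sketch that keys hits by the pair (shardIndex, doc) rather than by doc alone. The `ShardDoc` record is a hypothetical stand-in written for this example; it does not use Lucene's ScoreDoc class.

```java
import java.util.HashMap;
import java.util.Map;

final class ShardDocKeyDemo {
  /** Hypothetical composite key: a hit is identified by (shardIndex, doc). */
  record ShardDoc(int shardIndex, int doc) {}

  /** Counts distinct hits, each input row being {shardIndex, doc}. */
  static int distinctHits(int[][] shardAndDoc) {
    Map<ShardDoc, Integer> seen = new HashMap<>();
    for (int[] pair : shardAndDoc) {
      // Records provide equals/hashCode over both components.
      seen.merge(new ShardDoc(pair[0], pair[1]), 1, Integer::sum);
    }
    return seen.size();
  }

  public static void main(String[] args) {
    // doc 5 on shard 0 and doc 5 on shard 1 are different documents.
    System.out.println(distinctHits(new int[][] {{0, 5}, {1, 5}, {0, 5}})); // prints 2
  }
}
```

With an `Integer` key on the doc ID alone, the three rows above would collapse into a single entry.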

jpountz · 2024-06-08T09:11:47Z

lucene/core/src/java/org/apache/lucene/search/TopDocs.java

+    Map<Integer, Float> rrfScore = new HashMap<>();
+    long minHits = Long.MAX_VALUE;
+    for (TopDocs topDoc : hits) {
+      minHits = Math.min(minHits, topDoc.totalHits.value);


I wonder if it should be a max rather than a min? Presumably, hits from either top hits are considered as hits globally, and we are just using RRF to boost hits that are ranked high in multiple top hits?

The totalHits was the tricky part: we didn't know what value to assign to it. IIUC, totalHits is the number of documents matched by a query, and we can't really compute the size of the union of the matches across all the TopDocs. So I assigned the min totalHits temporarily, to be consistent with the "greater than or equal to" totalHits relation. I'd like to ask for your opinion on this.

I agree with using GREATER_THAN_OR_EQUAL_TO all the time, but I would still take the max, because a document is a match if it is a match to either query. For instance, imagine combining top hits from two queries where one query didn't match any docs: the total hit count should be the hit count of the other query, not zero.

Makes sense to me, thank you for the detailed explanation!
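The argument for max over min can be checked with a few lines. This is a sketch on plain `long` counts; the class and method names are made up for the example, and the result is meant to be read as a lower bound (the GREATER_THAN_OR_EQUAL_TO relation mentioned above).

```java
final class TotalHitsSketch {
  private TotalHitsSketch() {}

  /** Lower bound on the fused hit count: the largest per-query hit count. */
  static long fusedTotalHits(long[] perQueryTotalHits) {
    long max = 0;
    for (long hits : perQueryTotalHits) {
      max = Math.max(max, hits);
    }
    return max;
  }

  public static void main(String[] args) {
    // One query matched nothing, the other matched 42 documents:
    // the union has at least 42 matches, whereas a min would report 0.
    System.out.println(fusedTotalHits(new long[] {0, 42})); // prints 42
  }
}
```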

jpountz · 2024-06-08T09:14:10Z

lucene/core/src/java/org/apache/lucene/search/TopDocs.java

+      }
+
+      List<Map.Entry<Integer, Float>> scoreList = new ArrayList<>(scoreMap.entrySet());
+      scoreList.sort(Map.Entry.comparingByValue());


We don't seem to be using these scoreMap and scoreList anywhere?

Oops! My bad. I think we got something wrong here. The for loop traversal for (ScoreDoc scoreDoc : topDoc.scoreDocs) is wrong; we should actually traverse the sorted results, i.e., scoreList, to add the ranking result to rrfScore.

    int rank = 1;
    for (Map.Entry<Integer, Float> entry : scoreList) {
      // 1.0f avoids integer division, which would always yield 0 here
      rrfScore.put(entry.getKey(), rrfScore.getOrDefault(entry.getKey(), 0.0f) + 1.0f / (rank + k));
      rank++;
    }

P.S. For this part, however, I think we should determine the implementation of combining ScoreDoc.doc and ScoreDoc.shardIndex together first.

jpountz · 2024-06-08T09:15:26Z

lucene/core/src/java/org/apache/lucene/search/TopDocs.java

  }
+
+  /** Reciprocal Rank Fusion method. */
+  public static TopDocs rrf(int TopN, int k, TopDocs[] hits) {


nit: function arguments should use lower camel case

Suggested change

-  public static TopDocs rrf(int TopN, int k, TopDocs[] hits) {
+  public static TopDocs rrf(int topN, int k, TopDocs[] hits) {

Oops, my bad.

jpountz · 2024-06-08T09:16:25Z

lucene/core/src/java/org/apache/lucene/search/TopDocs.java

+
+      int rank = 1;
+      for (ScoreDoc scoreDoc : topDoc.scoreDocs) {
+        rrfScore.put(scoreDoc.doc, rrfScore.getOrDefault(scoreDoc.doc, 0.0f) + 1.0f / (rank + k));


Use Map#compute instead of getOrDefault + put?

Sounds good! I didn't know about Map#compute before, though. It should look like this:

    int rank = 1;
    for (Map.Entry<Integer, Float> entry : scoreList) {
      // lambdas may only capture effectively final locals, so copy the value first
      final float contribution = 1.0f / (rank + k);
      rrfScore.compute(entry.getKey(), (key, value) -> (value == null ? 0.0f : value) + contribution);
      rank++;
    }
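For comparison, a self-contained sketch of both accumulation styles on plain integer doc IDs. The class and method names are illustrative only; `Map#getOrDefault`, `Map#put` and `Map#compute` are the standard `java.util.Map` methods.

```java
import java.util.HashMap;
import java.util.Map;

final class ComputeVsPut {
  private ComputeVsPut() {}

  /** Two map operations per hit: a lookup followed by a put. */
  static Map<Integer, Float> withGetOrDefault(int[] docs, int k) {
    Map<Integer, Float> scores = new HashMap<>();
    int rank = 1;
    for (int doc : docs) {
      scores.put(doc, scores.getOrDefault(doc, 0.0f) + 1.0f / (rank + k));
      rank++;
    }
    return scores;
  }

  /** One map operation per hit: compute sees null for a missing key. */
  static Map<Integer, Float> withCompute(int[] docs, int k) {
    Map<Integer, Float> scores = new HashMap<>();
    int rank = 1;
    for (int doc : docs) {
      final float contribution = 1.0f / (rank + k); // effectively final copy for the lambda
      scores.compute(doc, (key, value) -> (value == null ? 0.0f : value) + contribution);
      rank++;
    }
    return scores;
  }

  public static void main(String[] args) {
    int[] docs = {7, 3, 7};
    System.out.println(withGetOrDefault(docs, 60).equals(withCompute(docs, 60))); // prints true
  }
}
```

`Map#merge(key, contribution, Float::sum)` would be a third way to write the same update.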

jpountz · 2024-06-08T09:18:17Z

lucene/core/src/java/org/apache/lucene/search/TopDocs.java

+
+    ScoreDoc[] rrfScoreDocs = new ScoreDoc[Math.min(TopN, rrfScoreRank.size())];
+    for (int i = 0; i < rrfScoreDocs.length; i++) {
+      rrfScoreDocs[i] = new ScoreDoc(rrfScoreRank.get(i).getKey(), rrfScoreRank.get(i).getValue());


Nit: we should preserve the original shardIndex that is configured on the ScoreDoc object that identifies the shard that it is coming from.

So this was also a tricky part for us. In my understanding, RRF combines search results based on the ranks of a document in the different result lists. We are supposed to combine the ranks for each individual document, but should a document that comes from different shards be treated as different documents?

This is correct. When working with shards, hits should first be combined on a per-query basis using TopDocs#merge, which will set the shardIndex field. And then global top hits can be merged across queries with RRF.

If we do the rrf with the unique key being the combination of doc ID and shardIndex, what would be the difference between TopDocs#rrf and TopDocs#merge? I think an example would express this better. Suppose that we have two shards, and we want to retrieve the top 3 results from each shard and do rrf on top of them. There are three documents A, B and C. In Shard1, the top 3 is A -> B -> C. In Shard2, it's B -> C -> A. The original rrf method would compute the scores by aggregating on the doc ID. Assuming the constant k is 1, the top 3 results would be B (1/(k+2) + 1/(k+1)), then A (1/(k+1) + 1/(k+3)), then C (1/(k+3) + 1/(k+2)).
If we are going to consider the shardIndex as part of the unique key as well, how should the rrf ranking be computed?

When working with shards, shards manage non-overlapping subsets of the data, so you could not have documents A, B and C in both shards.
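The arithmetic in the A/B/C example still holds if the two rankings are read as the top hits of two different queries over the same index, rather than of two shards. A minimal sketch under that reading, using string IDs instead of Lucene types (all names here are invented for the example):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class RrfWorkedExample {
  private RrfWorkedExample() {}

  /** Sums 1 / (k + rank) over every ranking a document appears in (rank is 1-based). */
  static Map<String, Float> scores(List<List<String>> rankings, int k) {
    Map<String, Float> scores = new HashMap<>();
    for (List<String> ranking : rankings) {
      int rank = 1;
      for (String id : ranking) {
        scores.merge(id, 1.0f / (k + rank), Float::sum);
        rank++;
      }
    }
    return scores;
  }

  /** Returns document IDs ordered by descending fused score. */
  static List<String> fuse(List<List<String>> rankings, int k) {
    Map<String, Float> scores = scores(rankings, k);
    List<String> ids = new ArrayList<>(scores.keySet());
    ids.sort((a, b) -> Float.compare(scores.get(b), scores.get(a)));
    return ids;
  }

  public static void main(String[] args) {
    List<List<String>> rankings = List.of(List.of("A", "B", "C"), List.of("B", "C", "A"));
    // With k = 1: B = 1/3 + 1/2, A = 1/2 + 1/4, C = 1/4 + 1/3.
    System.out.println(fuse(rankings, 1)); // prints [B, A, C]
  }
}
```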

alessandrobenedetti · 2024-06-09T10:43:01Z

I'm not sure 'rrf' should be a direct method in TopDocs:
Reciprocal Rank Fusion is just one way of combining result sets. If in the future we want to add other algorithms, having 'rrf' there may encourage people to just keep adding methods to TopDocs.
What about having a "combine" method there, taking the combining strategy as input?
We could then abstract the combining strategy as an interface/abstract class and implement Reciprocal Rank Fusion as the first available strategy. That should ease the process of adding more strategies and prevent TopDocs from becoming too cluttered in the future.

N.B. I am generally in favour of the "You Aren't Gonna Need It" approach, but Lucene has many contributors and future contributors who may get involved, and this abstraction work may not happen when and if "a second strategy" gets implemented.
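A rough sketch of what such a design could look like, on string IDs rather than TopDocs. Everything here (`CombineStrategy`, `ReciprocalRankFusion`, `combine`) is hypothetical and only illustrates the proposal; none of it exists in Lucene.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class CombineSketch {
  private CombineSketch() {}

  /** Hypothetical strategy interface: fuses several rankings into one top-N list. */
  interface CombineStrategy {
    List<String> combine(List<List<String>> rankings, int topN);
  }

  /** First (and so far only) strategy: Reciprocal Rank Fusion with constant k. */
  static final class ReciprocalRankFusion implements CombineStrategy {
    private final int k;

    ReciprocalRankFusion(int k) {
      this.k = k;
    }

    @Override
    public List<String> combine(List<List<String>> rankings, int topN) {
      Map<String, Float> scores = new HashMap<>();
      for (List<String> ranking : rankings) {
        for (int i = 0; i < ranking.size(); i++) {
          scores.merge(ranking.get(i), 1.0f / (k + i + 1), Float::sum); // rank = i + 1
        }
      }
      List<String> ids = new ArrayList<>(scores.keySet());
      ids.sort((a, b) -> Float.compare(scores.get(b), scores.get(a)));
      return ids.subList(0, Math.min(topN, ids.size()));
    }
  }

  /** Single entry point; callers pick the strategy. */
  static List<String> combine(List<List<String>> rankings, int topN, CombineStrategy strategy) {
    return strategy.combine(rankings, topN);
  }

  public static void main(String[] args) {
    List<List<String>> rankings = List.of(List.of("A", "B", "C"), List.of("B", "C", "A"));
    System.out.println(combine(rankings, 2, new ReciprocalRankFusion(1))); // prints [B, A]
  }
}
```

The trade-off raised in this comment is visible here: the entry point stays stable as strategies are added, at the cost of an extra type that has only one implementation for now.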

jpountz · 2024-06-09T12:23:06Z

I'm not worried about this. If we feel like we should expose it differently in the future, we'll do it, deprecate this function, and remove it in Lucene 11.

- Parameters are now validated.
- The shardIndex is now taken into account to identify hits.
- The total hit count is computed as the max total hit count.
- Unit tests.
- Tie break on doc and shardIndex, consistently with TopDocs#merge.
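The tie-breaking item can be pictured with a comparator like the one below. The `Hit` record is a stand-in for this example, and the precedence shown (score descending, then shardIndex, then doc) is an assumption; the list above only says that both fields take part in the tie break.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class TieBreakSketch {
  private TieBreakSketch() {}

  /** Hypothetical fused hit. */
  record Hit(int shardIndex, int doc, float score) {}

  /** Higher score first; equal scores fall back to shardIndex, then doc (assumed order). */
  static final Comparator<Hit> ORDER = (a, b) -> {
    int c = Float.compare(b.score(), a.score());
    if (c != 0) {
      return c;
    }
    c = Integer.compare(a.shardIndex(), b.shardIndex());
    if (c != 0) {
      return c;
    }
    return Integer.compare(a.doc(), b.doc());
  };

  /** Returns a sorted copy, leaving the input untouched. */
  static List<Hit> sorted(List<Hit> hits) {
    List<Hit> copy = new ArrayList<>(hits);
    copy.sort(ORDER);
    return copy;
  }

  public static void main(String[] args) {
    List<Hit> hits = List.of(new Hit(1, 3, 0.5f), new Hit(0, 7, 0.5f), new Hit(0, 2, 0.5f), new Hit(1, 0, 0.9f));
    sorted(hits).forEach(System.out::println);
  }
}
```

A deterministic tie break matters for RRF in particular, since documents that appear at the same ranks in different lists get exactly equal scores.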

jpountz · 2025-02-14T15:35:06Z

@harenlin I took some freedom to apply my feedback and push it to your branch. Would you like to take a look and check if it makes sense?

jpountz · 2025-02-21T13:35:43Z

I plan on merging this PR soon if there are no objections.

javanna · 2025-02-24T08:55:56Z

This looks good to me. Perhaps we could mark the new static method experimental, especially if we think we are going to want to support more ways of combining topdocs soon enough. I don't have a strong opinion though, it would also be ok to introduce a more flexible way to do rrf while keeping this one around until the next major.

jpountz · 2025-02-24T08:59:07Z

Thanks for taking a look. I have a bias for the latter, as I was planning on improving the docs of the oal.search package as a follow-up to provide guidance wrt how to do hybrid search by linking to this RRF helper.

javanna · 2025-02-24T09:00:20Z

lucene/core/src/java/org/apache/lucene/search/TopDocs.java

+    for (TopDocs topDocs : hits) {
+      for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
+        shardIndexSet = scoreDoc.shardIndex != -1;
+        break outer;


Is the purpose here to only check the first scoreDoc of every TopDocs instance provided in the array? Should we try to rewrite this to be more readable and not use goto?

Does it look better now?

yes, thank you!
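One way to get rid of a labeled break in a loop like the one quoted above is to move the check into a helper that returns early. The sketch below works on plain arrays of shardIndex values and is only one possible rewrite, not necessarily the one that was pushed to the PR; the -1 "unset" marker is taken from the quoted snippet.

```java
final class ShardIndexCheck {
  private ShardIndexCheck() {}

  /**
   * Each inner array holds the shardIndex of every hit of one result list.
   * Returns true if the first hit found has a shardIndex other than -1,
   * and false if there are no hits at all.
   */
  static boolean firstHitHasShardIndex(int[][] shardIndexesPerResult) {
    for (int[] shardIndexes : shardIndexesPerResult) {
      if (shardIndexes.length > 0) {
        return shardIndexes[0] != -1; // early return replaces "break outer"
      }
    }
    return false;
  }

  public static void main(String[] args) {
    System.out.println(firstHitHasShardIndex(new int[][] {{}, {3, 3}})); // prints true
    System.out.println(firstHitHasShardIndex(new int[][] {{}, {-1, -1}})); // prints false
  }
}
```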

Co-authored-by: SuperSonicVox <hackchang0715@gmail.com> Co-authored-by: Adrien Grand <jpountz@gmail.com>

jpountz · 2025-02-27T21:07:49Z

I have a bias for the latter, as I was planning on improving the docs of the oal.search package as a follow-up to provide guidance wrt how to do hybrid search by linking to this RRF helper.

I opened #14310.

harenlin and others added 7 commits May 24, 2024 10:55

add function rrf signature

c6da395

update TopDocs

9a7b57b

fix TopDocs error

57327aa

Float -> float in TopDocs.java RRF

ce55148

fix bug in rrf

4a6e10f

Merge branch 'apache:main' into TopDocs-RRF

ccb940f

style: gradlew tidy

8e21bbc

jpountz reviewed Jun 8, 2024

feat: add javadocs on rrf method

c018e50

hack4chang and others added 3 commits June 12, 2024 09:40

fix: gradlew format

9579816

Merge branch 'main' into TopDocs-RRF

8d8f760

Fixes + tests.

68b041c

github-actions bot added the module:core/search label Feb 14, 2025

jpountz marked this pull request as ready for review February 21, 2025 13:31

javanna reviewed Feb 24, 2025

jpountz added 3 commits February 26, 2025 21:46

Merge branch 'main' into TopDocs-RRF

99c616c

Remove usage of break + label

3164006

CHANGES

67cc61f

javanna approved these changes Feb 27, 2025

jpountz added this to the 10.2.0 milestone Feb 27, 2025

Merge branch 'main' into TopDocs-RRF

2af86e3

jpountz merged commit 1ae2655 into apache:main Feb 27, 2025
6 checks passed

github-project-automation bot moved this from Open to Merged in OpenSearch Lucene & Core Performance Tracking Feb 27, 2025

jpountz added a commit that referenced this pull request Feb 27, 2025

Reciprocal Rank Fusion (RRF) in TopDocs (#13470)

cf22ec4

Co-authored-by: SuperSonicVox <hackchang0715@gmail.com> Co-authored-by: Adrien Grand <jpountz@gmail.com>

hanbj pushed a commit to hanbj/lucene that referenced this pull request Mar 12, 2025

Reciprocal Rank Fusion (RRF) in TopDocs (apache#13470)

73db703

Co-authored-by: SuperSonicVox <hackchang0715@gmail.com> Co-authored-by: Adrien Grand <jpountz@gmail.com>


5 participants
