Remove DocsWithFieldSet reference from NativeEngineFieldVectorsWriter #2408

Merged
merged 4 commits into from
Jan 23, 2025

weiwang118
Contributor

Description

This change resolves issue #2207.

With the upgrade to Lucene 9.12, Lucene started exposing FlatVectorsFormat as a KnnVectorsFormat. As a result, FlatFieldVectorsWriter now exposes the DocsWithFieldSet and the vectors that are added to it during flush.

NativeEngineFieldVectorsWriter and FlatFieldVectorsWriter now both store references to the same vectors and docIds, which is unnecessary. We could remove the vector and docId references from NativeEngineFieldVectorsWriter entirely and use only FlatFieldVectorsWriter during flush. This would simplify the code and free some heap.

After further investigation, I found that although FlatFieldVectorsWriter exposes both the vectors and the DocsWithFieldSet, its vectors are a List, whereas the vectors in NativeEngineFieldVectorsWriter are a Map<Integer, T>. The existing code comment explains why:

We are using a map here instead of a list because, for the sampler interface used in quantization, we have to advance the iterator
to a specific docId. A list is not useful there, because a docId != the index of the vector in the list. The same
is true when the vector field is in a child document: there the doc IDs will not be consistent. Hence, we need to
use the map here.
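
The point about docId != list index can be illustrated with a small sketch. It assumes nothing about the plugin's real classes: `DocIdLookupSketch`, `inInsertionOrder`, and `indexByDocId` are hypothetical names, and plain `float[]` arrays stand in for the vector type T.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; these are not the k-NN plugin's classes.
class DocIdLookupSketch {

    // A list keeps vectors in insertion order, so its index is an ordinal, not a docId.
    static List<float[]> inInsertionOrder(float[][] vectors) {
        List<float[]> list = new ArrayList<>();
        for (float[] vector : vectors) {
            list.add(vector);
        }
        return list;
    }

    // A map keyed by docId allows a direct jump to a specific document's vector.
    static Map<Integer, float[]> indexByDocId(int[] docIds, float[][] vectors) {
        Map<Integer, float[]> map = new LinkedHashMap<>();
        for (int i = 0; i < docIds.length; i++) {
            map.put(docIds[i], vectors[i]);
        }
        return map;
    }

    public static void main(String[] args) {
        // Docs 1, 2, 4, 5 and 6 have no vector (for example, they are parent documents).
        int[] docIds = {0, 3, 7};
        float[][] vectors = {{1f, 1f}, {2f, 2f}, {3f, 3f}};

        List<float[]> ordered = inInsertionOrder(vectors);
        Map<Integer, float[]> byDocId = indexByDocId(docIds, vectors);

        // The list has only three slots, so docId 3 cannot be used as an index.
        System.out.println("list size: " + ordered.size());
        // The map resolves docId 3 to the second vector that was added.
        System.out.println("docId 3 first component: " + byDocId.get(3)[0]);
    }
}
```

With sparse doc IDs such as 0, 3, 7, the list index of a vector is its insertion ordinal (0, 1, 2), so advancing to docId 3 by index would be out of bounds, while the map lookup succeeds.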

So in this PR, I only remove the DocsWithFieldSet reference from NativeEngineFieldVectorsWriter and keep the original vectors variable.
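
A rough sketch of the resulting shape, using stand-in classes rather than the real Lucene or k-NN types. `FlatWriterSketch` and `NativeWriterSketch` are hypothetical, a `java.util.BitSet` plays the role of DocsWithFieldSet, and forwarding each value to the flat writer inside `addValue` is an assumption of this sketch, not something the PR description states.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for FlatFieldVectorsWriter; a BitSet stands in for DocsWithFieldSet.
class FlatWriterSketch {
    private final List<float[]> vectors = new ArrayList<>();
    private final BitSet docsWithField = new BitSet();

    void addValue(int docId, float[] vector) {
        vectors.add(vector);
        docsWithField.set(docId);
    }

    BitSet getDocsWithFieldSet() {
        return docsWithField;
    }

    List<float[]> getVectors() {
        return vectors;
    }
}

// Hypothetical stand-in for NativeEngineFieldVectorsWriter after the change.
class NativeWriterSketch {
    private final FlatWriterSketch flatWriter;
    // The docId-keyed map is kept; no separate set of doc IDs is tracked here.
    private final Map<Integer, float[]> vectors = new HashMap<>();

    NativeWriterSketch(FlatWriterSketch flatWriter) {
        this.flatWriter = flatWriter;
    }

    void addValue(int docId, float[] vector) {
        vectors.put(docId, vector);
        // Assumption of this sketch: the flat writer receives every value as well.
        flatWriter.addValue(docId, vector);
    }

    // Delegates to the flat writer instead of returning a second copy.
    BitSet getDocsWithFieldSet() {
        return flatWriter.getDocsWithFieldSet();
    }

    Map<Integer, float[]> getVectors() {
        return vectors;
    }
}
```

In this arrangement only one doc ID set exists per field, and the native writer reads it through the flat writer whenever it is needed.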

Related Issues

Resolves #2207

Check List

  • New functionality includes testing.
  • New functionality has been documented.
  • API changes companion pull request created.
  • Commits are signed per the DCO using --signoff.
  • Public documentation issue/PR created.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@weiwang118 weiwang118 force-pushed the issue-2207 branch 2 times, most recently from 0c8986c to d8342af Compare January 21, 2025 14:53
@navneet1v
Collaborator

@weiwang118 please fix the changelog conflicts. Overall code looks good to me.

@navneet1v
Collaborator

@weiwang118 this is a good start. We should look into how we can remove the limitation on vectors too. We can keep the scope of this PR limited for now.

@shatejas
Collaborator

The major benefit comes from reusing the vectors from FlatFieldVectorsWriter, which will significantly reduce JVM heap usage during indexing. I am fine with tackling it iteratively, as @navneet1v suggested.

Collaborator

@shatejas shatejas left a comment

Looks good

@navneet1v navneet1v merged commit d58d133 into opensearch-project:main Jan 23, 2025
34 checks passed
@opensearch-trigger-bot
Contributor

The backport to 2.x failed:

The process '/usr/bin/git' failed with exit code 1

To backport manually, run these commands in your terminal:

# Fetch latest updates from GitHub
git fetch
# Create a new working tree
git worktree add .worktrees/backport-2.x 2.x
# Navigate to the new working tree
cd .worktrees/backport-2.x
# Create a new branch
git switch --create backport/backport-2408-to-2.x
# Cherry-pick the merged commit of this pull request and resolve the conflicts
git cherry-pick -x --mainline 1 d58d133c6edb9dfc48b5c3e507cdc21dbf0477ad
# Push it to GitHub
git push --set-upstream origin backport/backport-2408-to-2.x
# Go back to the original working tree
cd ../..
# Delete the working tree
git worktree remove .worktrees/backport-2.x

Then, create a pull request where the base branch is 2.x and the compare/head branch is backport/backport-2408-to-2.x.

@navneet1v
Collaborator

@weiwang118 please do a manual backport

weiwang118 added a commit to weiwang118/k-NN that referenced this pull request Jan 23, 2025
…opensearch-project#2408)

* Remove DocsWithFieldSet reference from NativeEngineFieldVectorsWriter

Signed-off-by: Wei Wang <[email protected]>

* fix typo error in test file

Signed-off-by: Wei Wang <[email protected]>

---------

Signed-off-by: Wei Wang <[email protected]>
Signed-off-by: Wei Wang <[email protected]>
(cherry picked from commit d58d133)
ryanbogan pushed a commit that referenced this pull request Jan 23, 2025
…#2408) (#2426)

* Remove DocsWithFieldSet reference from NativeEngineFieldVectorsWriter

Signed-off-by: Wei Wang <[email protected]>

* fix typo error in test file

Signed-off-by: Wei Wang <[email protected]>

---------

Signed-off-by: Wei Wang <[email protected]>
Signed-off-by: Wei Wang <[email protected]>
(cherry picked from commit d58d133)
Development

Successfully merging this pull request may close these issues.

[Enhancement] Remove multiple vectors references during flush
3 participants