Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: optimize specific indices #2192

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

wjones127
Copy link
Contributor

@wjones127 wjones127 commented Apr 12, 2024

  • Allows optimizing a specific set of indices
  • Allows creating a delta index for a specific set of fragments
  • Eliminates the index metadata caching in Python, since it had no invalidation.

Copy link

ACTION NEEDED

Lance follows the Conventional Commits specification for release automation.

The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification.

For details on the error please inspect the "PR Title Check" action.

@wjones127
Copy link
Contributor Author

Will need to rebase on #2132, so waiting on that to merge before this is ready for review.

feat: implement new parameters

refactor: move scalar index optimize into a different file

expose options in Pyhton

test in python
@wjones127 wjones127 changed the title wip: allow more control in creating delta indices feat: optimize specific indices Apr 27, 2024
Comment on lines 199 to +200
def list_indices(self) -> List[Dict[str, Any]]:
if getattr(self, "_list_indices_res", None) is None:
self._list_indices_res = self._ds.load_indices()
return self._list_indices_res
return self._ds.load_indices()
Copy link
Contributor Author

@wjones127 wjones127 Apr 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removing this cache. There was no invalidation so it is often wrong.

@codecov-commenter
Copy link

Codecov Report

Attention: Patch coverage is 92.83747% with 26 lines in your changes are missing coverage. Please review.

Project coverage is 80.99%. Comparing base (dbfb640) to head (356a179).
Report is 4 commits behind head on main.

Files Patch % Lines
rust/lance/src/index/scalar.rs 85.45% 3 Missing and 5 partials ⚠️
rust/lance/src/index.rs 96.86% 0 Missing and 7 partials ⚠️
rust/lance/src/index/append.rs 91.22% 1 Missing and 4 partials ⚠️
rust/lance-index/src/optimize.rs 40.00% 2 Missing and 1 partial ⚠️
rust/lance/src/index/vector/ivf.rs 84.21% 1 Missing and 2 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2192      +/-   ##
==========================================
- Coverage   81.23%   80.99%   -0.24%     
==========================================
  Files         187      190       +3     
  Lines       54698    55797    +1099     
  Branches    54698    55797    +1099     
==========================================
+ Hits        44434    45195     +761     
- Misses       7771     8081     +310     
- Partials     2493     2521      +28     
Flag Coverage Δ
unittests 80.99% <92.83%> (-0.24%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@wjones127 wjones127 marked this pull request as ready for review April 27, 2024 02:47
Comment on lines +2286 to +2288
self,
merge_indices: bool | int | List[str] = 1,
index_new_data: bool | List[int] = True,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we do self, *, merge_indices, index_new_data?

Comment on lines +2305 to +2309
merge_indices: bool | int | List[str]
If True, all indices will be merged. If False, no indices will be
merged and instead a new index delta will be created. If an integer,
that number of indices will be merged. If a list of UUID strings,
those specific indices will be merged.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I got a little confused here because I didn't know if you were talking about "pass in UUIDs if you only want to update some columns (e.g. multiple vector embeddings, each with an index, and only update a few)" or if you were talking about "pass in UUIDs here if you have multiple deltas in the same column to merge together"

@@ -2298,8 +2299,34 @@ def optimize_indices(self, **kwargs):
the new data to existing partitions. This means an update is much quicker
than retraining the entire index but may have less accuracy (especially
if the new data exhibits new patterns, concepts, or trends)
"""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we update some of this wording so that we also explain what "merging" is and why you would want to do it?

options.new_data_handling = NewDataHandling::Fragments(index_new_data_ids);
} else {
return Err(PyValueError::new_err(
"index_new_data must be a boolean value.",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or a list of fragment ids?

// Note: since we put other delta indexes into data stream, this
// is also the case where we merge indices.
(Some(existing_index), Some(new_data_stream)) => {
// TODO: how can I downcast `existing_index` and use that?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think I ever did find a good way to solve this problem.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants