-
Notifications
You must be signed in to change notification settings - Fork 517
feat: allow creating empty scalar indices #4033
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: allow creating empty scalar indices #4033
Conversation
|
ACTION NEEDED The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification. For details on the error please inspect the "PR Title Check" action. |
e302f13 to
84b3c6b
Compare
03c0b28 to
70861c4
Compare
Adds train parameter to create_index to allow creating empty indices without immediate training. When train=False, creates index with empty fragment_bitmap storing only configuration. Changes index retention behavior to keep indices even when all data is deleted or overwritten, unless user explicitly drops them. Closes lance-format#3940
70861c4 to
498551d
Compare
FTS indices must always be returned from load_scalar_index even when their effective fragment bitmap is empty, because FTS queries require an index to exist to provide term frequencies for scoring. The query execution layer handles empty bitmaps appropriately by falling back to scanning unindexed data. Other index types (btree, bitmap) can be skipped when empty since they're optional query optimizations. 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <[email protected]>
- Made execute() method private - Implemented IntoFuture trait using BoxFuture for simplicity - Updated all tests to use builder.await instead of builder.execute().await - Added rustdoc examples showing the new API usage The builder can now be used directly with await: ```rust dataset.create_index_builder(...).await?; ``` 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <[email protected]>
- Changed vector index error from Error::Index to Error::NotSupported - Updated Python bindings to use .infer_error() for proper exception mapping - Updated test to expect NotImplementedError instead of OSError All Python tests now pass with proper exception types. 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <[email protected]>
- Fixed clippy warnings about string_to_string - Fixed Python line length issue in test - Ran cargo fmt and ruff format All linting checks now pass. 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <[email protected]>
…havior The test was expecting indices to be deleted when all fragments are removed, but our new retention behavior keeps indices even when their data is deleted. This allows index configurations to persist through data changes. Updated the test to: - Expect 1 index instead of 0 after deleting all data - Verify the index has an empty effective fragment bitmap - Confirm the new retention behavior is working correctly 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <[email protected]>
The test_merge_insert_updates_indices was failing because it expected fragment bitmaps to be updated when fragments were removed. With our new immutable bitmap behavior: - Index fragment bitmaps are never modified after creation - effective_fragment_bitmap() computes intersection with current fragments - Tests now check effective bitmaps instead of raw bitmaps Updated all three index checks (id, value, other_value) to use effective bitmaps, which correctly represent the valid fragments for each index given the current dataset state. 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <[email protected]>
- Fixed clippy warnings about string_to_string - Fixed Python line length issue in test - Ran cargo fmt and ruff format All linting checks now pass. 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <[email protected]>
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## main #4033 +/- ##
==========================================
+ Coverage 81.88% 81.97% +0.08%
==========================================
Files 302 303 +1
Lines 123146 123444 +298
Branches 123146 123444 +298
==========================================
+ Hits 100836 101188 +352
+ Misses 18507 18463 -44
+ Partials 3803 3793 -10
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
| params, | ||
| name: None, | ||
| replace: false, | ||
| train: true, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we auto-set train to false if there is no data in the dataset?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah that sounds reasonable. I will try that.
Instead of setting train parameter in the constructor (which failed due to async calls in non-async context), implemented the logic inside execute(): - When train=true is requested, check if dataset.count_rows() > 0 - If dataset is empty, automatically set train=false - If dataset has data, proceed with train=true as requested - If train=false was explicitly requested, respect that choice This allows users to always use train=true and have the system automatically handle empty datasets appropriately. 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <[email protected]>
jackye1995
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me!
Closes #3940
Adds a new
trainparameter tocreate_index.train=False, the index will be created empty, with an emptyfragment_bitmap. It just exists to store the index configuration.CreateIndexBuilderto add this parameter in Rust. This allowed us to avoid having to break the current Rust API.Changed behavior of index retention: now when all data in an index has been deleted or overwritten, we may keep the index file still if there are no other files under that name. This means indexes are not dropped until user explicitly asks they are.
Example
dataset .create_index_builder(&["category"], IndexType::Bitmap, ¶ms) .train(false) .await?;