Skip to content

Conversation

@wjones127
Copy link
Contributor

@wjones127 wjones127 commented Jun 18, 2025

Closes #3940

  • Adds a new train parameter to create_index.

    • When train=False, the index will be created empty, with an empty fragment_bitmap. It just exists to store the index configuration.
    • This is not yet supported for vector indices, as that will be done in a follow up. Top-level indices part 2: empty vector indices #4034
    • Added a new CreateIndexBuilder to add this parameter in Rust. This allowed us to avoid having to break the current Rust API.
  • Changed behavior of index retention: now when all data in an index has been deleted or overwritten, we may keep the index file still if there are no other files under that name. This means indexes are not dropped until user explicitly asks they are.

Example

 dataset
      .create_index_builder(&["category"], IndexType::Bitmap, &params)
      .train(false)
      .await?;

@github-actions
Copy link
Contributor

ACTION NEEDED
Lance follows the Conventional Commits specification for release automation.

The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification.

For details on the error please inspect the "PR Title Check" action.

@wjones127 wjones127 changed the title create builder API feat: allow creating empty indices Jun 18, 2025
@github-actions github-actions bot added the enhancement New feature or request label Jun 18, 2025
@wjones127 wjones127 changed the title feat: allow creating empty indices feat: allow creating empty scalar indices Jun 18, 2025
@github-actions github-actions bot added the java label Jun 19, 2025
@wjones127 wjones127 force-pushed the feat/create-empty-indexes branch from e302f13 to 84b3c6b Compare June 19, 2025 16:03
@wjones127 wjones127 force-pushed the feat/create-empty-indexes branch from 03c0b28 to 70861c4 Compare August 11, 2025 21:18
Adds train parameter to create_index to allow creating empty indices without
immediate training. When train=False, creates index with empty fragment_bitmap
storing only configuration.

Changes index retention behavior to keep indices even when all data is deleted
or overwritten, unless user explicitly drops them.

Closes lance-format#3940
@wjones127 wjones127 force-pushed the feat/create-empty-indexes branch from 70861c4 to 498551d Compare August 11, 2025 21:48
wjones127 and others added 9 commits August 11, 2025 17:12
FTS indices must always be returned from load_scalar_index even when
their effective fragment bitmap is empty, because FTS queries require
an index to exist to provide term frequencies for scoring. The query
execution layer handles empty bitmaps appropriately by falling back to
scanning unindexed data.

Other index types (btree, bitmap) can be skipped when empty since
they're optional query optimizations.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
- Made execute() method private
- Implemented IntoFuture trait using BoxFuture for simplicity
- Updated all tests to use builder.await instead of builder.execute().await
- Added rustdoc examples showing the new API usage

The builder can now be used directly with await:
```rust
dataset.create_index_builder(...).await?;
```

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
- Changed vector index error from Error::Index to Error::NotSupported
- Updated Python bindings to use .infer_error() for proper exception mapping
- Updated test to expect NotImplementedError instead of OSError

All Python tests now pass with proper exception types.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
- Fixed clippy warnings about string_to_string
- Fixed Python line length issue in test
- Ran cargo fmt and ruff format

All linting checks now pass.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
…havior

The test was expecting indices to be deleted when all fragments are removed,
but our new retention behavior keeps indices even when their data is deleted.
This allows index configurations to persist through data changes.

Updated the test to:
- Expect 1 index instead of 0 after deleting all data
- Verify the index has an empty effective fragment bitmap
- Confirm the new retention behavior is working correctly

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
The test_merge_insert_updates_indices was failing because it expected
fragment bitmaps to be updated when fragments were removed. With our
new immutable bitmap behavior:

- Index fragment bitmaps are never modified after creation
- effective_fragment_bitmap() computes intersection with current fragments
- Tests now check effective bitmaps instead of raw bitmaps

Updated all three index checks (id, value, other_value) to use
effective bitmaps, which correctly represent the valid fragments
for each index given the current dataset state.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
- Fixed clippy warnings about string_to_string
- Fixed Python line length issue in test
- Ran cargo fmt and ruff format

All linting checks now pass.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
@wjones127 wjones127 marked this pull request as ready for review August 12, 2025 16:39
@codecov-commenter
Copy link

codecov-commenter commented Aug 12, 2025

Codecov Report

❌ Patch coverage is 85.43307% with 37 lines in your changes missing coverage. Please review.
✅ Project coverage is 81.97%. Comparing base (7f5f792) to head (0e1b77a).
⚠️ Report is 8 commits behind head on main.

Files with missing lines Patch % Lines
rust/lance/src/index/create.rs 73.33% 12 Missing ⚠️
rust/lance/src/dataset/transaction.rs 82.75% 9 Missing and 1 partial ⚠️
rust/lance/src/index/scalar.rs 75.75% 7 Missing and 1 partial ⚠️
rust/lance/src/index.rs 88.88% 5 Missing ⚠️
rust/lance-table/src/format/index.rs 85.71% 0 Missing and 1 partial ⚠️
rust/lance/src/dataset/write/merge_insert.rs 97.05% 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #4033      +/-   ##
==========================================
+ Coverage   81.88%   81.97%   +0.08%     
==========================================
  Files         302      303       +1     
  Lines      123146   123444     +298     
  Branches   123146   123444     +298     
==========================================
+ Hits       100836   101188     +352     
+ Misses      18507    18463      -44     
+ Partials     3803     3793      -10     
Flag Coverage Δ
unittests 81.97% <85.43%> (+0.08%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

params,
name: None,
replace: false,
train: true,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we auto-set train to false if there is no data in the dataset?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah that sounds reasonable. I will try that.

wjones127 and others added 2 commits August 12, 2025 13:54
Instead of setting train parameter in the constructor (which failed due to
async calls in non-async context), implemented the logic inside execute():

- When train=true is requested, check if dataset.count_rows() > 0
- If dataset is empty, automatically set train=false
- If dataset has data, proceed with train=true as requested
- If train=false was explicitly requested, respect that choice

This allows users to always use train=true and have the system
automatically handle empty datasets appropriately.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
@wjones127 wjones127 requested a review from jackye1995 August 12, 2025 22:50
Copy link
Contributor

@jackye1995 jackye1995 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me!

@wjones127 wjones127 merged commit cd76a99 into lance-format:main Aug 13, 2025
26 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request java python

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Top-level indices part 1: empty index files

3 participants