Skip to content

[ENH]: Plumb prefix path all the way to the bf writer #4743

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 5 commits into
base: 05-30-_enh_return_database_id_in_get_collections_call_from_sysdb
Choose a base branch
from

Conversation

sanketkedia
Copy link
Contributor

@sanketkedia sanketkedia commented Jun 3, 2025

Description of changes

Summarize the changes made by this PR.

  • Improvements & Bug fixes
    • Plumbs the tenant and database id from compactor orchestrator to the segment writer.
    • The segment writer constructs the prefix path as per format: tenant/{tenant}/database/{database_id}/collection/{collection_id}/segment/{segment_id}
    • It then passes this prefix in the construction of various blockfile writers as part of the write options
  • New functionality
    • ...

Test plan

How are these changes tested?

  • Tests pass locally with pytest for python, yarn test for js, cargo test for rust

Documentation Changes

None

Copy link
Contributor Author

sanketkedia commented Jun 3, 2025

Warning

This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.
Learn more

This stack of pull requests is managed by Graphite. Learn more about stacking.

@sanketkedia sanketkedia marked this pull request as ready for review June 3, 2025 23:27
Copy link
Contributor

propel-code-bot bot commented Jun 3, 2025

Propagate Prefix Path to Blockfile Writer for Per-Segment S3 Hierarchy

This PR introduces explicit plumbed prefix pathing, passing tenant and database ID from the compaction orchestrator all the way through to the blockfile writer (bf writer) in the storage layer. The prefix path, formatted as tenant/{tenant}/database/{database_id}/collection/{collection_id}/segment/{segment_id}, is now constructed within segment writers and used in all blockfile writer and reader construction, affecting record, metadata, and vector/blog-indexing segments. Comprehensive refactoring was performed to enforce explicit prefix path propagation, replace default constructions, and update tests, benchmarks, and library usage accordingly.

Key Changes:
• The BlockfileWriterOptions struct now requires a prefix_path parameter (no default constructor).
• Segment writers (record, metadata, vector) now take tenant and database_id arguments and construct the per-segment prefix path.
• All downstream calls to blockfile_provider.write and related APIs now pass the explicit prefix_path.
• Test, benchmark, and library usage updated for new constructor and plumbed arguments.
• Collection/segment test code updated to assign new, unique DatabaseUuid for each synthetic segment.

Affected Areas:
• blockstore (BlockfileWriterOptions, BlockfileWriter, writer/readers in arrow/ordered/unordered modules)
• segment: blockfile_metadata.rs, blockfile_record.rs, types.rs, test.rs
• index: metadata/types.rs, fulltext/types.rs, spann/types.rs
• worker (compactor, execution operators)
• tests and benches (various, including concurrency and fuzzing tests)
• types: collection.rs (DatabaseUuid, Display impl)

Potential Impact:

Functionality: Blockfile-level operations are now fully namespace-scoped to the correct tenant/database/collection/segment paths; improper or missing prefix propagation will now fail at construction.

Performance: None anticipated for runtime paths; test/bench setup slightly more explicit, marginally more overhead in constructing options.

Security: Improves logical data isolation by enforcing per-tenant/db/collection/segment scoping.

Scalability: Improves maintainability and future scalability for multi-tenant deployments and S3 pathing, ensures explicit partitioning at storage level.

Review Focus:
• Correctness and completeness of prefix_path propagation (no missing or empty paths).
• No accidental reintroduction of default BlockfileWriterOptions construction.
• Upstream/downstream impacts: blockfile interacts as expected in orchestrators, segment writers, and index layers.
• No test or fuzz code relying on legacy implicit paths.

Testing Needed

• Run full suite of unit tests, integration tests, and proptests for blockfile, segment, and index paths.
• Sanity-check S3 and local file storage outputs for correct folder hierarchy.
• Review all initialization/call-sites for BlockfileWriterOptions (no use of default(), all pass correct path).

Code Quality Assessment

rust/types/src/collection.rs: DatabaseUuid added and now present in per-segment path construction.

rust/blockstore/src/types/writer_options.rs: Refactored to disallow default options, forces explicit prefix. Author notes that this is intentional to avoid silent errors.

rust/segment/src/blockfile_metadata.rs: Signature change and full adoption of tenant/db/coll/seg prefix propagation. No violations of explicitness; test code modernized.

rust/segment/src/blockfile_record.rs: Mirror changes as above; read/write code now correctly includes contextual path.

tests/benches: Updated; all calls now pass prefix_path. Benchmarks assign prefix (generic, empty unless operationalized, but passes the compiler gate).

Best Practices

Test-Maintenance:
• Updated all affected tests and benchmarks.
• Validates upgrade path for complex refactors.

Multi-Tenancy/Scoping:
• Properly scopes resource at the lowest (storage) level.

Api-Design:
• Enforces explicit construction; removes reliance on error-prone defaults.
• Force flows to declare their context, improving maintainability and security.

Potential Issues

• Downstream/unupdated out-of-tree project code will break upon upgrade due to breaking changes in BlockfileWriterOptions construction.
• Any missed or incorrectly-constructed prefix_path will cause immediate construction-time or runtime errors.
• Test/benchmark prefixes currently use empty strings; in real scenarios, these must be meaningful/namespaced for S3 partitioning.

This summary was automatically generated by @propel-code-bot

}

impl BlockfileWriterOptions {
pub fn new() -> Self {
Self::default()
pub fn new(prefix_path: String) -> Self {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is intentional to force all callers to necessarily think about and set a prefix path. There is no default constructor now.

@sanketkedia sanketkedia force-pushed the 05-30-_enh_return_database_id_in_get_collections_call_from_sysdb branch from 4411701 to f9a9e60 Compare June 10, 2025 15:41
@sanketkedia sanketkedia force-pushed the 06-03-_enh_plumb_prefix_path_all_the_way_to_the_bf_writer branch from 573315d to c24e5b0 Compare June 10, 2025 15:42
@sanketkedia sanketkedia force-pushed the 06-03-_enh_plumb_prefix_path_all_the_way_to_the_bf_writer branch from c24e5b0 to a220d89 Compare June 10, 2025 15:48
@@ -1928,9 +1945,9 @@ mod tests {

// Test that a v1.1 writer can read a v1 blockfile and dirty a block
// successfully hydrating counts for ALL blocks it needs to set counts for

let prefix_path = String::from("");
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit - create a const to hold this

@@ -536,7 +536,9 @@ impl SpannIndexWriter {
blockfile_id: &Uuid,
blockfile_provider: &BlockfileProvider,
) -> Result<BlockfileWriter, SpannIndexWriterError> {
let mut bf_options = BlockfileWriterOptions::new();
// TODO(Sanket-temp): Change this suitably
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why this TODO?

Copy link
Collaborator

@HammadB HammadB left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Do we want to add any basic sanity testing around the migration of the path?
  2. Should we align with other services prefixes?

segment_id: &SegmentUuid,
) -> String {
format!(
"{}/{}/{}/{}",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is different than the prefix used for version files and lineage files ({}/databases/{}/collections/{}), I think we should use the same prefix across everything

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants