[ENH] Add block-level metrics #4801

Open: wants to merge 8 commits into base: main

@tanujnay112 tanujnay112 commented Jun 10, 2025

Description of changes

Adds block-level metrics. The current version of this PR adds the following metrics, which can be viewed from Prometheus:

  • block_num_get_requests
  • block_commit_latency
  • block_num_blocks_flushed
  • block_flush_latency

  • Improvements & Bug fixes
    • Got rid of a few extraneous tracing spans.
  • New functionality
    • New metrics listed above

Test plan

How are these changes tested?

  • [x] Tests pass locally with pytest for Python, yarn test for JS, cargo test for Rust
  • Manually verified the metric changes on local Prometheus after disabling caching.

Documentation Changes

Are all docstrings for user-facing APIs updated if required? Do we need to make documentation changes in the docs section?

@tanujnay112 tanujnay112 requested a review from sanketkedia June 10, 2025 01:16
@tanujnay112 tanujnay112 changed the title added s3 get metrics and verified on prometheus [ENH] Added s3 get metrics and verified on prometheus Jun 10, 2025

propel-code-bot bot commented Jun 10, 2025

Add Block-Level Metrics and Refactor Tracing/Instrumentation

This PR introduces block-level Prometheus metrics to the block store layer, specifically tracking cold block get requests, commit and flush latencies, and number of blocks flushed. It also streamlines the tracing infrastructure for block operations, introduces a utility to access the current trace ID, and removes some previously extraneous tracing spans.

Key Changes:
• Introduces new Prometheus metrics: block_num_get_requests, block_commit_latency, block_num_blocks_flushed, block_flush_latency.
• Adds a BlockMetrics struct to the block manager with OpenTelemetry histograms for these metrics.
• Records metrics at key points in block commit, get, and flush operations and associates them with the current trace ID.
• Refactors S3 storage tracing: for cold gets, uses more targeted spans and moves metric emission closer to the actual S3 API call.
• Adds get_current_trace_id utility for unified trace context propagation.
• Removes extraneous tracing spans and cleans up tracing code.
• Updates dependencies in Cargo.toml and Cargo.lock to support the new tracing and metric functionality.
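
The latency and flush-count recording described in the list above can be sketched roughly as follows. This is a minimal std-only illustration, not the PR's code: `Histogram` is a hypothetical stand-in that only stores raw samples (the PR uses OpenTelemetry histograms), `timed` and `flush_blocks` are invented helper names, and recording latency in microseconds is an assumption. Only the three field names are taken from the metric names listed in the PR.

```rust
use std::sync::Mutex;
use std::time::Instant;

/// Hypothetical stand-in for a metrics histogram: it only stores raw samples.
/// It does not reproduce the OpenTelemetry histogram API used in the PR.
#[derive(Default)]
pub struct Histogram {
    samples: Mutex<Vec<u64>>,
}

impl Histogram {
    pub fn record(&self, value: u64) {
        self.samples.lock().unwrap().push(value);
    }

    pub fn count(&self) -> usize {
        self.samples.lock().unwrap().len()
    }

    pub fn sum(&self) -> u64 {
        self.samples.lock().unwrap().iter().sum()
    }
}

/// Field names mirror the metric names listed in the PR description.
#[derive(Default)]
pub struct BlockMetrics {
    pub block_commit_latency: Histogram,
    pub block_num_blocks_flushed: Histogram,
    pub block_flush_latency: Histogram,
}

/// Runs `op`, records its wall-clock duration (microseconds, an assumed unit),
/// and returns whatever `op` returned.
pub fn timed<T>(histogram: &Histogram, op: impl FnOnce() -> T) -> T {
    let start = Instant::now();
    let out = op();
    histogram.record(start.elapsed().as_micros() as u64);
    out
}

/// Hypothetical flush: pretends to write `blocks` and reports how many were flushed.
pub fn flush_blocks(metrics: &BlockMetrics, blocks: &[&str]) -> usize {
    let flushed = timed(&metrics.block_flush_latency, || blocks.len());
    metrics.block_num_blocks_flushed.record(flushed as u64);
    flushed
}
```

Wrapping the operation in a helper like `timed` keeps the measurement next to the call it measures, so the latency sample and the flushed-block count are recorded once per flush.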

Affected Areas:
• rust/blockstore/src/arrow/provider.rs
• rust/blockstore/Cargo.toml
• rust/frontend/src/tower_tracing.rs
• rust/storage/src/s3.rs
• rust/tracing/src/util.rs
• Cargo.lock

This summary was automatically generated by @propel-code-bot

github-actions bot commented Jun 10, 2025

Reviewer Checklist

Please leverage this checklist to ensure your code review is thorough before approving

Testing, Bugs, Errors, Logs, Documentation

  • Can you think of any use case in which the code does not behave as intended? Have they been tested?
  • Can you think of any inputs or external events that could break the code? Is user input validated and safe? Have they been tested?
  • If appropriate, are there adequate property based tests?
  • If appropriate, are there adequate unit tests?
  • Should any logging, debugging, tracing information be added or removed?
  • Are error messages user-friendly?
  • Have all documentation changes needed been made?
  • Have all non-obvious changes been commented?

System Compatibility

  • Are there any potential impacts on other parts of the system or backward compatibility?
  • Does this change intersect with any items on our roadmap, and if so, is there a plan for fitting them together?

Quality

  • Is this code of an unexpectedly high quality (Readability, Modularity, Intuitiveness)?

Please tag your PR title with one of: [ENH | BUG | DOC | TST | BLD | PERF | TYP | CLN | CHORE]. See https://docs.trychroma.com/contributing#contributing-code-and-ideas

@tanujnay112 tanujnay112 changed the title [ENH] Added s3 get metrics and verified on prometheus [ENH] Added object store-level metrics Jun 10, 2025
@tanujnay112 tanujnay112 changed the title [ENH] Added object store-level metrics [ENH] Ad object store-level metrics Jun 11, 2025
@tanujnay112 tanujnay112 changed the title [ENH] Ad object store-level metrics [ENH] Add object store-level metrics Jun 11, 2025

impl S3StorageMetrics {
    pub fn new(meter: Meter) -> Self {
        Self {
            total_num_get_requests: meter

nit: num_get_requests()?

    pub fn new(meter: Meter) -> Self {
        Self {
            total_num_get_requests: meter
                .u64_counter("s3_storage_total_num_get_requests")

nit: s3_num_get_requests?

@@ -125,6 +145,7 @@ impl S3Storage {
        match res {
            Ok(res) => {
                let byte_stream = res.body;
                self.metrics.total_num_get_requests.add(1, &[]);

why not in get_with_e_tag?

Contributor Author

I put it closer to where the S3 call happens, in case someone else calls get_stream_and_e_tag directly in the future. Right now get_with_e_tag calls get_stream_and_e_tag, which is where this snippet lives.
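
The placement argument in this reply can be illustrated with a small std-only sketch. Only the two method names, `get_stream_and_e_tag` and `get_with_e_tag`, come from the thread. The `Storage` type, the `AtomicU64` counter, the return types, and the use of an empty key to simulate a failed request are all assumptions made for illustration.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Hypothetical storage type; only the two method names come from the thread.
#[derive(Default)]
pub struct Storage {
    num_get_requests: AtomicU64,
}

impl Storage {
    /// Innermost call: the place where the S3 request would be issued.
    pub fn get_stream_and_e_tag(&self, key: &str) -> Result<(Vec<u8>, String), String> {
        // Stand-in for the S3 call: an empty key plays the role of a failed request.
        let res: Result<Vec<u8>, String> = if key.is_empty() {
            Err("missing key".to_string())
        } else {
            Ok(key.as_bytes().to_vec())
        };
        match res {
            Ok(body) => {
                // Counting in the success arm here covers every caller of this method.
                self.num_get_requests.fetch_add(1, Ordering::Relaxed);
                Ok((body, format!("etag-{}", key)))
            }
            Err(e) => Err(e),
        }
    }

    /// Wrapper that delegates to the inner call, as described in the reply.
    pub fn get_with_e_tag(&self, key: &str) -> Result<(Vec<u8>, String), String> {
        self.get_stream_and_e_tag(key)
    }

    pub fn num_get_requests(&self) -> u64 {
        self.num_get_requests.load(Ordering::Relaxed)
    }
}
```

With the increment in the inner method, a request made through the wrapper and one made directly are both counted, while a failed request is not. Moving the increment up into `get_with_e_tag` would miss any future direct caller.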

@tanujnay112 tanujnay112 force-pushed the tanuj/object_store_metrics branch 2 times, most recently from b9263ee to 216fc54 Compare June 13, 2025 20:31

let trace_id = get_current_trace_id().to_string();
let attribute = [KeyValue::new("trace_id", trace_id)];
self.metrics.num_get_requests.record(1, &attribute);

Could we track and emit these metrics at the block manager level?
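
One way to read this suggestion is that the block manager, which sits in front of storage, counts only the gets that miss its cache. The sketch below is a std-only illustration under that assumption: `BlockManager`, its `HashMap` cache, and the `fetch_from_storage` closure are hypothetical, and a plain `u64` field stands in for the real metric instrument. Only the name `block_num_get_requests` is taken from the PR.

```rust
use std::collections::HashMap;

/// Hypothetical block manager with an in-memory cache in front of a fetch function.
pub struct BlockManager<F: Fn(u32) -> Vec<u8>> {
    cache: HashMap<u32, Vec<u8>>,
    fetch_from_storage: F,
    /// Counts only cold gets, i.e. requests that miss the cache.
    pub block_num_get_requests: u64,
}

impl<F: Fn(u32) -> Vec<u8>> BlockManager<F> {
    pub fn new(fetch_from_storage: F) -> Self {
        Self {
            cache: HashMap::new(),
            fetch_from_storage,
            block_num_get_requests: 0,
        }
    }

    /// Returns the block, fetching it from storage and counting the request on a miss.
    pub fn get(&mut self, id: u32) -> &[u8] {
        if !self.cache.contains_key(&id) {
            self.block_num_get_requests += 1;
            let block = (self.fetch_from_storage)(id);
            self.cache.insert(id, block);
        }
        &self.cache[&id]
    }
}
```

In this sketch a repeated get for the same block does not move the counter, which is consistent with the test plan's note that the metrics were verified after disabling caching.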

@tanujnay112 tanujnay112 changed the title [ENH] Add object store-level metrics [ENH] Add block-level metrics Jun 16, 2025
@tanujnay112 tanujnay112 force-pushed the tanuj/object_store_metrics branch from b842838 to a3e8545 Compare June 17, 2025 17:31
@tanujnay112 tanujnay112 requested a review from sanketkedia June 17, 2025 23:54