Skip to content

[CHORE]: sync recent garbage collection PRs to last release branch #4838

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed

Conversation

## Description of changes

Reduces span spam and necessary nesting in traces.

A consequence is that spans that run in worker threads are no longer
tagged as such, but the task type is usually obvious when looking at the
code. If this is undesired I can add a task type span.

Before:

<img width="1943" alt="Screenshot 2025-06-12 at 09 55 48"
src="https://github.com/user-attachments/assets/e721968e-0742-4630-a342-b5e8fcad97da"
/>

After:

<img width="2221" alt="Screenshot 2025-06-12 at 09 57 46"
src="https://github.com/user-attachments/assets/673a19aa-4987-4c08-89d8-934ae495e6b4"
/>


## Test plan

_How are these changes tested?_

Verified in Jaeger.

## Documentation Changes

_Are all docstrings for user-facing APIs updated if required? Do we need
to make documentation changes in the [docs
section](https://github.com/chroma-core/chroma/tree/main/docs/docs.trychroma.com)?_

n/a
## Description of changes

In production, `max_collections_to_gc` is 10k which means that up to 10k
jobs can be running at the same time. This adds a limiter so that there
are at most 100 jobs running concurrently.

## Test plan

_How are these changes tested?_

- [x] Tests pass locally with `pytest` for python, `yarn test` for js,
`cargo test` for rust

## Documentation Changes

_Are all docstrings for user-facing APIs updated if required? Do we need
to make documentation changes in the [docs
section](https://github.com/chroma-core/chroma/tree/main/docs/docs.trychroma.com)?_

n/a
…debug (#4824)

## Description of changes

`trace` spans can log an unbounded amount of data, so this should help
us avoid running into exporter payload size limits.

## Test plan

_How are these changes tested?_

Looks good in Jaeger.

## Documentation Changes

_Are all docstrings for user-facing APIs updated if required? Do we need
to make documentation changes in the [docs
section](https://github.com/chroma-core/chroma/tree/main/docs/docs.trychroma.com)?_

n/a
## Description of changes

These changes will allow us to add
`go.opentelemetry.io/contrib/instrumentation/github.com/aws/aws-sdk-go-v2/otelaws`
for automatic tracing on S3 calls.

## Test plan

_How are these changes tested?_

- [x] Tests pass locally with `pytest` for python, `yarn test` for js,
`cargo test` for rust

## Documentation Changes

_Are all docstrings for user-facing APIs updated if required? Do we need
to make documentation changes in the [docs
section](https://github.com/chroma-core/chroma/tree/main/docs/docs.trychroma.com)?_

n/a
## Description of changes
 
This will help us debug where time is being spent.

## Test plan

_How are these changes tested?_

Looks good in Jaeger.

## Documentation Changes

_Are all docstrings for user-facing APIs updated if required? Do we need
to make documentation changes in the [docs
section](https://github.com/chroma-core/chroma/tree/main/docs/docs.trychroma.com)?_

n/a
- Moves all S3 spans to the `trace` level.
- All unbounded span events I could find are now removed or emitted at
the `trace` level.
- Prevents a misleading log.

_How are these changes tested?_

- [x] Tests pass locally with `pytest` for python, `yarn test` for js,
`cargo test` for rust

Looks good in Jaeger.

_Are all docstrings for user-facing APIs updated if required? Do we need
to make documentation changes in the [docs
section](https://github.com/chroma-core/chroma/tree/main/docs/docs.trychroma.com)?_

n/a
Copy link

Reviewer Checklist

Please leverage this checklist to ensure your code review is thorough before approving

Testing, Bugs, Errors, Logs, Documentation

  • Can you think of any use case in which the code does not behave as intended? Have they been tested?
  • Can you think of any inputs or external events that could break the code? Is user input validated and safe? Have they been tested?
  • If appropriate, are there adequate property based tests?
  • If appropriate, are there adequate unit tests?
  • Should any logging, debugging, tracing information be added or removed?
  • Are error messages user-friendly?
  • Have all documentation changes needed been made?
  • Have all non-obvious changes been commented?

System Compatibility

  • Are there any potential impacts on other parts of the system or backward compatibility?
  • Does this change intersect with any items on our roadmap, and if so, is there a plan for fitting them together?

Quality

  • Is this code of a unexpectedly high quality (Readability, Modularity, Intuitiveness)

@codetheweb codetheweb changed the title [ENH]: remove "Dispatcher" spans from tracing (#4829) [CHORE]: sync recent garbage collection PRs to last release branch Jun 12, 2025
@codetheweb codetheweb marked this pull request as ready for review June 12, 2025 23:17
Copy link
Contributor

Sync Garbage Collection Tracing and SysDb S3 SDK Upgrades to Release Branch

This PR backports several upstream changes focused on garbage collection improvements and modernization of SysDb's S3 integration. Key upgrades include migrating the Go S3 client from AWS SDK v1 to v2, introducing enhanced (and better-scoped) tracing for S3 and garbage collector operations, enforcing job concurrency limits in Rust garbage collection, and adjusting default tracing levels to reduce Jaeger span noise. The SysDb S3 code now uses context throughout, and all S3 API calls are traced at the 'trace' level.

Key Changes:
• Go S3 SDK upgraded from github.com/aws/aws-sdk-go (v1) to github.com/aws/aws-sdk-go-v2
• All SysDb S3 read/write/delete operations now accept context and emit OpenTelemetry spans at 'trace' level
• SysDb S3 API and coordinator refactored to use context-aware signatures throughout
• Go.mod updated with new dependencies (aws-sdk-go-v2, otelaws, and related) and dropped v1/v1-related deps
• Rust garbage collector: job concurrency is now limited with buffer_unordered (max 100 jobs), improving resource usage
• Tracing in Rust GC and S3 is consolidated: dispatcher-level traces removed to reduce span spam, span levels tuned to default to debug (vs. trace) for high-traffic ops
• Tracing verbosity for potentially unbounded log messages reduced (mostly hidden behind trace-level filtering)
• Multiple test utility and mock signatures updated to align with new APIs

Affected Areas:
• go/pkg/sysdb/metastore/s3/* (major refactor, context propagation, otel)
• go/go.mod (deps overhaul for new S3 SDK and OpenTelemetry support)
• go/pkg/sysdb/coordinator/table_catalog.go + related test mocks (context/call signature changes)
• rust/garbage_collector/src/garbage_collector_component.rs (job concurrency, clearer tracing)
• rust/garbage_collector/src/operators/* and src/execution/* (tracing refactors)
• rust/storage/src/s3.rs (tracing levels, spans)
• tracing/src/init_tracer.rs (trace level tuning)

This summary was automatically generated by @propel-code-bot

@codetheweb codetheweb closed this Jun 16, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant