Reduce universal compaction input lock time by forwarding intended compaction and re-picking #13633

Closed

Conversation

hx235
Contributor

@hx235 hx235 commented May 22, 2025

Context:
RocksDB currently selects the input files for a long-running compaction that outputs to the bottommost level, preventing those selected files from being picked by other compactions, but it does not execute the compaction immediately like other compactions. Instead, the compaction is forwarded to the Env::Priority::BOTTOM thread pool, where it waits (potentially for a long time) until a thread is ready to execute it. This extended L0 lock time in universal compaction caused write stalls and read performance regressions for our users.

Summary:
This PR eliminates the L0 lock time while a bottom-priority compaction waits to execute, by doing the following:

  • Create and forward an intended compaction that consists of only the last input file (or sorted run, if non-L0) instead of all the input files. This eliminates the locking of non-bottommost-level input files while waiting for a bottom-priority thread to become available (see the sketch after this list).
  • Re-pick a compaction that outputs to the max output level once the bottom-priority thread is ready to run
  • Refactor the universal compaction picking logic to make it cleaner and easier to force picking a compaction with the max output level when the bottom-priority thread is ready to run
  • Guard the feature behind a temporary option as requested
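
To make the intent concrete, here is a minimal, self-contained sketch of the two scheduling steps. It is not RocksDB's actual implementation: the types (SortedRun, Compaction) and function names (ForwardIntendedCompaction, RepickAtBottomPriority) are hypothetical stand-ins for illustration only.

```cpp
#include <optional>
#include <vector>

// Hypothetical stand-ins for illustration; not RocksDB's real classes/APIs.
struct SortedRun {
  int id = 0;
  bool being_compacted = false;  // run is "locked" as a compaction input
};

struct Compaction {
  std::vector<SortedRun*> inputs;
};

// Low-pri thread: instead of locking every intended input while the
// compaction waits in the bottom-pri pool, forward an "intended" compaction
// that locks only the last input run, keeping the remaining runs (notably
// L0 files) available to other compactions.
Compaction ForwardIntendedCompaction(std::vector<SortedRun*>& picked_inputs) {
  Compaction intended;
  SortedRun* last_input = picked_inputs.back();
  last_input->being_compacted = true;  // only this run stays locked
  intended.inputs.push_back(last_input);
  return intended;
}

// Bottom-pri thread: once it is ready to run, re-pick a compaction to the
// max output level based on the current LSM shape. It may pick nothing if
// the shape changed while waiting (the case covered by the new unit test).
std::optional<Compaction> RepickAtBottomPriority(
    const std::vector<SortedRun*>& current_runs, SortedRun* forwarded_run) {
  Compaction repicked;
  for (SortedRun* run : current_runs) {
    if (run == forwarded_run || !run->being_compacted) {
      repicked.inputs.push_back(run);
    }
  }
  if (repicked.inputs.empty()) {
    return std::nullopt;  // give up this compaction intent
  }
  return repicked;
}
```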

Test plan:

  • New unit test to cover a case not covered by existing tests: the bottom-priority thread re-picks a compaction but ends up picking nothing due to LSM shape changes
  • Adapted existing unit tests to verify various bottom-priority compaction behaviors with this new option
  • Stress test: `python3 tools/db_crashtest.py --simple blackbox --compaction_style=1 --target_file_size_base=1000 --write_buffer_size=1000 --compact_range_one_in=10000 --compact_files_one_in=10000`
  • db_bench: `./db_bench --use_existing_db=1 --db=/tmp/dbbench_copy_2 --universal_reduce_file_locking={0,1} --max_background_jobs=3 --disable_auto_compactions=0 --enable_pipelined_write=0 --benchmarks=overwrite,waitforcompaction --num=100000 --compaction_style=1 --disable_wal=1 --write_buffer_size=50 --target_file_size_base=50 --num_high_pri_threads=1 --num_low_pri_threads=1 --num_bottom_pri_threads=1 --statistics=1`
    • Metrics to observe: rocksdb.stall.micros, rocksdb.compact.write.bytes, rocksdb.bytes.written
    • Without starting the db with a high number of L0s or injecting slowness (by sleeping) in the bottom-priority thread, there is no difference between universal_reduce_file_locking={0,1}
    • By starting the db with close to 200 L0s (the write stall threshold) and injecting 50 seconds of sleep in each bottom-priority thread before calling BackgroundCallCompaction(), there is a huge difference in rocksdb.stall.micros but no concerning write amplification increase. Similar effects can be observed with 1 or 5 seconds of injected sleep.
// universal_reduce_file_locking = 0
rocksdb.stall.micros COUNT : 53778696 // similar to the duration of the bottom-priority thread sleep
rocksdb.compact.write.bytes COUNT : 32014010
rocksdb.bytes.written COUNT : 13100000  // compaction write amplification: x2.44

// universal_reduce_file_locking = 1
rocksdb.stall.micros COUNT : 1394731 (-97%) 
rocksdb.compact.write.bytes COUNT : 25552026
rocksdb.bytes.written COUNT : 13100000 // compaction write amplification: x1.95
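
(For reference, the write-amplification figures above appear to be rocksdb.compact.write.bytes divided by rocksdb.bytes.written: 32014010 / 13100000 ≈ 2.44 and 25552026 / 13100000 ≈ 1.95.)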

@hx235 hx235 force-pushed the scheduling_work_no_file_lock_wait_compact branch from 6e2e976 to c23b195 Compare May 22, 2025 18:13
@hx235 hx235 changed the title Pick universal compaction based on thread pri Forward intended compaction and (Re-)pick universal compaction based on thread priority May 22, 2025
@hx235 hx235 force-pushed the scheduling_work_no_file_lock_wait_compact branch from c23b195 to 2ba479a Compare May 22, 2025 19:55
@hx235 hx235 requested a review from cbi42 May 22, 2025 22:45
@hx235 hx235 marked this pull request as draft May 28, 2025 00:02
@hx235 hx235 marked this pull request as ready for review May 28, 2025 00:03
- } else if (!is_prepicked && !compaction_queue_.empty()) {
-   if (HasExclusiveManualCompaction()) {
+ } else if (ShouldPickCompaction(is_prepicked, prepicked_compaction)) {
+   if (!is_prepicked && HasExclusiveManualCompaction()) {
Member

Should we also respect exclusive manual compaction here even for the major compaction intent?

Contributor Author

To be fixed

Contributor Author

@hx235 hx235 Jun 4, 2025

Fixed. There is something subtle: we have to fully give up this prepicked compaction and let the low-pri thread naturally prepick it again, instead of keeping this prepicked compaction (i.e., bg_bottom_compaction_scheduled_++), which would lead to a deadlock. This is because when an exclusive manual compaction waits for background compactions to finish after registering its compaction into the manual queue, it expects the number of scheduled background compactions to eventually reach 0.

AddManualCompaction(&manual);
TEST_SYNC_POINT_CALLBACK("DBImpl::RunManualCompaction:NotScheduled", &mutex_);
if (exclusive) {
  // Limitation: there's no way to wake up the below loop when user sets
  // `*manual.canceled`. So `CompactRangeOptions::exclusive_manual_compaction`
  // and `CompactRangeOptions::canceled` might not work well together.
  while (bg_bottom_compaction_scheduled_ > 0 ||
         bg_compaction_scheduled_ > 0) {

@hx235 hx235 force-pushed the scheduling_work_no_file_lock_wait_compact branch from 2ba479a to 809f78e Compare May 31, 2025 03:32
@hx235 hx235 marked this pull request as draft May 31, 2025 03:32
@hx235 hx235 changed the title Forward intended compaction and (Re-)pick universal compaction based on thread priority [WIP] Forward intended compaction and (Re-)pick universal compaction based on thread priority May 31, 2025
@hx235 hx235 force-pushed the scheduling_work_no_file_lock_wait_compact branch 2 times, most recently from 32c7fb1 to c4cc016 Compare June 4, 2025 21:24
@hx235 hx235 changed the title [WIP] Forward intended compaction and (Re-)pick universal compaction based on thread priority Forward intended compaction and (Re-)pick universal compaction based on thread priority Jun 4, 2025
@hx235 hx235 requested a review from cbi42 June 4, 2025 23:32
@hx235 hx235 marked this pull request as ready for review June 4, 2025 23:32
@facebook-github-bot
Contributor

@hx235 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

Comment on lines +3764 to +3759
EnqueuePendingCompaction(cfd);
MaybeScheduleFlushOrCompaction();
Contributor Author

@hx235 hx235 Jun 6, 2025

@cbi42 - I replaced AddToCompactionQueue with EnqueuePendingCompaction here and in other places in this file in order to respect the precondition before adding. It's especially important here because by the time the bottom-pri thread reaches this point, the low-pri thread may have already enqueued the cfd, since the bottom-pri thread references the cfd from the CompactionArg rather than from the queue.
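
For context, here is a minimal sketch of the precondition in question, using hypothetical stand-in types (ColumnFamilyDataSketch, CompactionQueueSketch) rather than DBImpl's real code:

```cpp
#include <deque>

// Hypothetical stand-ins for illustration; not DBImpl's real code.
struct ColumnFamilyDataSketch {
  bool queued_for_compaction = false;
};

struct CompactionQueueSketch {
  std::deque<ColumnFamilyDataSketch*> queue_;

  // Respect the precondition: only enqueue a cfd that is not already queued.
  // By the time the bottom-pri thread reaches this point, the low-pri thread
  // may have already enqueued the same cfd, since the bottom-pri thread holds
  // the cfd via its CompactionArg rather than popping it from the queue; a
  // blind push here would double-add it.
  void EnqueuePendingCompaction(ColumnFamilyDataSketch* cfd) {
    if (!cfd->queued_for_compaction) {
      cfd->queued_for_compaction = true;
      queue_.push_back(cfd);
    }
  }
};
```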

Member

@cbi42 cbi42 left a comment

Looks great! Left some comments on details.

if (HasExclusiveManualCompaction()) {
  // Can't compact right now, but try again later
  TEST_SYNC_POINT("DBImpl::BackgroundCompaction()::Conflict");

  // Stay in the compaction queue.
  unscheduled_compactions_++;
  if (need_repick) {
Member

Sorry, I did not realize that letting the major compaction run is the existing behavior. I'm not sure if you want to change the existing behavior here when enabling your feature. May be a good follow up.

Contributor Author

@hx235 hx235 Jun 10, 2025

Letting the major compaction run (but not letting it be picked) is the existing behavior. Either way is fine with me. Reverted my change and added an extra sync point to test that.

@hx235 hx235 force-pushed the scheduling_work_no_file_lock_wait_compact branch from c4cc016 to ba17d42 Compare June 10, 2025 05:24
@facebook-github-bot
Contributor

@hx235 has updated the pull request. You must reimport the pull request before landing.

@hx235 hx235 requested a review from cbi42 June 10, 2025 14:25
@hx235 hx235 changed the title Forward intended compaction and (Re-)pick universal compaction based on thread priority Forward intended compaction and repick universal compaction based on thread priority Jun 11, 2025
@hx235 hx235 changed the title Forward intended compaction and repick universal compaction based on thread priority Reduce universal compaction input lock time with forwarding intended compaction Jun 11, 2025
@hx235 hx235 changed the title Reduce universal compaction input lock time with forwarding intended compaction Reduce universal compaction input lock time by forwarding intended compaction and re-picking Jun 11, 2025
// of input files when bottom priority compactions are waiting to run. This
// can increase the likelihood of existing L0s being selected for compaction,
// thereby reducing write stalls and read regressions. It may increase
// the overall write amplification and compaction load on low priority
Member

I wonder if we can db_bench this. With a high write rate, we can measure the write-amp and write stall rate with this option on and off.

Contributor Author

Will try that! Thanks!

Member

@cbi42 cbi42 left a comment

LGTM!

@facebook-github-bot
Contributor

@hx235 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@hx235 hx235 force-pushed the scheduling_work_no_file_lock_wait_compact branch from ba17d42 to 2465974 Compare June 12, 2025 16:41
@facebook-github-bot
Contributor

@hx235 has updated the pull request. You must reimport the pull request before landing.

@facebook-github-bot
Contributor

@hx235 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@hx235 hx235 force-pushed the scheduling_work_no_file_lock_wait_compact branch from 2465974 to 08e9451 Compare June 13, 2025 00:14
@facebook-github-bot
Contributor

@hx235 has updated the pull request. You must reimport the pull request before landing.

@facebook-github-bot
Contributor

@hx235 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@facebook-github-bot
Contributor

@hx235 merged this pull request in 02bce9b.

SoujanyaPonnapalli pushed a commit to tyler-griggs/rocksdb that referenced this pull request Jul 20, 2025
…mpaction and re-picking (facebook#13633)

Pull Request resolved: facebook#13633

Reviewed By: cbi42

Differential Revision: D76005505

Pulled By: hx235

fbshipit-source-id: 9688f22d4a84f619452820f12f15b765c17301fd