Reduce universal compaction input lock time by forwarding intended compaction and re-picking #13633

Closed

Conversation

hx235
Contributor

@hx235 hx235 commented May 22, 2025

Context:
RocksDB currently selects the input files for a long-running compaction that outputs to the bottommost level, preventing those selected files from being picked by other compactions, but it does not execute the compaction immediately like other compactions. Instead, the compaction is forwarded to the Env::Priority::BOTTOM thread pool, where it waits (potentially for a long time) until a thread is ready to execute it. This extended L0 lock time in universal compaction caused write stalls and read performance regressions for our users.

Summary:
This PR eliminates the L0 lock time while a bottom-priority compaction waits to execute, by doing the following:

  • Create and forward an intended compaction that consists of only the last input file (or sorted run, if non-L0) instead of all the input files. This eliminates the locking of non-bottommost-level input files while waiting for a bottom-priority thread to become available (see the sketch after this list).
  • Re-pick a compaction that outputs to the max output level once the bottom-priority thread is ready to run
  • Refactor the universal compaction picking logic to make it cleaner and easier to force picking a compaction with the max output level when the bottom-priority thread is ready to run
  • Guard the feature behind a temporary option as requested
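
To make the intent concrete, here is a minimal, self-contained sketch of the two scheduling steps. It is not RocksDB's actual implementation: the types (SortedRun, Compaction) and function names (ForwardIntendedCompaction, RepickAtBottomPriority) are hypothetical stand-ins for illustration only.

```cpp
#include <optional>
#include <vector>

// Hypothetical stand-ins for illustration; not RocksDB's real classes/APIs.
struct SortedRun {
  int id = 0;
  bool being_compacted = false;  // run is "locked" as a compaction input
};

struct Compaction {
  std::vector<SortedRun*> inputs;
};

// Low-pri thread: instead of locking every intended input while the
// compaction waits in the bottom-pri pool, forward an "intended" compaction
// that locks only the last input run, keeping the remaining runs (notably
// L0 files) available to other compactions.
Compaction ForwardIntendedCompaction(std::vector<SortedRun*>& picked_inputs) {
  Compaction intended;
  SortedRun* last_input = picked_inputs.back();
  last_input->being_compacted = true;  // only this run stays locked
  intended.inputs.push_back(last_input);
  return intended;
}

// Bottom-pri thread: once it is ready to run, re-pick a compaction to the
// max output level based on the current LSM shape. It may pick nothing if
// the shape changed while waiting (the case covered by the new unit test).
std::optional<Compaction> RepickAtBottomPriority(
    const std::vector<SortedRun*>& current_runs, SortedRun* forwarded_run) {
  Compaction repicked;
  for (SortedRun* run : current_runs) {
    if (run == forwarded_run || !run->being_compacted) {
      repicked.inputs.push_back(run);
    }
  }
  if (repicked.inputs.empty()) {
    return std::nullopt;  // give up this compaction intent
  }
  return repicked;
}
```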

Test plan:

  • New unit test to cover a case not covered by existing tests: the bottom-priority thread re-picks a compaction but ends up picking nothing due to LSM shape changes
  • Adapted existing unit tests to verify various bottom-priority compaction behaviors with this new option
  • Stress test: `python3 tools/db_crashtest.py --simple blackbox --compaction_style=1 --target_file_size_base=1000 --write_buffer_size=1000 --compact_range_one_in=10000 --compact_files_one_in=10000`
  • db_bench: `./db_bench --use_existing_db=1 --db=/tmp/dbbench_copy_2 --universal_reduce_file_locking={0,1} --max_background_jobs=3 --disable_auto_compactions=0 --enable_pipelined_write=0 --benchmarks=overwrite,waitforcompaction --num=100000 --compaction_style=1 --disable_wal=1 --write_buffer_size=50 --target_file_size_base=50 --num_high_pri_threads=1 --num_low_pri_threads=1 --num_bottom_pri_threads=1 --statistics=1`
    • Metrics to observe: rocksdb.stall.micros, rocksdb.compact.write.bytes, rocksdb.bytes.written
    • Without starting the db with a high number of L0s or injecting slowness (by sleeping) in the bottom-priority thread, there is no difference between universal_reduce_file_locking={0,1}
    • By starting the db with close to 200 L0s (the write stall threshold) and injecting 50 seconds of sleep in each bottom-priority thread before calling BackgroundCallCompaction(), there is a huge difference in rocksdb.stall.micros but no concerning write amplification increase. Similar effects can be observed with 1 or 5 seconds of injected sleep.
// universal_reduce_file_locking = 0
rocksdb.stall.micros COUNT : 53778696 // similar to the duration of the bottom-priority thread sleep
rocksdb.compact.write.bytes COUNT : 32014010
rocksdb.bytes.written COUNT : 13100000  // compaction write amplification: x2.44

// universal_reduce_file_locking = 1
rocksdb.stall.micros COUNT : 1394731 (-97%) 
rocksdb.compact.write.bytes COUNT : 25552026
rocksdb.bytes.written COUNT : 13100000 // compaction write amplification: x1.95
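
(For reference, the write-amplification figures above appear to be rocksdb.compact.write.bytes divided by rocksdb.bytes.written: 32014010 / 13100000 ≈ 2.44 and 25552026 / 13100000 ≈ 1.95.)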

@hx235 hx235 force-pushed the scheduling_work_no_file_lock_wait_compact branch from 6e2e976 to c23b195 Compare May 22, 2025 18:13
@hx235 hx235 changed the title Pick universal compaction based on thread pri Forward intended compaction and (Re-)pick universal compaction based on thread priority May 22, 2025
@hx235 hx235 force-pushed the scheduling_work_no_file_lock_wait_compact branch from c23b195 to 2ba479a Compare May 22, 2025 19:55
@hx235 hx235 requested a review from cbi42 May 22, 2025 22:45
@hx235 hx235 marked this pull request as draft May 28, 2025 00:02
@hx235 hx235 marked this pull request as ready for review May 28, 2025 00:03
- } else if (!is_prepicked && !compaction_queue_.empty()) {
-   if (HasExclusiveManualCompaction()) {
+ } else if (ShouldPickCompaction(is_prepicked, prepicked_compaction)) {
+   if (!is_prepicked && HasExclusiveManualCompaction()) {
Member

Should we also respect exclusive manual compaction here even for the major compaction intent?

Contributor Author

To be fixed

Contributor Author

@hx235 hx235 Jun 4, 2025

Fixed. There is something subtle: we have to fully give up this prepicked compaction and let the low-pri thread naturally prepick it again, instead of keeping this prepicked compaction (i.e., bg_bottom_compaction_scheduled_++), which would lead to a deadlock. This is because when an exclusive manual compaction waits for background compactions to finish after registering its compaction into the manual queue, it expects the number of scheduled background compactions to eventually reach 0.

AddManualCompaction(&manual);
TEST_SYNC_POINT_CALLBACK("DBImpl::RunManualCompaction:NotScheduled", &mutex_);
if (exclusive) {
  // Limitation: there's no way to wake up the below loop when user sets
  // `*manual.canceled`. So `CompactRangeOptions::exclusive_manual_compaction`
  // and `CompactRangeOptions::canceled` might not work well together.
  while (bg_bottom_compaction_scheduled_ > 0 ||
         bg_compaction_scheduled_ > 0) {

@hx235 hx235 force-pushed the scheduling_work_no_file_lock_wait_compact branch from 2ba479a to 809f78e Compare May 31, 2025 03:32
@hx235 hx235 marked this pull request as draft May 31, 2025 03:32
@hx235 hx235 changed the title Forward intended compaction and (Re-)pick universal compaction based on thread priority [WIP] Forward intended compaction and (Re-)pick universal compaction based on thread priority May 31, 2025
@hx235 hx235 force-pushed the scheduling_work_no_file_lock_wait_compact branch 2 times, most recently from 32c7fb1 to c4cc016 Compare June 4, 2025 21:24
@hx235 hx235 changed the title [WIP] Forward intended compaction and (Re-)pick universal compaction based on thread priority Forward intended compaction and (Re-)pick universal compaction based on thread priority Jun 4, 2025
@hx235 hx235 requested a review from cbi42 June 4, 2025 23:32
@hx235 hx235 marked this pull request as ready for review June 4, 2025 23:32
@facebook-github-bot
Contributor

@hx235 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

Comment on lines +3764 to +3759
EnqueuePendingCompaction(cfd);
MaybeScheduleFlushOrCompaction();
Contributor Author

@hx235 hx235 Jun 6, 2025

@cbi42 - I replaced AddToCompactionQueue with EnqueuePendingCompaction here and in other places in this file in order to respect the precondition before adding. It's especially important here because by the time the bottom-pri thread reaches this point, the low-pri thread may have already enqueued the cfd, since the bottom-pri thread references the cfd from the CompactionArg rather than from the queue.
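
For context, here is a minimal sketch of the precondition in question, using hypothetical stand-in types (ColumnFamilyDataSketch, CompactionQueueSketch) rather than DBImpl's real code:

```cpp
#include <deque>

// Hypothetical stand-ins for illustration; not DBImpl's real code.
struct ColumnFamilyDataSketch {
  bool queued_for_compaction = false;
};

struct CompactionQueueSketch {
  std::deque<ColumnFamilyDataSketch*> queue_;

  // Respect the precondition: only enqueue a cfd that is not already queued.
  // By the time the bottom-pri thread reaches this point, the low-pri thread
  // may have already enqueued the same cfd, since the bottom-pri thread holds
  // the cfd via its CompactionArg rather than popping it from the queue; a
  // blind push here would double-add it.
  void EnqueuePendingCompaction(ColumnFamilyDataSketch* cfd) {
    if (!cfd->queued_for_compaction) {
      cfd->queued_for_compaction = true;
      queue_.push_back(cfd);
    }
  }
};
```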

Member

@cbi42 cbi42 left a comment

Looks great! Left some comments on details.

if (HasExclusiveManualCompaction()) {
  // Can't compact right now, but try again later
  TEST_SYNC_POINT("DBImpl::BackgroundCompaction()::Conflict");

  // Stay in the compaction queue.
  unscheduled_compactions_++;
  if (need_repick) {
Member

Sorry, I did not realize that letting the major compaction run is the existing behavior. I'm not sure if you want to change the existing behavior here when enabling your feature. May be a good follow up.

Contributor Author

@hx235 hx235 Jun 10, 2025

Letting the major compaction run (but not letting it be picked) is the existing behavior. Either way is fine with me. Reverted my change and added an extra sync point to test that.

@hx235 hx235 force-pushed the scheduling_work_no_file_lock_wait_compact branch from c4cc016 to ba17d42 Compare June 10, 2025 05:24
@facebook-github-bot
Contributor

@hx235 has updated the pull request. You must reimport the pull request before landing.

@hx235 hx235 requested a review from cbi42 June 10, 2025 14:25
@hx235 hx235 changed the title Forward intended compaction and (Re-)pick universal compaction based on thread priority Forward intended compaction and repick universal compaction based on thread priority Jun 11, 2025
@hx235 hx235 changed the title Forward intended compaction and repick universal compaction based on thread priority Reduce universal compaction input lock time with forwarding intended compaction Jun 11, 2025
@hx235 hx235 changed the title Reduce universal compaction input lock time with forwarding intended compaction Reduce universal compaction input lock time by forwarding intended compaction and re-picking Jun 11, 2025
// of input files when bottom priority compactions are waiting to run. This
// can increase the likelihood of existing L0s being selected for compaction,
// thereby reducing write stalls and read regressions. It may increase
// the overall write amplification and compaction load on low priority
Member

I wonder if we can db_bench this. With a high write rate, we can measure the write-amp and write stall rate with this option on and off.

Contributor Author

Will try that! Thanks!

Member

@cbi42 cbi42 left a comment

LGTM!

@facebook-github-bot
Contributor

@hx235 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@hx235 hx235 force-pushed the scheduling_work_no_file_lock_wait_compact branch from ba17d42 to 2465974 Compare June 12, 2025 16:41
@facebook-github-bot
Contributor

@hx235 has updated the pull request. You must reimport the pull request before landing.

@facebook-github-bot
Contributor

@hx235 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@hx235 hx235 force-pushed the scheduling_work_no_file_lock_wait_compact branch from 2465974 to 08e9451 Compare June 13, 2025 00:14
@facebook-github-bot
Contributor

@hx235 has updated the pull request. You must reimport the pull request before landing.

@facebook-github-bot
Contributor

@hx235 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@facebook-github-bot
Contributor

@hx235 merged this pull request in 02bce9b.

SoujanyaPonnapalli pushed a commit to tyler-griggs/rocksdb that referenced this pull request Jul 20, 2025
…mpaction and re-picking (facebook#13633)

Pull Request resolved: facebook#13633

Reviewed By: cbi42

Differential Revision: D76005505

Pulled By: hx235

fbshipit-source-id: 9688f22d4a84f619452820f12f15b765c17301fd