Reduce universal compaction input lock time by forwarding intended compaction and re-picking #13633
Conversation
Force-pushed from 6e2e976 to c23b195
Force-pushed from c23b195 to 2ba479a
Before:
  } else if (!is_prepicked && !compaction_queue_.empty()) {
    if (HasExclusiveManualCompaction()) {
After:
  } else if (ShouldPickCompaction(is_prepicked, prepicked_compaction)) {
    if (!is_prepicked && HasExclusiveManualCompaction()) {
Should we also respect exclusive manual compaction here even for the major compaction intent?
To be fixed
Fixed. There is something subtle: we have to fully give up this prepicked compaction and let low-pri naturally prepick it again, instead of keeping the prepicked compaction (i.e., bg_bottom_compaction_scheduled_++), which would lead to a deadlock. This is because when an exclusive manual compaction waits for background compactions to finish after registering its compaction in the manual queue, it expects the number of scheduled background compactions to eventually reach 0.
rocksdb/db/db_impl/db_impl_compaction_flush.cc
Lines 2075 to 2082 in eaa4f9d
AddManualCompaction(&manual);
TEST_SYNC_POINT_CALLBACK("DBImpl::RunManualCompaction:NotScheduled", &mutex_);
if (exclusive) {
  // Limitation: there's no way to wake up the below loop when user sets
  // `*manual.canceled`. So `CompactRangeOptions::exclusive_manual_compaction`
  // and `CompactRangeOptions::canceled` might not work well together.
  while (bg_bottom_compaction_scheduled_ > 0 ||
         bg_compaction_scheduled_ > 0) {
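To illustrate the reasoning above, here is a small self-contained toy model (not RocksDB code; all names are illustrative) of why keeping the prepicked compaction scheduled would deadlock the wait loop, while giving it up lets the counters drain to zero:

// Toy model: the exclusive manual compaction waits for both scheduled
// counters to reach 0. If the bottom-pri worker kept its prepicked job
// scheduled while yielding, that wait could never complete; giving the job
// up entirely (and letting low-pri re-pick later) avoids the deadlock.
#include <condition_variable>
#include <iostream>
#include <mutex>
#include <thread>

std::mutex mu;
std::condition_variable cv;
int bg_compaction_scheduled = 0;
int bg_bottom_compaction_scheduled = 0;
bool exclusive_manual_pending = true;

void BottomPriWorker() {
  std::unique_lock<std::mutex> lk(mu);
  if (exclusive_manual_pending) {
    // Give up the prepicked compaction: do NOT increment
    // bg_bottom_compaction_scheduled; low-pri will re-pick it later.
    std::cout << "bottom-pri: yielding to exclusive manual compaction\n";
    cv.notify_all();
    return;
  }
  ++bg_bottom_compaction_scheduled;
  // ... run the compaction ...
  --bg_bottom_compaction_scheduled;
  cv.notify_all();
}

void ExclusiveManualCompaction() {
  std::unique_lock<std::mutex> lk(mu);
  // Mirrors the wait loop quoted above: both counters must drain to zero.
  cv.wait(lk, [] {
    return bg_bottom_compaction_scheduled == 0 && bg_compaction_scheduled == 0;
  });
  std::cout << "manual: running exclusively\n";
  exclusive_manual_pending = false;
}

int main() {
  std::thread manual(ExclusiveManualCompaction);
  std::thread bottom(BottomPriWorker);
  bottom.join();
  manual.join();
  return 0;
}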
Force-pushed from 2ba479a to 809f78e
Force-pushed from 32c7fb1 to c4cc016
@hx235 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
EnqueuePendingCompaction(cfd);
MaybeScheduleFlushOrCompaction();
@cbi42 - I replaced AddToCompactionQueue with EnqueuePendingCompaction here and in other places in the file in order to respect the precondition before adding. It's especially important here because, by the time bottom-pri reaches this point, low-pri may have already enqueued the cfd, since bottom-pri references the cfd from the CompactionArg rather than taking it from the queue. (A toy sketch of the precondition follows.)
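A minimal, self-contained sketch (not the actual RocksDB implementation; types and names simplified) of the precondition being described: only enqueue a cfd that is not already queued and still needs compaction.

// Toy model of the precondition (not RocksDB code): a cfd is enqueued only if
// it is not already queued and still needs compaction, so a bottom-pri caller
// cannot double-add a cfd that low-pri has already put on the queue.
#include <deque>

struct ToyCfd {
  bool queued_for_compaction = false;
  bool needs_compaction = true;
};

static std::deque<ToyCfd*> compaction_queue;
static int unscheduled_compactions = 0;

void EnqueuePendingCompaction(ToyCfd* cfd) {
  if (!cfd->queued_for_compaction && cfd->needs_compaction) {
    cfd->queued_for_compaction = true;
    compaction_queue.push_back(cfd);
    ++unscheduled_compactions;
  }
  // Otherwise: already queued (or nothing to do), so adding again would
  // violate the queue's invariant of at most one entry per cfd.
}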
Looks great! Left some comments on details.
if (HasExclusiveManualCompaction()) {
  // Can't compact right now, but try again later
  TEST_SYNC_POINT("DBImpl::BackgroundCompaction()::Conflict");

  // Stay in the compaction queue.
  unscheduled_compactions_++;
  if (need_repick) {
Sorry, I did not realize that letting the major compaction run is the existing behavior. I'm not sure if you want to change the existing behavior here when enabling your feature. Maybe a good follow-up.
Letting the major compaction run (but not letting it be picked) is the existing behavior. Either way is fine with me. Reverted my change and added an extra sync point to test that.
Force-pushed from c4cc016 to ba17d42
@hx235 has updated the pull request. You must reimport the pull request before landing.
// of input files when bottom priority compactions are waiting to run. This
// can increase the likelihood of existing L0s being selected for compaction,
// thereby improving write stall and reducing read regression. It may increase
// the overall write amplification and compaction load on low priority
I wonder if we can db_bench this. With a high write rate, we can measure the write-amp and write stall rate with this option on and off.
Will try that! Thanks!
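As a usage reference for the option documented in the hunk above, here is a hypothetical sketch. The field name and location (compaction_options_universal.reduce_file_locking) are inferred from the --universal_reduce_file_locking db_bench flag in the test plan below and may differ from the released header.

// Hypothetical usage sketch; the reduce_file_locking field is an assumption
// inferred from db_bench's --universal_reduce_file_locking flag.
#include <string>
#include "rocksdb/db.h"
#include "rocksdb/options.h"

rocksdb::DB* OpenWithReducedFileLocking(const std::string& path) {
  rocksdb::Options options;
  options.create_if_missing = true;
  options.compaction_style = rocksdb::kCompactionStyleUniversal;
  // Assumed option name; guards the forward-intent + re-pick behavior.
  options.compaction_options_universal.reduce_file_locking = true;
  rocksdb::DB* db = nullptr;
  rocksdb::Status s = rocksdb::DB::Open(options, path, &db);
  return s.ok() ? db : nullptr;
}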
LGTM!
@hx235 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
Force-pushed from ba17d42 to 2465974
@hx235 has updated the pull request. You must reimport the pull request before landing.
@hx235 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
Force-pushed from 2465974 to 08e9451
@hx235 has updated the pull request. You must reimport the pull request before landing.
@hx235 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
…mpaction and re-picking (facebook#13633)

Summary:

Context:
RocksDB currently selects the input files for a long-running compaction that outputs to the bottommost level, preventing those files from being selected by other compactions, but it does not execute the compaction immediately like other compactions. Instead, the compaction is forwarded to the Env::Priority::BOTTOM thread pool, where it waits (potentially for a long time) until a thread is ready to execute it. This extended L0 lock time in universal compaction caused write stalls and read performance regressions for our users.

Summary:
This PR eliminates L0 lock time while a bottom-priority compaction waits to execute, by the following:
- Create and forward an intended compaction consisting only of the last input file (or sorted run if non-L0) instead of all the input files. This eliminates the locking of non-bottommost-level input files while waiting for a bottom-priority thread to become available.
- Re-pick a compaction that outputs to the max output level when the bottom-priority thread is ready to run.
- Refactor the universal compaction picking logic to make it cleaner and easier to force picking a compaction with the max output level when the bottom-priority thread is ready to run.
- Guard the feature behind a temporary option, as requested.

Pull Request resolved: facebook#13633

Test Plan:
- New unit test to cover a case not covered by existing tests: the bottom-priority thread re-picks a compaction and ends up picking nothing due to LSM shape changes.
- Adapted existing unit tests to verify various bottom-priority compaction behaviors with this new option.
- Stress test: python3 tools/db_crashtest.py --simple blackbox --compaction_style=1 --target_file_size_base=1000 --write_buffer_size=1000 --compact_range_one_in=10000 --compact_files_one_in=10000

Reviewed By: cbi42
Differential Revision: D76005505
Pulled By: hx235
fbshipit-source-id: 9688f22d4a84f619452820f12f15b765c17301fd
Context:
RocksDB currently selects the input files for a long-running compaction that outputs to the bottommost level, preventing those files from being selected by other compactions, but it does not execute the compaction immediately like other compactions. Instead, the compaction is forwarded to the Env::Priority::BOTTOM thread pool, where it waits (potentially for a long time) until a thread is ready to execute it. This extended L0 lock time in universal compaction caused write stalls and read performance regressions for our users.

Summary:
This PR eliminates L0 lock time while a bottom-priority compaction waits to execute, by the following (a toy sketch follows the list):
- Create and forward an intended compaction consisting only of the last input file (or sorted run if non-L0) instead of all the input files.
- Re-pick a compaction that outputs to the max output level when the bottom-priority thread is ready to run.
- Refactor the universal compaction picking logic to make forcing a max-output-level pick easier.
- Guard the feature behind a temporary option.
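Here is a small toy model (not RocksDB code; all types and names are simplified) of the forward-and-re-pick flow: the low-pri picker pins only the last input sorted run in the forwarded intent, and the bottom-pri worker re-picks against the current LSM shape when it finally runs, possibly picking nothing if the shape has changed.

// Toy model (not RocksDB code) of the forward-and-re-pick flow.
#include <vector>

struct SortedRun {
  int id;
  bool being_compacted = false;
};

struct CompactionIntent {
  int pinned_last_run_id;  // only this run stays marked as being compacted
};

// Low-pri thread: instead of marking all intended inputs as being compacted
// while the job waits for a bottom-pri thread, pin only the last sorted run
// (assumes `runs` is non-empty).
CompactionIntent ForwardIntent(std::vector<SortedRun>& runs) {
  runs.back().being_compacted = true;
  return CompactionIntent{runs.back().id};
}

// Bottom-pri thread: re-pick inputs against the current LSM shape. Runs that
// other jobs grabbed in the meantime are skipped; if nothing besides the
// pinned run is available, give up (the pinned run is then released and
// low-pri can pick again later).
std::vector<int> RePick(const std::vector<SortedRun>& runs,
                        const CompactionIntent& intent) {
  std::vector<int> inputs;
  for (const SortedRun& r : runs) {
    if (!r.being_compacted || r.id == intent.pinned_last_run_id) {
      inputs.push_back(r.id);
    }
  }
  if (inputs.size() <= 1) {
    inputs.clear();  // LSM shape changed underneath; nothing worth compacting
  }
  return inputs;
}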
Test plan:
python3 tools/db_crashtest.py --simple blackbox --compaction_style=1 --target_file_size_base=1000 --write_buffer_size=1000 --compact_range_one_in=10000 --compact_files_one_in=10000
./db_bench --use_existing_db=1 --db=/tmp/dbbench_copy_2 --universal_reduce_file_locking={0,1} --max_background_jobs=3 --disable_auto_compactions=0 --enable_pipelined_write=0 --benchmarks=overwrite,waitforcompaction --num=100000 --compaction_style=1 --disable_wal=1 --write_buffer_size=50 --target_file_size_base=50 --num_high_pri_threads=1 --num_low_pri_threads=1 --num_bottom_pri_threads=1 --statistics=1
Compared rocksdb.stall.micros, rocksdb.compact.write.bytes, and rocksdb.bytes.written with the option on and off. With sleep injected in BackgroundCallCompaction(), there is a huge difference in rocksdb.stall.micros but no concerning write amplification increase. Similar effects can be observed for 1 or 5 seconds of injected sleep.
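For reference, a minimal sketch (not part of the PR) of how the write-amplification comparison can be derived from the tickers above via the public Statistics API; the write-amp formula here is one common definition and may differ from the exact metric used in the benchmark.

// Sketch: read the tickers listed above and compute a write-amp figure so
// the option-on and option-off runs can be compared directly.
#include <cstdint>
#include <iostream>
#include <memory>
#include "rocksdb/statistics.h"

void ReportCompactionStats(const std::shared_ptr<rocksdb::Statistics>& stats) {
  const uint64_t stall_micros =
      stats->getTickerCount(rocksdb::STALL_MICROS);         // rocksdb.stall.micros
  const uint64_t compact_write =
      stats->getTickerCount(rocksdb::COMPACT_WRITE_BYTES);  // rocksdb.compact.write.bytes
  const uint64_t user_write =
      stats->getTickerCount(rocksdb::BYTES_WRITTEN);        // rocksdb.bytes.written
  const double write_amp =
      user_write == 0
          ? 0.0
          : static_cast<double>(compact_write + user_write) / user_write;
  std::cout << "stall micros: " << stall_micros
            << ", write amp: " << write_amp << std::endl;
}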