[DocDB] Fix heap-use-after-free in yb::YBThreadPool #28299
📌 Summary
This PR fixes a heap-use-after-free detected by ASAN in yb::YBThreadPool::Impl::NotifyWorker, where a Worker could be freed while another thread was concurrently accessing it through waiting_workers.Pop().
🐞 Root Cause
Multiple threads could concurrently pop from waiting_workers.
A Worker in IdleStop state could be erased and deleted while another thread was still reading it.
This resulted in nondeterministic crashes under ASAN and caused the following test to fail:
CDCSDKConsumptionConsistentChangesTest.TestLSNDeterminismWithSpecialRecordOnRestartWithPartialAck
🔧 Fix
Added a waiting_workers_active_pops counter in ThreadPoolShare to track in-flight Pop() operations.
Deferred deletion of Worker objects while any pop is active, using a new deferred_deletes_ list.
Ensured Shutdown() waits for all pops to finish before freeing deferred workers.
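For reviewers who want the shape of the fix at a glance, the three steps above can be sketched roughly as follows. This is a simplified, hypothetical illustration and not the code in this PR: `DeferredReclaimer`, `PopGuard`, `Retire`, `FlushDeferred`, and the stand-in `Worker` struct are invented names, a mutex guards the deferred list for simplicity, and the sketch assumes a worker is unlinked from `waiting_workers` before it is retired. Only `waiting_workers_active_pops` and `deferred_deletes_` correspond to names from the description.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <memory>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical stand-in for the pool's Worker type.
struct Worker {
  int id;
};

// Hypothetical helper: tracks in-flight pops and defers frees while any exist.
class DeferredReclaimer {
 public:
  // RAII marker for one in-flight Pop(): one atomic increment on entry and
  // one atomic decrement on exit.
  class PopGuard {
   public:
    explicit PopGuard(DeferredReclaimer& r) : r_(r) {
      r_.waiting_workers_active_pops_.fetch_add(1);
    }
    ~PopGuard() { r_.waiting_workers_active_pops_.fetch_sub(1); }
    PopGuard(const PopGuard&) = delete;
    PopGuard& operator=(const PopGuard&) = delete;

   private:
    DeferredReclaimer& r_;
  };

  // Assumes the worker was already unlinked from every shared structure.
  // Returns true if it was freed immediately, false if the free was deferred.
  bool Retire(std::unique_ptr<Worker> worker) {
    std::lock_guard<std::mutex> lock(mutex_);
    if (waiting_workers_active_pops_.load() == 0) {
      return true;  // No pop in flight: worker is destroyed on return.
    }
    deferred_deletes_.push_back(std::move(worker));
    return false;
  }

  // Frees deferred workers if no pop is in flight; returns how many were freed.
  std::size_t FlushDeferred() {
    std::lock_guard<std::mutex> lock(mutex_);
    if (waiting_workers_active_pops_.load() != 0) {
      return 0;
    }
    const std::size_t freed = deferred_deletes_.size();
    deferred_deletes_.clear();
    return freed;
  }

  // Waits for all in-flight pops to finish, then frees everything deferred.
  void Shutdown() {
    while (waiting_workers_active_pops_.load() != 0) {
      std::this_thread::yield();
    }
    std::lock_guard<std::mutex> lock(mutex_);
    deferred_deletes_.clear();
  }

  std::size_t deferred_count() {
    std::lock_guard<std::mutex> lock(mutex_);
    return deferred_deletes_.size();
  }

 private:
  std::atomic<std::size_t> waiting_workers_active_pops_{0};
  std::mutex mutex_;
  std::vector<std::unique_ptr<Worker>> deferred_deletes_;
};
```

In this sketch a caller would hold a `PopGuard` for the duration of each pop, and the thread that erases an idle-stopped worker would hand it to `Retire` instead of deleting it directly. The sketch uses the default sequentially consistent atomics to keep the reasoning simple; the actual change may use different ordering or a different container.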
✅ Validation
Ran ybd asan --cxx-test integration-tests_cdcsdk_consumption_consistent_changes-test and observed no further heap-use-after-free crashes.
Verified that normal enqueue/dequeue behavior is unaffected.
Checked that no leaks remain (all deferred deletes are flushed in shutdown).
📊 Impact
Fixes flaky ASAN test failures in the DocDB thread pool.
Minimal overhead: adds two atomic ops per worker notification.
No API changes.
🔗 References
Fixes: #28297
Jira: DB-17979