Adityakk9031

📌 Summary

This PR fixes a heap-use-after-free detected by ASAN in yb::YBThreadPool::Impl::NotifyWorker, where a Worker could be freed while another thread was concurrently accessing it through waiting_workers.Pop().

🐞 Root Cause

Multiple threads could concurrently pop from waiting_workers.

A Worker in IdleStop state could be erased and deleted while another thread was still reading it.

This resulted in nondeterministic crashes under ASAN and caused the following test to fail:

CDCSDKConsumptionConsistentChangesTest.TestLSNDeterminismWithSpecialRecordOnRestartWithPartialAck

🔧 Fix

Added a waiting_workers_active_pops counter in ThreadPoolShare to track in-flight Pop() operations.

Deferred deletion of Worker objects if pops are active, using a new deferred_deletes_ list.

Ensured Shutdown() waits for all pops to finish before freeing deferred workers.
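
The three steps above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the code in this PR: WorkerReclaimer, PopScope, Retire, DeferredCount and destroyed_workers are hypothetical names introduced for the example, and only the counter and the deferred-delete list correspond to the waiting_workers_active_pops and deferred_deletes_ members named above. It assumes a worker has already been unlinked from the waiting queue before it is retired.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <memory>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical: counts destroyed workers so the example can observe deletion.
std::atomic<int> destroyed_workers{0};

// Hypothetical stand-in for the thread pool's Worker.
struct Worker {
  ~Worker() { destroyed_workers.fetch_add(1); }
};

// Hypothetical helper combining the counter and the deferred-delete list.
class WorkerReclaimer {
 public:
  // RAII marker for one in-flight Pop(): one increment and one decrement,
  // i.e. two atomic operations per notification.
  class PopScope {
   public:
    explicit PopScope(WorkerReclaimer& owner) : owner_(owner) {
      owner_.active_pops_.fetch_add(1);
    }
    ~PopScope() { owner_.active_pops_.fetch_sub(1); }
    PopScope(const PopScope&) = delete;
    PopScope& operator=(const PopScope&) = delete;

   private:
    WorkerReclaimer& owner_;
  };

  // Assumes `worker` is already unlinked from the waiting queue.
  // If any pop is in flight, park the worker instead of freeing it.
  void Retire(std::unique_ptr<Worker> worker) {
    std::lock_guard<std::mutex> lock(mutex_);
    if (active_pops_.load() != 0) {
      deferred_deletes_.push_back(std::move(worker));
    }
    // Otherwise `worker` is destroyed when this function returns.
  }

  // Waits for in-flight pops to drain, then frees every parked worker.
  void Shutdown() {
    while (active_pops_.load() != 0) {
      std::this_thread::yield();
    }
    std::lock_guard<std::mutex> lock(mutex_);
    deferred_deletes_.clear();
  }

  std::size_t DeferredCount() {
    std::lock_guard<std::mutex> lock(mutex_);
    return deferred_deletes_.size();
  }

 private:
  std::atomic<int> active_pops_{0};
  std::mutex mutex_;
  std::vector<std::unique_ptr<Worker>> deferred_deletes_;
};
```

In this sketch a notifier would hold a PopScope for the duration of its Pop() call, so a worker retired in the meantime stays alive until Shutdown() flushes the list. The sketch uses default sequentially consistent atomics for simplicity; the actual change may use weaker orderings.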

✅ Validation

Ran ybd asan --cxx-test integration-tests_cdcsdk_consumption_consistent_changes-test: the heap-use-after-free no longer reproduces.

Verified that normal enqueue/dequeue behavior is unaffected.

Checked that no leaks remain (all deferred deletes are flushed in shutdown).

📊 Impact

Fixes flaky ASAN test failures in the DocDB thread pool.

Minimal overhead: adds two atomic ops per worker notification.

No API changes.

🔗 References

Fixes: #28297

Jira: DB-17979


CLAassistant commented Aug 17, 2025

CLA assistant check
All committers have signed the CLA.

@rthallamko3 rthallamko3 requested a review from spolitov October 9, 2025 15:32

Development

Successfully merging this pull request may close these issues.

[DocDB] Fix heap-use-after-free in yb::YBThreadPool
