ipc: unit_test + TSAN: TSAN lock count limit reached in some tests: some run OK separately; 1 never runs OK; make the latter run OK; investigate further. #89

Open
ygoldfeld opened this issue Mar 12, 2024 · 0 comments
Labels
bug: Something isn't working
from-akamai-pre-open: Issue origin is Akamai, before opening source
test: Unit and functional tests; demo/example programs

ygoldfeld commented Mar 12, 2024

Filed by @ygoldfeld pre-open-source:

The current situation is as follows:

General description: Whether run locally with my clang-17 or in the GitHub pipeline with clang-15/16/17, some tests in some situations reliably hit a specific point within the test, at which the console prints:

ThreadSanitizer: CHECK failed: sanitizer_deadlock_detector.h:67 "((n_all_locks_)) < (((sizeof(all_locks_with_contexts_)/sizeof((all_locks_with_contexts_)[0]))))" (0x40, 0x40) (tid=74526)

and the test hangs forever right there. To be clear, this is not a normal TSAN warning about a race or the like; rather, TSAN's own instrumentation code hits a problem and refuses to proceed further. Going by the text of the message, some sort of limit of 64 "locks with contexts" is reached (both values in the CHECK are 0x40, i.e. 64), and TSAN blows up. (No further analysis has been done on that, but read on.)

One test always hits this problem, even when run entirely by itself: Jemalloc_shm_pool_collection_test.Multiprocess. Hence it is currently skipped explicitly in the pipeline, using the gtest command-line filter (--gtest_filter), which can exclude individual tests.

The other problematic tests are the following; that is, a run that includes all the other tests hits the problem unless every one of these is excluded:

LOCK_HEAVY_TESTS='Shm_session_test.External_process_array:\
                  Shm_session_test.External_process_vector_offset_ptr:\
                  Shm_session_test.External_process_string_offset_ptr:\
                  Shm_session_test.External_process_list_offset_ptr:\
                  Shm_session_test.Multisession_external_process:\
                  Shm_session_test.Disconnected_external_process:\
                  Borrower_shm_pool_collection_test.Multiprocess:\
                  Shm_pool_collection_test.Multiprocess'

Happily, though, they run just fine as a group by themselves -- just not as part of a run that also includes all the many other tests. Therefore, to avoid hitting the limitation, I have changed the pipeline to the following:

  • Run all tests minus the above LOCK_HEAVY_TESTS and the 1 never-works test. That's unit_test invocation 1.
  • Run all the LOCK_HEAVY_TESTS. That's unit_test invocation 2.
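The two invocations above can be sketched roughly as follows. This is an illustration, not the actual pipeline code: the UNIT_TEST_BIN path and the FILTER_1/FILTER_2/NEVER_OK_TEST names are made up for the example, and the test list is built with a loop so that the joined string contains no stray whitespace. The only gtest feature relied on is --gtest_filter, where ':' separates patterns and everything after '-' is excluded.

```shell
#!/bin/sh
# Sketch only. UNIT_TEST_BIN is a hypothetical path; point it at the real unit_test binary.
UNIT_TEST_BIN="${UNIT_TEST_BIN:-./unit_test}"

# The one test that never completes under TSAN, even when run alone.
NEVER_OK_TEST='Jemalloc_shm_pool_collection_test.Multiprocess'

# Join the lock-heavy test names with ':' (the gtest filter separator).
LOCK_HEAVY_TESTS=''
for t in \
    Shm_session_test.External_process_array \
    Shm_session_test.External_process_vector_offset_ptr \
    Shm_session_test.External_process_string_offset_ptr \
    Shm_session_test.External_process_list_offset_ptr \
    Shm_session_test.Multisession_external_process \
    Shm_session_test.Disconnected_external_process \
    Borrower_shm_pool_collection_test.Multiprocess \
    Shm_pool_collection_test.Multiprocess
do
    LOCK_HEAVY_TESTS="${LOCK_HEAVY_TESTS:+${LOCK_HEAVY_TESTS}:}${t}"
done

# Invocation 1: everything ('*') minus ('-') the lock-heavy tests and the never-OK test.
FILTER_1="*-${LOCK_HEAVY_TESTS}:${NEVER_OK_TEST}"
# Invocation 2: only the lock-heavy tests.
FILTER_2="${LOCK_HEAVY_TESTS}"

if [ -f "${UNIT_TEST_BIN}" ] && [ -x "${UNIT_TEST_BIN}" ]; then
    "${UNIT_TEST_BIN}" --gtest_filter="${FILTER_1}" &&
        "${UNIT_TEST_BIN}" --gtest_filter="${FILTER_2}"
else
    # No binary at the assumed path: just show what would be passed.
    echo "unit_test binary not found at ${UNIT_TEST_BIN}; filters would be:"
    echo "  invocation 1: ${FILTER_1}"
    echo "  invocation 2: ${FILTER_2}"
fi
```

Quoting the filter matters: unquoted, the leading '*' in invocation 1 could be expanded by the shell before gtest ever sees it.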

It is not ideal, but it does give good TSAN coverage, which lowers the priority of this ticket. The priority rises somewhat, however, because Jemalloc_shm_pool_collection_test.Multiprocess cannot complete even by itself.

As for what to do -- just ideas:

  • First, see if Jemalloc_shm_pool_collection_test.Multiprocess can be rejiggered somewhat just to avoid the problem/hang. This would get us to full coverage, not skipping anything.
  • Next, look into the problem itself.
    • Here it might be a good idea to go straight into the LLVM code (where TSAN is maintained) -- I have done so for some other topics with reasonable success -- and see what this limit actually is. Check the documentation too, in case something comes up: https://github.com/google/sanitizers/wiki/ThreadSanitizerCppManual and the pages around it, unless there is a newer version somewhere in the LLVM GitHub; either way, not the very short clang summary docs at https://clang.llvm.org/docs/ThreadSanitizer.html.
    • Contacting the devs (which they encourage in the above wiki) may well be helpful, if the above does not help.
    • Be on the lookout for any pathologically evil stuff we might be doing in unit_test or the real code; maybe we do not clean up some locks or threads properly? (Honestly I doubt it. Do we start threads, yes, but we join them early and often too. Maybe ultimately it really is just a TSAN limitation, and there's nothing we can really do.)
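One cheap experiment that fits the "look into the problem itself" item: the failing CHECK is in sanitizer_deadlock_detector.h, and the sanitizer runtime has a detect_deadlocks option that can be set through TSAN_OPTIONS. The sketch below assumes, without having verified it against this test, that switching the deadlock detector off keeps the run from reaching that CHECK. UNIT_TEST_BIN is again a hypothetical path.

```shell
#!/bin/sh
# Diagnostic sketch only; not verified against the real test.
# UNIT_TEST_BIN is a hypothetical path; point it at the real unit_test binary.
UNIT_TEST_BIN="${UNIT_TEST_BIN:-./unit_test}"
NEVER_OK_TEST='Jemalloc_shm_pool_collection_test.Multiprocess'

# Append detect_deadlocks=0 to whatever TSAN_OPTIONS is already set.
TSAN_OPTIONS="${TSAN_OPTIONS:+${TSAN_OPTIONS}:}detect_deadlocks=0"
export TSAN_OPTIONS

if [ -f "${UNIT_TEST_BIN}" ] && [ -x "${UNIT_TEST_BIN}" ]; then
    # Run only the never-OK test; child processes inherit TSAN_OPTIONS.
    "${UNIT_TEST_BIN}" --gtest_filter="${NEVER_OK_TEST}"
else
    echo "would run: TSAN_OPTIONS=${TSAN_OPTIONS} ${UNIT_TEST_BIN} --gtest_filter=${NEVER_OK_TEST}"
fi
```

If the test then completes, that would point at the deadlock detector's bookkeeping rather than at race detection. Data-race detection stays on with this setting, but lock-order-inversion reports are lost for that run, so this is a diagnostic step rather than a fix.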

It is worth looking into, but it is not a hair-on-fire problem. We can skip one test w/r/t TSAN and survive.

@ygoldfeld ygoldfeld added bug Something isn't working from-akamai-pre-open Issue origin is Akamai, before opening source test Unit and functional tests; demo/example programs labels Mar 12, 2024