Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make the caching memory allocator lock-free #46658

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

fwyzard
Copy link
Contributor

@fwyzard fwyzard commented Nov 11, 2024

PR description:

Move the live block descriptors to the alpaka buffers: instead of tracking the descriptors for the live blocks in a global map, store each block's descriptor within the deleter of the block itself.

Reimplement the free list as an std::vector of tbb::concurrent_queue objects, one per
bin of the caching allocator.
Since a block may be in the free list but not actually be available for reuse, blocks are popped from the queue until an available block is found and reused, or the queue is empty and new block is requested. Then all popped blocks are pushed back to the queue.

Use atomic operations for the individual statistics.
Access to the whole set of statistics may not be fully consistent, but they should only be used for debugging or monitoring.

PR validation:

Validated on top of CMSSW 14.0.x with the 2024 HLT menu.

@cmsbuild
Copy link
Contributor

cmsbuild commented Nov 11, 2024

cms-bot internal usage

@fwyzard
Copy link
Contributor Author

fwyzard commented Nov 11, 2024

With CMSSW_14_0_15_patch1, running the HLT with 128 threads we see:
vtune_threading-notaus_hlt-reference-x86-64-v3-1x128t128s

With these changes, the contention moves to the CUDA mutex (to be addressed by further improvements):
vtune_threading-notaus_hlt-lockfree-x86-64-v3-1x128t128s

@cmsbuild
Copy link
Contributor

-code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-46658/42578

Code check has found code style and quality issues which could be resolved by applying following patch(s)

Move the live block descriptors to the alpaka buffers: instead of tracking the
descriptors for the live blocks in a global map, store each block's descriptor
within the deleter of the block itself.

Reimplement the free list as a vector of tbb::concurrent_queue objects, one per
bin of the caching allocator.
Since a block may be in the free list but not actually be available for reuse,
blocks are popped from the queue until an available block is found and reused,
or the queue is empty and new block is requested. Then all popped blocks are
pushed back to the queue.

Use atomic operations for the individual statistics.
Access to the whole set of statistics may not be fully consistent, but they
should only be used for debugging or monitoring.
@cmsbuild
Copy link
Contributor

@cmsbuild
Copy link
Contributor

A new Pull Request was created by @fwyzard for master.

It involves the following packages:

  • HeterogeneousCore/AlpakaInterface (heterogeneous)

@cmsbuild, @fwyzard, @makortel can you please review it and eventually sign? Thanks.
@makortel, @missirol, @rovere this is something you requested to watch as well.
@antoniovilela, @mandrenguyen, @rappoccio, @sextonkennedy you are the release manager for this.

cms-bot commands are listed here

@fwyzard
Copy link
Contributor Author

fwyzard commented Nov 11, 2024

@makortel, I split the implementation of the CachingAllocator into header and source file to ease the development. In principle I can move back to a header-only implementation, if that's better ?

@fwyzard
Copy link
Contributor Author

fwyzard commented Nov 11, 2024

@makortel I would like to make some other clean up in the code, but first I wanted to see if you have any comments on this implementation.

Then I can push the follow up changes here, or in a separate PR.

@fwyzard
Copy link
Contributor Author

fwyzard commented Nov 11, 2024

enable gpu

@fwyzard
Copy link
Contributor Author

fwyzard commented Nov 11, 2024

please test

@fwyzard
Copy link
Contributor Author

fwyzard commented Nov 11, 2024

@Dr15Jones I would appreciate any feedback from you as well :-)

@cmsbuild
Copy link
Contributor

-1

Failed Tests: Build
Size: This PR adds an extra 28KB to repository
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-49dcac/42726/summary.html
COMMIT: 3dd76b4
CMSSW: CMSSW_14_2_X_2024-11-11-1100/el8_amd64_gcc12
Additional Tests: GPU
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week1/cms-sw/cmssw/46658/42726/install.sh to create a dev area with all the needed externals and cmssw changes.

Build

I found compilation error when building:

In file included from src/HeterogeneousCore/AlpakaInterface/interface/CachedBufAlloc.h:6,
                 from src/HeterogeneousCore/AlpakaInterface/interface/memory.h:9,
                 from src/DataFormats/Portable/interface/PortableHostCollection.h:11,
                 from src/DataFormats/Portable/interface/PortableCollection.h:6,
                 from src/DataFormats/Portable/test/test_catch2_portableCollectionOnHost.cc:3:
src/HeterogeneousCore/AlpakaInterface/interface/CachingAllocator.h:11:10: fatal error: tbb/concurrent_queue.h: No such file or directory
   11 | #include 
      |          ^~~~~~~~~~~~~~~~~~~~~~~~
compilation terminated.
In file included from src/HeterogeneousCore/AlpakaInterface/interface/CachedBufAlloc.h:6,
                 from src/HeterogeneousCore/AlpakaInterface/interface/memory.h:9,


@fwyzard
Copy link
Contributor Author

fwyzard commented Nov 11, 2024

please test with #46657

@cmsbuild
Copy link
Contributor

+1

Size: This PR adds an extra 12KB to repository
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-49dcac/42740/summary.html
COMMIT: 3dd76b4
CMSSW: CMSSW_14_2_X_2024-11-11-1100/el8_amd64_gcc12
Additional Tests: GPU
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week1/cms-sw/cmssw/46658/42740/install.sh to create a dev area with all the needed externals and cmssw changes.

Comparison Summary

Summary:

  • No significant changes to the logs found
  • Reco comparison results: 6 differences found in the comparisons
  • DQMHistoTests: Total files compared: 46
  • DQMHistoTests: Total histograms compared: 3343588
  • DQMHistoTests: Total failures: 419
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 3343149
  • DQMHistoTests: Total skipped: 20
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 45 files compared)
  • Checked 202 log files, 172 edm output root files, 46 DQM output files
  • TriggerResults: no differences found

GPU Comparison Summary

Summary:

  • No significant changes to the logs found
  • Reco comparison results: 0 differences found in the comparisons
  • DQMHistoTests: Total files compared: 7
  • DQMHistoTests: Total histograms compared: 53031
  • DQMHistoTests: Total failures: 74
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 52957
  • DQMHistoTests: Total skipped: 0
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 6 files compared)
  • Checked 24 log files, 30 edm output root files, 7 DQM output files
  • TriggerResults: no differences found

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants