Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Moved Egamma TensorFlow sessions to global cache #46655

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

valsdav
Copy link
Contributor

@valsdav valsdav commented Nov 11, 2024

PR description:

Solves #46494. Related to #46040

EgammaDNNHelper was storing TF graphs globally by job, but TF sessions were owned my the GSFElectronProducer and GEDPhotonProducers by stream.
This PR moves the TFSessions in the EgammaDNNHelper, making them global by job.

PR validation:

Purely technical PR, no changes expected.

@valsdav
Copy link
Contributor Author

valsdav commented Nov 11, 2024

assign ml

@cmsbuild
Copy link
Contributor

cmsbuild commented Nov 11, 2024

cms-bot internal usage

@cmsbuild
Copy link
Contributor

@cmsbuild
Copy link
Contributor

New categories assigned: ml

@valsdav,@y19y19 you have been requested to review this Pull request/Issue and eventually sign? Thanks

@cmsbuild
Copy link
Contributor

A new Pull Request was created by @valsdav for master.

It involves the following packages:

  • RecoEgamma/EgammaElectronProducers (reconstruction)
  • RecoEgamma/EgammaPhotonProducers (reconstruction)
  • RecoEgamma/EgammaTools (reconstruction)
  • RecoEgamma/ElectronIdentification (reconstruction)
  • RecoEgamma/PhotonIdentification (reconstruction)

@cmsbuild, @jfernan2, @mandrenguyen, @valsdav, @y19y19 can you please review it and eventually sign? Thanks.
@Prasant1993, @Sam-Harper, @a-kapoor, @afiqaize, @jainshilpi, @lgray, @missirol, @ram1123, @rovere, @sameasy, @sobhatta, @varuns23 this is something you requested to watch as well.
@antoniovilela, @mandrenguyen, @rappoccio, @sextonkennedy you are the release manager for this.

cms-bot commands are listed here

@valsdav
Copy link
Contributor Author

valsdav commented Nov 11, 2024

please test

@cmsbuild
Copy link
Contributor

-1

Failed Tests: UnitTests RelVals RelVals-INPUT AddOn
Size: This PR adds an extra 44KB to repository
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-0fe9c2/42719/summary.html
COMMIT: 90d7ded
CMSSW: CMSSW_14_2_X_2024-11-11-1100/el8_amd64_gcc12
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week1/cms-sw/cmssw/46655/42719/install.sh to create a dev area with all the needed externals and cmssw changes.

Unit Tests

I found 4 errors in the following unit tests:

---> test test_MC_22_crosscheck had ERRORS
---> test test_MC_23_crosscheck had ERRORS
---> test test_MC_23_setup had ERRORS
and more ...

RelVals

  • 139.001A fatal system signal has occurred: segmentation violation
  • 2022.000001A fatal system signal has occurred: segmentation violation
  • 2022.002001A fatal system signal has occurred: segmentation violation
Expand to see more relval errors ...

RelVals-INPUT

  • 138.4138.4_PromptCollisions2021/step2_PromptCollisions2021.log
  • 138.5138.5_ExpressCollisions2021/step2_ExpressCollisions2021.log
  • 2024.0020012024.002001_RunJetMET02024B_10k/step1_dasquery.log
Expand to see more relval errors ...

AddOn Tests

A fatal system signal has occurred: segmentation violation
A fatal system signal has occurred: segmentation violation
A fatal system signal has occurred: segmentation violation
Expand to see more addon errors ...

@valsdav
Copy link
Contributor Author

valsdav commented Nov 11, 2024

please test

@valsdav
Copy link
Contributor Author

valsdav commented Nov 11, 2024

please abort test

@valsdav
Copy link
Contributor Author

valsdav commented Nov 11, 2024

please test

@cmsbuild
Copy link
Contributor

@cmsbuild
Copy link
Contributor

Pull request #46655 was updated. @jfernan2, @mandrenguyen, @valsdav, @y19y19 can you please check and sign again.

@cmsbuild
Copy link
Contributor

-1

Failed Tests: Build ClangBuild
Size: This PR adds an extra 20KB to repository
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-0fe9c2/42736/summary.html
COMMIT: 83d1d98
CMSSW: CMSSW_14_2_X_2024-11-11-1100/el8_amd64_gcc12
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week1/cms-sw/cmssw/46655/42736/install.sh to create a dev area with all the needed externals and cmssw changes.

Build

I found compilation warning when building: See details on the summary page.

Clang Build

I found compilation error while trying to compile with clang. Command used:

USER_CUDA_FLAGS='--expt-relaxed-constexpr' USER_CXXFLAGS='-Wno-register -fsyntax-only' scram build -k -j 32 COMPILER='llvm compile'

>> Entering Package RecoEgamma/EgammaPhotonProducers
>> Entering Package RecoEgamma/EgammaTools
>> Entering Package RecoEgamma/ElectronIdentification
>> Entering Package RecoEgamma/PhotonIdentification
>> Compile sequence completed for CMSSW CMSSW_14_2_X_2024-11-11-1100
gmake: *** [There are compilation/build errors. Please see the detail log above.] Error 1
+ eval scram build outputlog '&&' '(python3' /data/cmsbld/jenkins/workspace/ib-run-pr-tests/cms-bot/buildLogAnalyzer.py --logDir /data/cmsbld/jenkins/workspace/ib-run-pr-tests/CMSSW_14_2_X_2024-11-11-1100/tmp/el8_amd64_gcc12/cache/log/src '||' 'true)'
++ scram build outputlog
>> Entering Package RecoEgamma/EgammaElectronProducers
Entering library rule at src/RecoEgamma/EgammaElectronProducers/plugins
>> Compiling edm plugin src/RecoEgamma/EgammaElectronProducers/plugins/ElectronNHitSeedProducer.cc


EgammaDNNHelper was storing TF graphs globally by job, but TF sessions were owned my the GSFElectronProducer and GEDPhotonProducers.
This commit moves the TFSessions in the EgammaDNNHelper, making them global by job.
@cmsbuild
Copy link
Contributor

@cmsbuild
Copy link
Contributor

Pull request #46655 was updated. @cmsbuild, @jfernan2, @mandrenguyen, @valsdav, @y19y19 can you please check and sign again.

@valsdav
Copy link
Contributor Author

valsdav commented Nov 11, 2024

please test

@cmsbuild
Copy link
Contributor

+1

Size: This PR adds an extra 20KB to repository
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-0fe9c2/42744/summary.html
COMMIT: 22246a6
CMSSW: CMSSW_14_2_X_2024-11-11-1100/el8_amd64_gcc12
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week1/cms-sw/cmssw/46655/42744/install.sh to create a dev area with all the needed externals and cmssw changes.

Comparison Summary

Summary:

@valsdav
Copy link
Contributor Author

valsdav commented Nov 12, 2024

Hi @makortel, I would like to reproduce the memory profile of #46040 to compare the memory usage. Would you have any script to share to reproduce the same igProf memory profile that's in the issue? Thanks!

@makortel
Copy link
Contributor

Below is an untested recipe to reproduce what I did for #46494

# Set up developer area
cmssw-el7
export SCRAM_ARCH=slc7_amd64_gcc12
cmsrel CMSSW_14_0_16
cd CMSSW_14_0_16/src
cmsenv

# Set up the configuration
cp /eos/home-c/cmst0/public/PausedJobs/Run2024G/maxPSS/PromptReco_Run386319_Muon1/job/WMTaskSpace/cmsRun1/PSet.p* .
xrdcp root://eoscms.cern.ch//eos/cms/tier0/store/data/Run2024H/Muon1/RAW/v1/000/386/319/00000/27604955-1005-455b-90eb-8195e2f02cb8.root .
cat >>PSet.py <<EOF
process.source.fileNames = ["file:27604955-1005-455b-90eb-8195e2f02cb8.root"]
process.maxEvents.input = 20
process.options.numberOfThreads = 4
process.options.numberOfStreams = 4
process.IgProfService = cms.Service('IgProfService',
    reportFirstEvent = cms.untracked.int32(0),
    reportEventInterval = cms.untracked.int32(10),
    reportToFileAtPostEvent     = cms.untracked.string('| gzip -c > 4th_4st_07.mem.%I.gz')
)
EOF

# Run IgProf memory profiler
igprof -d -t cmsRunGlibC -z -mp -o test_07.mem.gz cmsRunGlibC PSet.py > out.txt 2>&1
igprof-analyse --sqlite -d -v -g -r MEM_LIVE test_4th_4st_07.mem.20.gz | sqlite3 test_4th_4st_07.20_live.sql3

# Make the memory profile web browseable (needs a web area with cgi-bin access)
cp (which igprof-navigator) <web dir>/cgi-bin
mkdir <web dir>/cgi-bin/data
cp test_4th_4st_07.20_live.sql3 <web dir>/cgi-bin/data

Open web browser to <web dir>/cgi-bin/igprof-navigator/test_4th_4st_07.20_live.

@valsdav
Copy link
Contributor Author

valsdav commented Nov 13, 2024

Thanks a lot! Unfortunately the data file is no more available, can I just use another one?

@@ -65,6 +66,7 @@ namespace egammaTools {
std::vector<uint> nInputs_;

std::vector<std::unique_ptr<const tensorflow::GraphDef>> graphDefs_;
std::vector<std::unique_ptr<tensorflow::Session>> sessions_;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Assuming tensor flow::Session is const thread-safe this should be std::vector<std::unique_ptr<const tensorflow::Session>> as it is used across threads concurrently.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIRC tensorflow::Session interface is not const, even if it is thread safe and does not mutate the Session object (i.e. in principle the interface could be const) :(

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants