-
Notifications
You must be signed in to change notification settings - Fork 9
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
#1934: Add parameter to control minimal retention of historical LB data #1996
base: develop
Are you sure you want to change the base?
#1934: Add parameter to control minimal retention of historical LB data #1996
Conversation
Pipelines resultsPR tests (gcc-12, ubuntu, mpich) Build for a8ff97d (2024-01-30 17:22:21 UTC)
PR tests (clang-3.9, ubuntu, mpich) Build for 0b5ee2e (2022-11-11 13:45:39 UTC)
PR tests (gcc-10, ubuntu, openmpi, no LB) Build for 0b2489f (2024-09-26 16:27:09 UTC)
PR tests (gcc-5, ubuntu, mpich) Build for 0b5ee2e (2022-11-11 13:45:39 UTC)
PR tests (gcc-6, ubuntu, mpich) Build for 0b5ee2e (2022-11-11 13:45:39 UTC)
PR tests (gcc-9, ubuntu, mpich, zoltan) Build for 4e93971 (2023-03-28 10:16:37 UTC)
PR tests (clang-5.0, ubuntu, mpich) Build for 0b5ee2e (2022-11-11 13:45:39 UTC)
PR tests (gcc-7, ubuntu, mpich, trace runtime, LB) Build for 4e93971 (2023-03-28 10:16:37 UTC)
PR tests (nvidia cuda 11.0, ubuntu, mpich) Build for dec4a3e (2023-04-04 10:47:19 UTC)
PR tests (nvidia cuda 10.1, ubuntu, mpich) Build for 0b5ee2e (2022-11-11 13:45:39 UTC)
PR tests (clang-9, ubuntu, mpich) Build for 0b2489f (2024-09-26 16:27:09 UTC)
PR tests (clang-13, alpine, mpich) Build for 0b2489f (2024-09-26 16:27:09 UTC)
PR tests (gcc-8, ubuntu, mpich, address sanitizer) Build for 0b2489f (2024-09-26 16:27:09 UTC)
PR tests (clang-11, ubuntu, mpich) Build for 0b2489f (2024-09-26 16:27:09 UTC)
PR tests (clang-12, ubuntu, mpich) Build for 0b2489f (2024-09-26 16:27:09 UTC)
PR tests (clang-13, ubuntu, mpich) Build for 0b2489f (2024-09-26 16:27:09 UTC)
PR tests (intel icpx, ubuntu, mpich) Build for 0b5ee2e (2022-11-11 13:45:39 UTC)
PR tests (clang-14, ubuntu, mpich) Build for a8ff97d (2024-01-30 17:22:21 UTC)
PR tests (clang-10, ubuntu, mpich) Build for 0b2489f (2024-09-26 16:27:09 UTC)
PR tests (gcc-11, ubuntu, mpich, json schema test) Build for 4e93971 (2023-03-28 10:16:37 UTC)
PR tests (intel icpc, ubuntu, mpich) Build for 0b2489f (2024-09-26 16:27:09 UTC)
PR tests (nvidia cuda 11.2, ubuntu, mpich) Build for dec4a3e (2023-04-04 10:47:19 UTC)
PR tests (gcc-9, ubuntu, mpich, zoltan, json schema test) Build for 0b2489f (2024-09-26 16:27:09 UTC)
PR tests (gcc-11, ubuntu, mpich, trace runtime, coverage) Build for 0b2489f (2024-09-26 16:27:09 UTC)
PR tests (nvidia cuda 11.2, gcc-9, ubuntu, mpich) Build for 0b2489f (2024-09-26 16:27:09 UTC)
PR tests (gcc-12, ubuntu, mpich, verbose) Build for 4be442c (2024-06-12 13:05:01 UTC)
PR tests (intel icpx, ubuntu, mpich, verbose) Build for 0b2489f (2024-09-26 16:27:09 UTC)
PR tests (clang-14, ubuntu, mpich, verbose) Build for 0b2489f (2024-09-26 16:27:09 UTC)
PR tests (nvidia cuda 12.2.0, gcc-9, ubuntu, mpich, verbose) Build for 0b2489f (2024-09-26 16:27:09 UTC)
PR tests (gcc-12, ubuntu, mpich, verbose, kokkos) Build for 0b2489f (2024-09-26 16:27:09 UTC)
|
84321c3
to
1fa5810
Compare
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## develop #1996 +/- ##
===========================================
- Coverage 85.48% 84.93% -0.55%
===========================================
Files 722 723 +1
Lines 25907 25662 -245
===========================================
- Hits 22146 21796 -350
- Misses 3761 3866 +105
|
Other than the minor interface thing I noted, I'd approve this. |
181ce43
to
1422bbc
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks ready to go.
Could you think through whether there are any weird concerns if a new model gets set and that changes the number of phases retained? In particular, I can think of the possibility that we keep some stale data forever if a new model demands fewer phases than its predecessor.
@PhilMiller Sorry for the delay, I missed your message. From what I understand the scenario that you described is the only one which can have some weird behaviors. LB data is being cleaned after each phase so that will occur rather frequently. So if there is more data than model needs then it will not be retained for long. |
1422bbc
to
0b5ee2e
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See my latest comment about covering the edge case
The reason I see the potential for retention is that the cleaning removes data from a particular past phase at a fixed offset from the current phase. if the model's requested retention falls, then there will be a range of phases whose data gets skipped over. |
0b5ee2e
to
eee6783
Compare
…the containers rather than clearing their contents.
…cularPhasesBuffer
87f126c
to
ddc3c57
Compare
proxy_new.broadcastCollective<TestCol::colHandler>(); | ||
}); | ||
// Go to the next phase. | ||
vt::thePhase()->nextPhaseCollective(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When we reset the current phase to 0 in restoreFromFileInPlace
, then this line will cause vt to abort due to having different phases in LBData and in the stats message:
vtAssert(lb_data.getPhase() == msg->getPhase(), "Phases must match"); |
@lifflander / @nlslatt - Do we need to reset the current phase to 0 in restoreFromFileInPlace
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Based on my recent testing, it appears that the resetPhase
method is essential for the proper functioning of VT.
My previous tests were successful because I did not fully recreate VT, which resulted in the preservation of PhaseManager
, NodeLBData
, and other VT modules.
When we create a new fresh instance and restore the collection from the checkpoint, every module will commence execution from phase 0.
Upon restoring the collection from the checkpoint, we will retain only one phase of history, even if multiple phases are serialized. Upon restoration, the current phase is reset to 0, rendering other phases inaccessible.
Closes #1934