I don't think this is a long-term solution. We need to think about how to improve compilation efficiency. Otherwise, dependency updates or large-scale C++ file changes will become painful in the future.
I believe the original problem was OOM, right?
In the past we found that the MOE kernels use upward of 10 GiB per file. We could add a CI compile mode that groups them all into one file, which uses about the same memory footprint as a single one. Even though that one file will take longer to compile, it will allow us to increase the thread count again for everything else. For regular developers, we can leave this mode off and let them use full parallelism if their dev environment can handle it.
This assumes that MOE is the main culprit; it will take some investigation to see whether there is other low-hanging fruit we can fix too.
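To make the grouping idea concrete, here is a minimal sketch of a helper that generates one translation unit including every heavy kernel source. The function name `write_unity_source` and the file names in the usage note are hypothetical, not taken from the repository, and the claim that the combined file peaks at about the same memory as a single kernel file is the assumption stated above, not something this script verifies.

```python
from pathlib import Path


def write_unity_source(kernel_sources, unity_path):
    """Write one translation unit that #includes every listed kernel source.

    kernel_sources: iterable of source paths (hypothetical MOE kernel files).
    unity_path: where the generated file is written.
    """
    lines = ["// Auto-generated: groups heavy kernel sources into one translation unit."]
    for src in sorted(kernel_sources):
        # as_posix() keeps forward slashes so the #include is valid on any host.
        lines.append(f'#include "{Path(src).as_posix()}"')
    unity_path = Path(unity_path)
    unity_path.write_text("\n".join(lines) + "\n")
    return unity_path
```

A CI-only build mode could call, for example, `write_unity_source(["moe/kernel_a.cu", "moe/kernel_b.cu"], "moe_unity.cu")` and compile only the generated file. This only works if the grouped sources do not define clashing file-local symbols, which would need checking first.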
@yunruis for visibility.
I will merge this to unblock, since a full compilation is needed when we do the dependency upgrade.
Yes, it is because of OOM. I hit the OOM error during CI compilation nearly every time. I asked CI infra @ZhanruiSunCh for help, and we changed the build job count from 8 to 4. If it is reverted to 4 without any other compilation optimization, I guess later developers would hit the CI OOM error, too.
The solution I could find is to increase the CPU memory on the CI machine.
Yeah, I think OOM was the original reason for reducing NUM_JOBS by half.
In the CI pipeline, we create a CPU pod with up to 64 GB of memory. The cap exists because the CPU hardware resources are shared by multiple teams. In contrast, the GPU hardware resources are dedicated to our team. We need to keep CPU pod resource usage reasonable to avoid affecting other teams.
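As a rough back-of-the-envelope check, the sketch below computes how many compile jobs fit in the pod in the worst case where every concurrent job is a heavy file. It assumes the roughly 10 GiB per file figure reported for the MOE kernels in this thread and treats the 64 GB cap as 64 GiB for simplicity. The helper name `max_parallel_jobs` and the `reserve_gib` parameter are illustrative, not part of any build script here.

```python
def max_parallel_jobs(pod_memory_gib, peak_per_job_gib, reserve_gib=0.0):
    """Largest job count whose worst-case peak memory fits in the pod.

    Worst case means every concurrent job hits peak_per_job_gib at once.
    reserve_gib is memory held back for the OS and other processes (assumed).
    """
    if peak_per_job_gib <= 0:
        raise ValueError("peak_per_job_gib must be positive")
    usable = pod_memory_gib - reserve_gib
    # Always allow at least one job so the build can still make progress.
    return max(1, int(usable // peak_per_job_gib))


print(max_parallel_jobs(64, 10))  # worst-case budget for the 64 GB pod
```

Under these assumptions the budget is 6 jobs, which is consistent with 8 jobs running out of memory (8 × 10 GiB exceeds the cap) while 4 jobs fit.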
I also know there are options like Ninja build pools. As a first step, if we found a way to assign all the high-memory compilation steps to a single pool, we could quickly reduce the memory footprint without having to change any code.
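For illustration, a Ninja pool is declared with a `depth` and attached to build edges with `pool = <name>`. The sketch below only generates that text so the shape is visible. The pool name `heavy_compile`, the rule name `cxx`, and the file names are hypothetical. In a CMake-generated build the equivalent would more likely go through the `JOB_POOLS` global property and the `JOB_POOL_COMPILE` target property rather than hand-written Ninja.

```python
def ninja_pool_fragment(pool_name, depth, heavy_builds, rule="cxx"):
    """Return Ninja text declaring a pool and assigning build edges to it.

    heavy_builds maps each output path to its single input path.
    depth is the maximum number of edges in the pool that run concurrently.
    """
    lines = [f"pool {pool_name}", f"  depth = {depth}", ""]
    for output, source in heavy_builds.items():
        lines.append(f"build {output}: {rule} {source}")
        # Edges in the pool are throttled to `depth` regardless of -j.
        lines.append(f"  pool = {pool_name}")
    return "\n".join(lines) + "\n"


print(ninja_pool_fragment("heavy_compile", 1, {"moe_a.o": "moe_a.cu"}))
```

With `depth = 1`, at most one heavy file compiles at a time while all other edges still use the full `-j` parallelism, which is the memory reduction described above.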