Fix: Double build time limit since #5027 halves NUM_JOBS
#5212
base: main
Conversation
/bot run
PR_Github #8867 [ run ] triggered by Bot
PR_Github #8867 [ run ] completed with state
Signed-off-by: Tailing Yuan <[email protected]>
19cedeb to 93e7641
Rebase main and try again.
/bot run
PR_Github #8874 [ run ] triggered by Bot
PR_Github #8874 [ run ] completed with state
@niukuo @EmmaQiaoCh Hi Yiteng, Emma, can you review this change to the build timeout setting? June
This is good as a temporary solution. How often does a full rebuild happen?
I am happy to help consult on a better solution, as this will be painful whenever there are changes that impact the MOE launchers: they take ~20 minutes each the last time I measured, so this will likely double the build time in those cases. We have the generate_kernels.py script that we can change to control how the instantiations are grouped; we should perhaps try to reduce this to fewer than 4 files, since the increase in time/memory is sublinear when merging instantiations.
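For illustration, a minimal sketch of the kind of grouping knob described above. The function and file names are hypothetical, not the actual generate_kernels.py interface; the sketch only shows the idea of batching explicit instantiations into a configurable number of generated files.

# Hypothetical sketch of grouping kernel instantiations into fewer generated
# translation units; names below are illustrative, not the real
# generate_kernels.py API.
from pathlib import Path
from typing import List


def write_grouped_files(instantiations: List[str], out_dir: Path, group_size: int) -> None:
    """Emit one .generated.cu file per `group_size` instantiations."""
    out_dir.mkdir(parents=True, exist_ok=True)
    for start in range(0, len(instantiations), group_size):
        group = instantiations[start:start + group_size]
        target = out_dir / f"moe_gemm_group_{start // group_size}.generated.cu"
        # Each generated file is just the concatenation of explicit instantiations.
        target.write_text("\n".join(group) + "\n")


if __name__ == "__main__":
    # Toy example: 16 instantiations grouped 8 per file -> 2 files instead of 16.
    toy = [f"template class MoeGemmRunner<half, half, {tile}>;  // placeholder" for tile in range(16)]
    write_grouped_files(toy, Path("generated"), group_size=8)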
There are two data samples:
I actually expect the build to more than double: the internal cutlass build takes 1 hour with 16 jobs, so we are potentially adding 1 hour straight away even before doubling the build time.
@@ -18,7 +18,7 @@ LLM_DOCKER_IMAGE = env.dockerImage
 AGENT_IMAGE = env.dockerImage
-POD_TIMEOUT_SECONDS = env.podTimeoutSeconds ? env.podTimeoutSeconds : "21600"
+POD_TIMEOUT_SECONDS = env.podTimeoutSeconds ? env.podTimeoutSeconds : "43200"
I don't think this is a long-term solution. We need to think about how to improve compilation efficiency; otherwise, it will become painful for dependency updates or massive C++ file changes in the future.
I believe the original problem was OOM, right?
In the past we found the MOE kernels use upwards of 10 GiB per file. We could add a CI compile mode that groups them all into one file (which has about the same memory footprint); even though that single file will take longer to compile, it would allow us to increase the threads again for everything else. And for regular developers we can remove this and let them use full parallelism if their dev environment can handle it.
This assumes that MOE is the main culprit, though; it will require some investigation to see whether there is any other low-hanging fruit we can fix too.
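A compact sketch of what such a CI-only mode could look like, under the same assumptions as the sketch above; the --ci-merge-moe flag and the helper are hypothetical, not existing generate_kernels.py options, and only illustrate collapsing every MOE instantiation into a single translation unit so the rest of the build can run with more jobs again.

# Hypothetical CI-only switch: merge all MOE instantiations into one file so
# only a single compile job pays the large memory peak; flag and helper names
# are illustrative only.
import argparse
from pathlib import Path


def emit(instantiations, out_dir: Path, group_size: int) -> None:
    out_dir.mkdir(parents=True, exist_ok=True)
    for start in range(0, len(instantiations), group_size):
        path = out_dir / f"moe_group_{start // group_size}.generated.cu"
        path.write_text("\n".join(instantiations[start:start + group_size]) + "\n")


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--ci-merge-moe", action="store_true",
                        help="Emit a single combined MOE file (CI-only mode).")
    args = parser.parse_args()

    toy = [f"template class MoeGemmRunner<half, half, {t}>;  // placeholder" for t in range(16)]
    # CI mode: one big file; developer mode: keep the current small groups.
    group_size = len(toy) if args.ci_merge_moe else 4
    emit(toy, Path("generated"), group_size)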
@yunruis for vis.
I will merge this to unblock, since full compilation is needed when we do the dependency upgrade.
Yes, it is because of OOM. I hit the OOM error during CI compilation nearly every time, so I asked CI infra @ZhanruiSunCh for help and the build job count was changed from 8 to 4. If it is reverted without other compilation optimizations, I expect later developers will hit the CI OOM error too.
The only solution I could find is to enlarge the CPU memory on the CI machine.
Yeah, I think OOM is the original reason for reducing NUM_JOBS by half.
In the CI pipeline, we create a CPU pod with up to 64 GB of memory. The reason is that the CPU hardware resources are shared by multiple teams, whereas the GPU hardware resources are dedicated to our team. We need to keep CPU pod resource usage reasonable to avoid affecting other teams.
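Putting the rough figures from this thread together (~10 GiB peak per MOE translation unit, a 64 GB CPU pod), a quick back-of-the-envelope check shows why 8 parallel jobs can OOM while 4 fit; these are the approximate numbers quoted above, not measurements from this PR.

# Back-of-the-envelope check of the OOM constraint, using the rough figures
# quoted in this thread (~10 GiB peak per MOE file, 64 GB CPU pod).
GIB_PER_HEAVY_FILE = 10   # approximate peak memory per MOE translation unit
POD_MEMORY_GIB = 64       # CI CPU pod limit mentioned above

for jobs in (8, 4):
    worst_case = jobs * GIB_PER_HEAVY_FILE  # all jobs on heavy files at once
    verdict = "exceeds" if worst_case > POD_MEMORY_GIB else "fits within"
    print(f"-j {jobs}: worst case ~{worst_case} GiB, {verdict} the {POD_MEMORY_GIB} GB pod")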
Fix: Double build time limit since #5027 halves NUM_JOBS
Description
Build time has increased up to 2x because PR #5027 changes BUILD_JOBS from 8 to 4. The Build-x86_64 job hit the six-hour timeout and was aborted: https://prod.blsm.nvidia.com/sw-tensorrt-top-1/blue/organizations/jenkins/LLM%2Fhelpers%2FBuild-x86_64/detail/Build-x86_64/19703/pipeline/100

In that run, the step

bash -c 'cd llm && python3 scripts/build_wheel.py --use_ccache -j 4 -D '\''WARNING_IS_ERROR=ON'\'' -a '\''80-real;86-real;89-real;90-real;100-real;120-real'\'' -l -D '\''USE_CXX11_ABI=1'\'''

took 5h58m27s before it was terminated. Although ccache can speed up incremental builds, it offers no benefit when the Docker image changes or the cache is cold. Therefore, the build timeout should be set according to the worst-case build time.
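As a sanity check on the chosen value (a sketch of the arithmetic, assuming the 5h58m27s run above is representative of a cold-cache build at -j 4): the old 21600 s (6 h) limit leaves essentially no headroom, while 43200 s (12 h) leaves about six hours.

# Sanity check of the new POD_TIMEOUT_SECONDS default against the worst-case
# cold-cache build observed at -j 4 in the run linked above.
OLD_LIMIT_S = 21600                          # previous default: 6 hours
NEW_LIMIT_S = 43200                          # new default set by this PR: 12 hours
WORST_CASE_J4_S = 5 * 3600 + 58 * 60 + 27    # 5h58m27s before termination

print(f"worst case at -j 4: {WORST_CASE_J4_S} s ({WORST_CASE_J4_S / 3600:.2f} h)")
print(f"headroom under old limit: {OLD_LIMIT_S - WORST_CASE_J4_S} s")   # 93 s
print(f"headroom under new limit: {NEW_LIMIT_S - WORST_CASE_J4_S} s")   # 21693 s (~6 h)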
Test Coverage
GitHub Bot Help
/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...
Provide a user friendly way for developers to interact with a Jenkins server.
Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

run

run [--disable-fail-fast --skip-test --stage-list "A10-1, xxx" --gpu-type "A30, H100_PCIe" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-[Post-Merge]-1, xxx"]

Launch build/test pipelines. All previously running jobs will be killed.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.
--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.
--stage-list "A10-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-1, xxx". Note: Does NOT update GitHub check status.
--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.
--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.
--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.
--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests. Will also run L0 pre-merge pipeline.
--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.
--extra-stage "H100_PCIe-[Post-Merge]-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-[Post-Merge]-1, xxx".

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md.

kill

kill

Kill all running builds associated with pull request.

skip

skip --comment COMMENT

Skip testing for latest commit on pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.