
Fix: Double build time limit since #5027 halves NUM_JOBS #5212


Open · wants to merge 1 commit into main

Conversation

yuantailing
Collaborator

Fix: Double build time limit since #5027 halves NUM_JOBS

Description

Build time has increased by up to 2x because PR #5027 changed BUILD_JOBS from 8 to 4.

The Build-x86_64 job hit the six-hour timeout and was aborted: https://prod.blsm.nvidia.com/sw-tensorrt-top-1/blue/organizations/jenkins/LLM%2Fhelpers%2FBuild-x86_64/detail/Build-x86_64/19703/pipeline/100
In that run, the step bash -c 'cd llm && python3 scripts/build_wheel.py --use_ccache -j 4 -D '\''WARNING_IS_ERROR=ON'\'' -a '\''80-real;86-real;89-real;90-real;100-real;120-real'\'' -l -D '\''USE_CXX11_ABI=1'\''' took 5h58m27s before it was terminated.

Although ccache can speed up incremental builds, it offers no benefit when the Docker image changes or the cache is cold. Therefore, the build timeout should be set according to the worst-case build time.
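
As a rough sanity check of the new limit (a sketch using only the numbers in this PR, not a measurement):

```python
# The old limit was 6 hours (21600 s); halving the job count can slow the
# build by up to ~2x per the description, so the limit is doubled to 12 hours.
OLD_TIMEOUT_S = 21600            # limit the aborted Build-x86_64 run hit
SLOWDOWN_FACTOR = 2              # upper bound quoted above
NEW_TIMEOUT_S = OLD_TIMEOUT_S * SLOWDOWN_FACTOR
assert NEW_TIMEOUT_S == 43200    # 12 hours, the value set by this PR
print(NEW_TIMEOUT_S / 3600, "hours")
```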

Test Coverage

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provide a user-friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

run [--disable-fail-fast --skip-test --stage-list "A10-1, xxx" --gpu-type "A30, H100_PCIe" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-[Post-Merge]-1, xxx"]

Launch build/test pipelines. All previously running jobs will be killed.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests. Will also run L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-[Post-Merge]-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-[Post-Merge]-1, xxx".

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md.

kill

kill

Kill all running builds associated with the pull request.

skip

skip --comment COMMENT

Skip testing for the latest commit on the pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate the current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

@yuantailing yuantailing requested a review from kaiyux June 14, 2025 01:12
@yuantailing
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #8867 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #8867 [ run ] completed with state FAILURE
/LLM/main/L0_MergeRequest_PR pipeline #6454 completed with status: 'ABORTED'

@yuantailing
Collaborator Author

Rebased onto main and trying again.

@yuantailing
Collaborator Author

/bot run

@NVIDIA NVIDIA deleted a comment from tensorrt-cicd Jun 14, 2025
@NVIDIA NVIDIA deleted a comment from tensorrt-cicd Jun 14, 2025
@tensorrt-cicd
Collaborator

PR_Github #8874 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #8874 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #6461 completed with status: 'FAILURE'

@juney-nvidia
Collaborator

@niukuo @EmmaQiaoCh Hi Yiteng, Emma,

Could you review this change to the build timeout setting?

June

@kaiyux kaiyux requested review from chzblych and djns99 and removed request for djns99 June 14, 2025 11:16
Collaborator

@djns99 djns99 left a comment


This is good as a temporary solution. How often does a full rebuild happen?

I am happy to help consult on a better solution. This will be painful whenever there are changes that impact the MOE launchers: they take ~20 minutes each, last I measured, so this will likely double the build time in those cases. We have the generate_kernels.py script, which we can change to control how the instantiations are grouped; we should perhaps try to reduce this to <4 files, since the increase in time/memory is sublinear when merging instantiations.
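
A minimal sketch of the kind of grouping being suggested, purely for illustration: the helper, file names, and include below are hypothetical and are not the actual generate_kernels.py interface.

```python
# Hypothetical sketch: pack explicit template instantiations into a small,
# fixed number of generated .cu files instead of many tiny ones, so fewer
# heavy compilations run at once while time/memory grows sublinearly.
from pathlib import Path

def write_grouped_instantiations(instantiations, out_dir, num_files=4):
    """Distribute instantiation lines across `num_files` generated files."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    groups = [instantiations[i::num_files] for i in range(num_files)]
    for idx, group in enumerate(groups):
        body = "\n".join(group)
        # The header name is a placeholder, not a real file in the repo.
        (out / f"moe_gemm_instantiations_{idx}.cu").write_text(
            '#include "moe_gemm_kernels_template.h"\n\n' + body + "\n"
        )

# Example: 32 hypothetical instantiations packed into 4 translation units.
example = [f"template class MoeGemmRunner<Config{i}>;" for i in range(32)]
write_grouped_instantiations(example, "generated", num_files=4)
```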

@yuantailing
Collaborator Author

yuantailing commented Jun 16, 2025

How often does a full rebuild happen?

There are two data samples (a rough extrapolation from them follows the list):

  1. Docker image change: 4h15m24s with NUM_JOBS=8. https://prod.blsm.nvidia.com/sw-tensorrt-top-1/blue/organizations/jenkins/LLM%2Fhelpers%2FBuild-x86_64/detail/Build-x86_64/19333/pipeline/128
  2. Merge 31 commits from main branch, including cutlass change: >5h58m27s (timeout) with NUM_JOBS=4. https://prod.blsm.nvidia.com/sw-tensorrt-top-1/blue/organizations/jenkins/LLM%2Fhelpers%2FBuild-x86_64/detail/Build-x86_64/19703/pipeline/100
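
A rough extrapolation from these two samples, assuming build time scales roughly inversely with NUM_JOBS (an approximation only):

```python
# Sample 1: full rebuild took 4h15m24s with NUM_JOBS=8.
full_build_8_jobs_s = 4 * 3600 + 15 * 60 + 24
# Halving the jobs would then take roughly twice as long (~8h31m), which
# exceeds the old 6-hour limit but fits within the new 12-hour one.
estimated_4_jobs_s = full_build_8_jobs_s * 2
old_limit_s, new_limit_s = 21600, 43200
print(estimated_4_jobs_s > old_limit_s)   # True: consistent with the timeout
print(estimated_4_jobs_s < new_limit_s)   # True: 12 hours leaves headroom
```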

@djns99
Collaborator

djns99 commented Jun 16, 2025

Cutlass change: >5h58m27s

I actually expect the build to more than double: the internal cutlass build takes 1 hour with 16 jobs. So we are potentially adding 1 hour straight away, even before doubling the build time.

@@ -18,7 +18,7 @@ LLM_DOCKER_IMAGE = env.dockerImage
 
 AGENT_IMAGE = env.dockerImage
 
-POD_TIMEOUT_SECONDS = env.podTimeoutSeconds ? env.podTimeoutSeconds : "21600"
+POD_TIMEOUT_SECONDS = env.podTimeoutSeconds ? env.podTimeoutSeconds : "43200"
Collaborator

@chzblych chzblych Jun 16, 2025


I don't think this is a long-term solution. We need to think about how to improve compilation efficiency; otherwise it will become painful for dependency updates or massive C++ file changes in the future.

Collaborator


I believe the original problem was OOM, right?
In the past we found the MOE kernels use upward of 10 GiB/file. We could add a CI compile mode that groups them all into one file (which uses about the same memory footprint); even though compiling that one file will take longer, it would let us increase the thread count again for everything else. And for regular developers we can remove this and allow them full parallelism if their dev environment can handle it.

This assumes that MOE is the main culprit, though; it will require some investigation to see if there is any other low-hanging fruit we can fix too.
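
A minimal sketch of what such a CI compile mode could look like; the function and the non-MOE per-file figure are assumptions for illustration, while the 10 GiB/file and 64 GB numbers come from this thread.

```python
import os

POD_MEM_GIB = 64        # CI CPU pod limit mentioned later in this thread
MOE_TU_PEAK_GIB = 10    # "upward of 10GiB/file" for the MOE kernels
OTHER_TU_PEAK_GIB = 2   # assumed typical non-MOE translation unit (guess)

def pick_num_jobs(ci_mode: bool) -> int:
    """Hypothetical job-count policy, not existing build_wheel.py behavior."""
    if ci_mode:
        # With all MOE instantiations grouped into one file, only one
        # ~10 GiB compile is in flight; spend the rest of the memory
        # budget on cheaper translation units.
        return max(1, (POD_MEM_GIB - MOE_TU_PEAK_GIB) // OTHER_TU_PEAK_GIB)
    # Developer machines: use full parallelism if the environment allows.
    return os.cpu_count() or 8

print(pick_num_jobs(ci_mode=True))    # 27 under these assumptions
print(pick_num_jobs(ci_mode=False))
```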

Collaborator


@yunruis for vis.

I will merge this to unblock, since a full compilation is needed when we do the dependency upgrade.

Contributor

@yunruis yunruis Jun 16, 2025


Yes, it is because of OOM. I hit the OOM error during CI compilation nearly every time. I asked CI infra @ZhanruiSunCh for help, which is why the build job count was changed from 8 to 4. If it is reverted without other compilation optimizations, I expect later developers would hit the CI OOM error too.
The solution I could find is to enlarge the CPU memory on the CI machines.

Collaborator


Yeah, I think OOM is the original reason for reducing NUM_JOBS by half.

In the CI pipeline, we create a CPU pod with up to 64 GB of memory. The reason is that the CPU hardware resources are shared by multiple teams, whereas the GPU hardware resources are dedicated to our team. We need to keep CPU pod resource usage reasonable to avoid affecting the other teams.
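
A rough view of the memory budget behind the 8 -> 4 reduction, combining the 64 GB pod limit above with the ~10 GiB/file MOE estimate from earlier in this thread (worst case only; not every translation unit is that heavy):

```python
pod_mem_gib = 64
moe_tu_peak_gib = 10   # per-file peak quoted earlier for the MOE kernels
for jobs in (8, 4):
    worst_case_gib = jobs * moe_tu_peak_gib
    status = "OOM risk" if worst_case_gib > pod_mem_gib else "fits"
    print(f"{jobs} jobs -> up to {worst_case_gib} GiB ({status})")
# 8 jobs -> up to 80 GiB (OOM risk)
# 4 jobs -> up to 40 GiB (fits)
```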
