
Fix: Double build time limit since #5027 halves NUM_JOBS #5212


Open · wants to merge 1 commit into main

Conversation

yuantailing
Collaborator

Fix: Double build time limit since #5027 halves NUM_JOBS

Description

Build time has increased by up to 2x because PR #5027 changed BUILD_JOBS from 8 to 4.

The Build-x86_64 job hit the six-hour timeout and was aborted: https://prod.blsm.nvidia.com/sw-tensorrt-top-1/blue/organizations/jenkins/LLM%2Fhelpers%2FBuild-x86_64/detail/Build-x86_64/19703/pipeline/100
In that run, the step bash -c 'cd llm && python3 scripts/build_wheel.py --use_ccache -j 4 -D '\''WARNING_IS_ERROR=ON'\'' -a '\''80-real;86-real;89-real;90-real;100-real;120-real'\'' -l -D '\''USE_CXX11_ABI=1'\''' took 5h58m27s before it was terminated.

Although ccache can speed up incremental builds, it offers no benefit when the Docker image changes or the cache is cold. Therefore, the build timeout should be set according to the worst-case build time.
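
As a rough sanity check of the new limit (a sketch using only the numbers in this PR, not a measurement):

```python
# The old limit was 6 hours (21600 s); halving the job count can slow the
# build by up to ~2x per the description, so the limit is doubled to 12 hours.
OLD_TIMEOUT_S = 21600            # limit the aborted Build-x86_64 run hit
SLOWDOWN_FACTOR = 2              # upper bound quoted above
NEW_TIMEOUT_S = OLD_TIMEOUT_S * SLOWDOWN_FACTOR
assert NEW_TIMEOUT_S == 43200    # 12 hours, the value set by this PR
print(NEW_TIMEOUT_S / 3600, "hours")
```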

Test Coverage

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provide a user-friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

run [--disable-fail-fast --skip-test --stage-list "A10-1, xxx" --gpu-type "A30, H100_PCIe" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-[Post-Merge]-1, xxx"]

Launch build/test pipelines. All previously running jobs will be killed.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests. Will also run L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-[Post-Merge]-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-[Post-Merge]-1, xxx".

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md.

kill

kill

Kill all running builds associated with the pull request.

skip

skip --comment COMMENT

Skip testing for the latest commit on the pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate the current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

@yuantailing yuantailing requested a review from kaiyux June 14, 2025 01:12
@yuantailing
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #8867 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #8867 [ run ] completed with state FAILURE
/LLM/main/L0_MergeRequest_PR pipeline #6454 completed with status: 'ABORTED'

@yuantailing
Collaborator Author

Rebased onto main and trying again.

@yuantailing
Collaborator Author

/bot run

@NVIDIA NVIDIA deleted a comment from tensorrt-cicd Jun 14, 2025
@NVIDIA NVIDIA deleted a comment from tensorrt-cicd Jun 14, 2025
@tensorrt-cicd
Collaborator

PR_Github #8874 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #8874 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #6461 completed with status: 'FAILURE'

@juney-nvidia
Collaborator

@niukuo @EmmaQiaoCh Hi Yiteng, Emma,

Could you review this change to the build timeout setting?

June

@kaiyux kaiyux requested review from chzblych and djns99 and removed request for djns99 June 14, 2025 11:16
Collaborator

@djns99 djns99 left a comment


This is good as a temporary solution. How often does a full rebuild happen?

I am happy to help consult on a better solution. This will be painful whenever there are changes that impact the MOE launchers: they take ~20 minutes each, last I measured, so this will likely double the build time in those cases. We have the generate_kernels.py script, which we can change to control how the instantiations are grouped; we should perhaps try to reduce this to <4 files, since the increase in time/memory is sublinear when merging instantiations.
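
A minimal sketch of the kind of grouping being suggested, purely for illustration: the helper, file names, and include below are hypothetical and are not the actual generate_kernels.py interface.

```python
# Hypothetical sketch: pack explicit template instantiations into a small,
# fixed number of generated .cu files instead of many tiny ones, so fewer
# heavy compilations run at once while time/memory grows sublinearly.
from pathlib import Path

def write_grouped_instantiations(instantiations, out_dir, num_files=4):
    """Distribute instantiation lines across `num_files` generated files."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    groups = [instantiations[i::num_files] for i in range(num_files)]
    for idx, group in enumerate(groups):
        body = "\n".join(group)
        # The header name is a placeholder, not a real file in the repo.
        (out / f"moe_gemm_instantiations_{idx}.cu").write_text(
            '#include "moe_gemm_kernels_template.h"\n\n' + body + "\n"
        )

# Example: 32 hypothetical instantiations packed into 4 translation units.
example = [f"template class MoeGemmRunner<Config{i}>;" for i in range(32)]
write_grouped_instantiations(example, "generated", num_files=4)
```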

@yuantailing
Collaborator Author

yuantailing commented Jun 16, 2025

How often does a full rebuild happen?

There are two data samples (a rough extrapolation from them follows the list):

  1. Docker image change: 4h15m24s with NUM_JOBS=8. https://prod.blsm.nvidia.com/sw-tensorrt-top-1/blue/organizations/jenkins/LLM%2Fhelpers%2FBuild-x86_64/detail/Build-x86_64/19333/pipeline/128
  2. Merge 31 commits from main branch, including cutlass change: >5h58m27s (timeout) with NUM_JOBS=4. https://prod.blsm.nvidia.com/sw-tensorrt-top-1/blue/organizations/jenkins/LLM%2Fhelpers%2FBuild-x86_64/detail/Build-x86_64/19703/pipeline/100
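
A rough extrapolation from these two samples, assuming build time scales roughly inversely with NUM_JOBS (an approximation only):

```python
# Sample 1: full rebuild took 4h15m24s with NUM_JOBS=8.
full_build_8_jobs_s = 4 * 3600 + 15 * 60 + 24
# Halving the jobs would then take roughly twice as long (~8h31m), which
# exceeds the old 6-hour limit but fits within the new 12-hour one.
estimated_4_jobs_s = full_build_8_jobs_s * 2
old_limit_s, new_limit_s = 21600, 43200
print(estimated_4_jobs_s > old_limit_s)   # True: consistent with the timeout
print(estimated_4_jobs_s < new_limit_s)   # True: 12 hours leaves headroom
```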

@djns99
Collaborator

djns99 commented Jun 16, 2025

Cutlass change: >5h58m27s

I actually expect the build to more than double: the internal cutlass build takes 1 hour with 16 jobs. So we are potentially adding 1 hour straight away, even before doubling the build time.

@@ -18,7 +18,7 @@ LLM_DOCKER_IMAGE = env.dockerImage
 
 AGENT_IMAGE = env.dockerImage
 
-POD_TIMEOUT_SECONDS = env.podTimeoutSeconds ? env.podTimeoutSeconds : "21600"
+POD_TIMEOUT_SECONDS = env.podTimeoutSeconds ? env.podTimeoutSeconds : "43200"
Collaborator

@chzblych chzblych Jun 16, 2025


I don't think this is a long-term solution. We need to think about how to improve compilation efficiency; otherwise it will become painful for dependency updates or massive C++ file changes in the future.

Collaborator


I believe the original problem was OOM, right?
In the past we found the MOE kernels use upward of 10 GiB/file. We could add a CI compile mode that groups them all into one file (which uses about the same memory footprint); even though compiling that one file will take longer, it would let us increase the thread count again for everything else. And for regular developers we can remove this and allow them full parallelism if their dev environment can handle it.

This assumes that MOE is the main culprit, though; it will require some investigation to see if there is any other low-hanging fruit we can fix too.
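
A minimal sketch of what such a CI compile mode could look like; the function and the non-MOE per-file figure are assumptions for illustration, while the 10 GiB/file and 64 GB numbers come from this thread.

```python
import os

POD_MEM_GIB = 64        # CI CPU pod limit mentioned later in this thread
MOE_TU_PEAK_GIB = 10    # "upward of 10GiB/file" for the MOE kernels
OTHER_TU_PEAK_GIB = 2   # assumed typical non-MOE translation unit (guess)

def pick_num_jobs(ci_mode: bool) -> int:
    """Hypothetical job-count policy, not existing build_wheel.py behavior."""
    if ci_mode:
        # With all MOE instantiations grouped into one file, only one
        # ~10 GiB compile is in flight; spend the rest of the memory
        # budget on cheaper translation units.
        return max(1, (POD_MEM_GIB - MOE_TU_PEAK_GIB) // OTHER_TU_PEAK_GIB)
    # Developer machines: use full parallelism if the environment allows.
    return os.cpu_count() or 8

print(pick_num_jobs(ci_mode=True))    # 27 under these assumptions
print(pick_num_jobs(ci_mode=False))
```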

Collaborator


@yunruis for vis.

I will merge this to unblock, since a full compilation is needed when we do the dependency upgrade.

Contributor

@yunruis yunruis Jun 16, 2025


Yes, it is because of OOM. I hit the OOM error during CI compilation nearly every time. I asked CI infra @ZhanruiSunCh for help, which is why the build job count was changed from 8 to 4. If it is reverted without other compilation optimizations, I expect later developers would hit the CI OOM error too.
The solution I could find is to enlarge the CPU memory on the CI machines.

Collaborator


Yeah, I think OOM is the original reason for reducing NUM_JOBS by half.

In the CI pipeline, we create a CPU pod with up to 64 GB of memory. The reason is that the CPU hardware resources are shared by multiple teams, whereas the GPU hardware resources are dedicated to our team. We need to keep CPU pod resource usage reasonable to avoid affecting the other teams.
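
A rough view of the memory budget behind the 8 -> 4 reduction, combining the 64 GB pod limit above with the ~10 GiB/file MOE estimate from earlier in this thread (worst case only; not every translation unit is that heavy):

```python
pod_mem_gib = 64
moe_tu_peak_gib = 10   # per-file peak quoted earlier for the MOE kernels
for jobs in (8, 4):
    worst_case_gib = jobs * moe_tu_peak_gib
    status = "OOM risk" if worst_case_gib > pod_mem_gib else "fits"
    print(f"{jobs} jobs -> up to {worst_case_gib} GiB ({status})")
# 8 jobs -> up to 80 GiB (OOM risk)
# 4 jobs -> up to 40 GiB (fits)
```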
