
Fix: Double build time limit since #5027 halves NUM_JOBS #5212


Closed
wants to merge 1 commit into from
2 changes: 1 addition & 1 deletion jenkins/Build.groovy
@@ -18,7 +18,7 @@ LLM_DOCKER_IMAGE = env.dockerImage

AGENT_IMAGE = env.dockerImage

-POD_TIMEOUT_SECONDS = env.podTimeoutSeconds ? env.podTimeoutSeconds : "21600"
+POD_TIMEOUT_SECONDS = env.podTimeoutSeconds ? env.podTimeoutSeconds : "43200"
@chzblych (Collaborator) · Jun 16, 2025:
I don't think this is a long-term solution. We need to think about how to improve compilation efficiency; otherwise, future dependency updates or large C++ file changes will become painful.

Collaborator:

I believe the original problem was OOM, right?
In the past we found the MOE kernels use upward of 10 GiB per file. We could add a CI compile mode that groups them all into one file (which has about the same memory footprint). Even though that one file will take longer to compile, it would let us increase the thread count again for everything else. For regular developers we can leave this off and allow full parallelism if their dev environment can handle it.

This assumes that MOE is the main culprit, though it will require some investigation to see whether there is any other low-hanging fruit we can fix too.
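The grouping described above maps onto CMake's built-in unity-build support (CMake ≥ 3.16). A minimal sketch of the idea, assuming a hypothetical `CI_LOW_MEM_BUILD` option and `moe_kernels` target name (the project's actual CMake code may differ):

```cmake
# Sketch only: CI_LOW_MEM_BUILD and moe_kernels are hypothetical names,
# not TensorRT-LLM's actual build configuration.
option(CI_LOW_MEM_BUILD "Batch MOE kernel sources into fewer translation units" OFF)

if(CI_LOW_MEM_BUILD)
  # Combine all sources of the target into one unity TU. Peak memory is
  # roughly that of a single heavy file, but only one such compile runs
  # instead of many in parallel, so the rest of the build can keep a
  # higher job count.
  set_target_properties(moe_kernels PROPERTIES
    UNITY_BUILD ON
    UNITY_BUILD_BATCH_SIZE 0)  # 0 = all sources in a single unity file
endif()
```

CI could then configure with `cmake -DCI_LOW_MEM_BUILD=ON ...`, while developer builds leave the option off and keep full per-file parallelism.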

Collaborator:

@yunruis for vis.

I will merge this to unblock, since full compilation is needed when we do the dependency upgrade.

@yunruis (Contributor) · Jun 16, 2025:

Yes, it is because of OOM. I hit the OOM error during CI compilation nearly every time, so I asked CI infra (@ZhanruiSunCh) for help, and we changed the build job count from 8 to 4. If it is reverted to 8 without other compilation optimizations, I expect later developers will hit the CI OOM error too.
The only solution I could find is to enlarge the CPU memory on the CI machines.

Collaborator:

Yeah, I think OOM is the original reason for reducing NUM_JOBS by half.

In the CI pipeline, we create a CPU pod with up to 64 GB of memory. The reason is that the CPU hardware resources are shared by multiple teams, whereas the GPU hardware resources are dedicated to our team. We need to keep CPU pod resource usage reasonable to avoid affecting other teams.

Collaborator:

I also know there are options like Ninja build pools. As a first step, if we find a way to assign all the high-memory compilation steps to a single pool, we could quickly reduce the memory footprint without having to change much code.
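With the Ninja generator, CMake exposes these pools through the `JOB_POOLS` global property and the `JOB_POOL_COMPILE` target property. A sketch, again with a hypothetical `moe_kernels` target and an assumed pool depth:

```cmake
# Sketch only (Ninja generator): pool depth of 2 and the moe_kernels
# target name are assumptions, not the project's actual configuration.
set_property(GLOBAL PROPERTY JOB_POOLS heavy_compile=2)

# Route only the memory-hungry compiles through the small pool; all other
# targets keep the full -j parallelism.
set_target_properties(moe_kernels PROPERTIES JOB_POOL_COMPILE heavy_compile)
```

Ninja would then run at most 2 of these heavy compiles concurrently, bounding their peak memory (roughly 2 × 10 GiB given the figure above) while the rest of the build stays fully parallel.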


// Literals for easier access.
@Field