-
Notifications
You must be signed in to change notification settings - Fork 80
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support multiple Github actions runners on the same machine #158
base: main
Are you sure you want to change the base?
Conversation
xshell::cmd!(sh, "sudo apt-get update").run()?; | ||
// retry on failure in case another process has a lock on /var/lib/apt/lists/lock | ||
// DPkg::Lock::Timeout doesn't seem to work for apt-get update | ||
xshell::cmd!(sh, "bash -c 'i=0; while [ $i -lt 5 ] && ! sudo apt-get update; do let \"i=i+1\"; sleep 1; done;'").run()?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no need to do a bash loop here - you're already writing Rust, so just use a Rust for loop + std::thread::sleep
here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Aside from the one nit, LGTM
This way lies madness. If we need to support multiple concurrent jobs on the same physical machine, then we need to run those jobs in containers or VMs. |
John brings up a good point, and its one I probably should've realized sooner when discussing this work with you... Lets hold off on this PR until we align on the broader goals we're trying to achieve here |
The idea here is to make better use of the physical ARM test machines that we have, since nested virtualization is not supported yet. I've tested this with 4 concurrent runners on the same machine and it works well. Also, this might help mitigate some occasional failures on our current 1ES CI machines due to apt. Also, I think these changes are a good idea whether or not we have multiple runners on the same machine, in case there are other processes running on them. |
The apt changes are def pure goodness, and whether its as part of this PR or a new PR, we should make sure those land. While it's certainly a bit suspect that we keep seeing those dpkg lock failures on CI machines that are ostensibly wiped between runs, it does seems like a common enough issue reported my many folks that I'm willing to chalk it up to CI weirdness, and work around it.
I think this falls under the umbrella of "this way lies madness". We wouldn't want this happening in CI, and on a user's local box, I think its reasonable to assume folks are rational actors aren't trying to mutiplex system-wide tool installs on a regular basis
Ahh... is this so that the same ARM host box can run multiple VMM test jobs from multiple pipeline run submissions? |
Yes this is the primary motivation for this change. We could go without it, but CI will be slower. |
We're going to cross compile from x64, right? So this is just for unit and VMM tests. Can we just up the concurrency limits on ARM tests to better utilize the machines (and free them up the for the next user sooner), rather than try to run multiple jobs concurrently? |
For Linux, we could also run in containers. For Windows that isn't an option, really, due to the lack of WHP support in Windows containers. (Maybe we should get that fixed.) |
We could cross compile from x64, but the machines have lots of extra resources, so I was thinking of using them for some of the compilation jobs as well. Those jobs could use VMs or containers though, even on Windows. The unit tests probably also don't need virtualization support, so this would really just be for VMM tests. Increase concurrency limits might help a little in the future, but right now most of the time is spent doing things that aren't easily parallelized within the context of a single job. |
@jstarks, trevor shared some of the preliminary numbers from running build jobs on the ARM boxes, and the results are really good. We are talking 2x improvements over the existing 1ES runners. I'm all for leveraging them for our compile jobs, including for cross-compiling x86 from ARM. @tjones60, can we improve VMM test execution speed by caching VHDs on a shared persistent disk? there'd be some flowey tweaks to get |
OK, great. Then we can even containers on Windows for the build steps. Concurrent jobs on the same OS instance is not something we want to do. |
Maybe we should remove the retry loop for the local case (does this run in the local case?) |
It does run in the local case, where it'll prompt the user to install required packages during I think its reasonable to only run once when using the local backend, and have it run N times on ADO / GH, with a clear comment explaining that it is a workaround for CI infra |
I opened a new PR with just the apt retry changes: #174 |
Use separate rustup and cargo directories for each Github runner in case there are multiple on the same machine, and introduce retries for apt-get to avoid errors related to multiple processes trying to install things. Also makes sure that GCC is installed on Linux.