Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

AzureVMAgentCleanUpTask MUST NOT run more than one cleanup at a time #373

Open
Peter-Darton-i2 opened this issue Sep 22, 2022 · 0 comments
Labels

Comments

@Peter-Darton-i2
Copy link

Jenkins and plugins versions report

Environment
Jenkins: 2.361.1
azure-credentials:216.ve0b_4a_485ffc2
azure-sdk:118.v43f74dd9ca_dc
azure-vm-agents:822.v3a18fc3d2de1

What Operating System are you using (both controller, and any agents involved in the problem)?

Linux - 4.14.268-205.500.amzn2.x86_64

Reproduction steps

In real-life: Arrange/wait for Azure to not respond swiftly to API requests (it happens if Azure is "having a really bad day" and possibly for other reasons too).

In a dev environment: add a bunch of calls to Thread.sleep() into AzureVMAgentCleanUpTask.cleanup() such that they won't finish in a timely manner (you may also need to ignore/clear the thread's interrupt status to fully re-create the worst-case scenario)

Either way, the end effect was that calls to do anything with Azure "took forever", meaning that AzureVMAgentCleanUpTask.cleanup() did not complete.

Expected Results

If, when AzureVMAgentCleanUpTask.execute is called, it finds that a previous call is still in progress, then the just-triggered call should abort (and say so to the log).
It should NOT commence another cleanup() to run in parallel.

...and, if it times out waiting for the cleanup to complete, maybe it should abort the task it was waiting for instead of leaving it running.

Regardless of "how" this is achieved, the end effect should be that we never have more than one thread executing AzureVMAgentCleanUpTask.cleanup() at any one time.

Actual Results

Each time the Jenkins controller triggered the AzureVMAgentCleanUpTask, a new parallel run of AzureVMAgentCleanUpTask.cleanup() was commenced.
The Jenkins controller thread dump relealed that we had 5 of them running (in parallel), all in different phases of completion (whereas we should only have one at a time, at most).

It looks like AzureVMAgentCleanUpTask.execute() gets started every 5 minutes, and each run will wait up to 15 minutes for AzureVMAgentCleanUpTask.cleanup() to complete before logging an error
... but if the call to result.get(CLEAN_TIMEOUT_IN_MINUTES, TimeUnit.MINUTES) times out it will leave the cleanup thread running in the background.
...and it never checks to see if a cleanup is already running before starting a new one.

Anything else?

The as-is situation means that, if things are going bad, this plugin makes things worse, effectively DoS-attacking the controller and making the bad situation harder to diagnose & recover.

AzureVMAgentCleanUpTask needs to have some form of "am I already running? If so, don't start a new run" logic to prevent it from having more than one cleanup running in parallel (a AtomicBoolean field could suffice).

It could/should also add in some "if we timeout waiting for clean() to return then send an interrupt" logic (i.e. in the catch(TimeoutException block, include a call to result.cancel(true)) ... but this would only be a good idea if the cleanup code is able to react to interrupts nicely (if it doesn't then we'll simply have to wait for it to finish).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

1 participant