Skip to content

AzureVMAgentCleanUpTask MUST NOT run more than one cleanup at a time #373

Open
@Peter-Darton-i2

Description

@Peter-Darton-i2

Jenkins and plugins versions report

Environment
Jenkins: 2.361.1
azure-credentials:216.ve0b_4a_485ffc2
azure-sdk:118.v43f74dd9ca_dc
azure-vm-agents:822.v3a18fc3d2de1

What Operating System are you using (both controller, and any agents involved in the problem)?

Linux - 4.14.268-205.500.amzn2.x86_64

Reproduction steps

In real-life: Arrange/wait for Azure to not respond swiftly to API requests (it happens if Azure is "having a really bad day" and possibly for other reasons too).

In a dev environment: add a bunch of calls to Thread.sleep() into AzureVMAgentCleanUpTask.cleanup() such that they won't finish in a timely manner (you may also need to ignore/clear the thread's interrupt status to fully re-create the worst-case scenario)

Either way, the end effect was that calls to do anything with Azure "took forever", meaning that AzureVMAgentCleanUpTask.cleanup() did not complete.

Expected Results

If, when AzureVMAgentCleanUpTask.execute is called, it finds that a previous call is still in progress, then the just-triggered call should abort (and say so to the log).
It should NOT commence another cleanup() to run in parallel.

...and, if it times out waiting for the cleanup to complete, maybe it should abort the task it was waiting for instead of leaving it running.

Regardless of "how" this is achieved, the end effect should be that we never have more than one thread executing AzureVMAgentCleanUpTask.cleanup() at any one time.

Actual Results

Each time the Jenkins controller triggered the AzureVMAgentCleanUpTask, a new parallel run of AzureVMAgentCleanUpTask.cleanup() was commenced.
The Jenkins controller thread dump relealed that we had 5 of them running (in parallel), all in different phases of completion (whereas we should only have one at a time, at most).

It looks like AzureVMAgentCleanUpTask.execute() gets started every 5 minutes, and each run will wait up to 15 minutes for AzureVMAgentCleanUpTask.cleanup() to complete before logging an error
... but if the call to result.get(CLEAN_TIMEOUT_IN_MINUTES, TimeUnit.MINUTES) times out it will leave the cleanup thread running in the background.
...and it never checks to see if a cleanup is already running before starting a new one.

Anything else?

The as-is situation means that, if things are going bad, this plugin makes things worse, effectively DoS-attacking the controller and making the bad situation harder to diagnose & recover.

AzureVMAgentCleanUpTask needs to have some form of "am I already running? If so, don't start a new run" logic to prevent it from having more than one cleanup running in parallel (a AtomicBoolean field could suffice).

It could/should also add in some "if we timeout waiting for clean() to return then send an interrupt" logic (i.e. in the catch(TimeoutException block, include a call to result.cancel(true)) ... but this would only be a good idea if the cleanup code is able to react to interrupts nicely (if it doesn't then we'll simply have to wait for it to finish).

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions