
Sudden reconcile failures/progress failure #3418

Open
1 of 2 tasks
meeech opened this issue Mar 1, 2024 · 5 comments
Labels
bug Something isn't working

Comments

@meeech
Contributor

meeech commented Mar 1, 2024

So this is a fun one :D

  • Have a rollout that has been just fine for ~24 hours
    • the rollout was in a completed state (can see this based on notifications)
  • Suddenly Rollouts starts sending 2 cancelled notifications. Then, for the next 11 hours, it keeps randomly sending cancelled messages, usually in bursts (not just one)
  • In the logs, we start seeing a lot of these messages, at a rate of ~3 logs every 10 secs (a plain update conflict; see the sketch below):
time="2024-02-27T21:41:41Z" level=error msg="roCtx.reconcile err failed to reconcileCanaryReplicaSets in syncReplicasOnly: failed to scaleReplicaSetAndRecordEvent in reconcileNewReplicaSet: failed to scaleReplicaSet in scaleReplicaSetAndRecordEvent: error updating replicaset service-x-v1-64c6df4d6f: Operation cannot be fulfilled on replicasets.apps \"service-x-v1-64c6df4d6f\": the object has been modified; please apply your changes to the latest version and try again" generation=33083 namespace=default resourceVersion=2609627010 rollout=service-x-v1

And these (which I assume is why we start seeing a lot of the cancelled notifs):

- `time="2024-02-27T21:55:33Z" level=warning msg="ReplicaSet \"service-x-v1-==64c6df4d6f==\" has timed out progressing." event_reason=RolloutAborted namespace=default rollout=service-x-v1`
- `time="2024-02-27T21:55:44Z" level=warning msg="ReplicaSet \"service-x-v1-==64c6df4d6f==\" has timed out progressing." event_reason=RolloutAborted namespace=default rollout=service-x-v1`

This went on for hours. Here is a sample; no real pattern timing-wise:

...
Argo Rollouts Notifications
APP
7:46 PM
:red-build: service-x-v1 canary deployment was cancelled.
7:46
:red-build: service-x-v1 canary deployment was cancelled.
7:50
:red-build: service-x-v1 canary deployment was cancelled.
7:51
:red-build: service-x-v1 canary deployment was cancelled.
7:51
:red-build: service-x-v1 canary deployment was cancelled.

Argo Rollouts Notifications
APP
8:05 PM
:red-build: service-x-v1 canary deployment was cancelled.
8:05
:red-build: service-x-v1 canary deployment was cancelled.

Argo Rollouts Notifications
APP
8:16 PM
:red-build: service-x-v1 canary deployment was cancelled.
8:16
:red-build: service-x-v1 canary deployment was cancelled.
...

And then, 11 hours after the first message, it just stopped. It started 16:55 -> ended 03:55.
To be clear: there was no rollout in progress. We have more than 100 rollouts in the cluster, but this was the only one where this happened.

Can't reproduce. It just 'happened'.

Checklist:

  • [ ] I've included steps to reproduce the bug.
  • [x] I've included the version of argo rollouts.

Version

1.6.6

Logs

In the body; it was just those over and over.


Message from the maintainers:

Impacted by this bug? Give it a 👍. We prioritize the issues with the most 👍.

@meeech added the bug label Mar 1, 2024
@meeech changed the title from "Sudden reconcile failures" to "Sudden reconcile failures/progress failure" Mar 1, 2024
@zachaller
Collaborator

Did the controller restart around the time it was fixed?

@meeech
Contributor Author

meeech commented Mar 2, 2024

As far as I can tell, it didn't.

Digging more, here's a bit more info, which is weird:

These seem to come in clusters together, and are also happening throughout: before, during, and after the bad notifs.

time="2024-02-28T08:56:57Z" level=error msg="roCtx.reconcile err failed to reconcileCanaryReplicaSets in syncReplicasOnly: failed to scaleReplicaSetAndRecordEvent in reconcileNewReplicaSet: failed to scaleReplicaSet in scaleReplicaSetAndRecordEvent: error updating replicaset service-x-64c6df4d6f: Operation cannot be fulfilled on replicasets.apps \"service-x-64c6df4d6f\": the object has been modified; please apply your changes to the latest version and try again" 
time="2024-02-28T08:56:57Z" level=error msg="failed to reconcileCanaryReplicaSets in syncReplicasOnly: failed to scaleReplicaSetAndRecordEvent in reconcileNewReplicaSet: failed to scaleReplicaSet in scaleReplicaSetAndRecordEvent: error updating replicaset service-x-64c6df4d6f: Operation cannot be fulfilled on replicasets.apps \"service-x-64c6df4d6f\": the object has been modified; please apply your changes to the latest version and try again\n" error="<nil>"
time="2024-02-28T08:56:57Z" level=error msg="rollout syncHandler error: failed to reconcileCanaryReplicaSets in syncReplicasOnly: failed to scaleReplicaSetAndRecordEvent in reconcileNewReplicaSet: failed to scaleReplicaSet in scaleReplicaSetAndRecordEvent: error updating replicaset service-x-64c6df4d6f: Operation cannot be fulfilled on replicasets.apps \"service-x-64c6df4d6f\": the object has been modified; please apply your changes to the latest version and try again"

Also seeing these in the logs:

03:51:09.660	time="2024-02-28T08:51:09Z" level=warning msg="Removing abort condition from fully promoted rollout" namespace=default rollout=service-x

The first one lines up directly with the first 'wrong' notification received, and the last one in the logs lines up with the last 'wrong' notif received.

And the same is true for:

time="2024-02-28T08:55:43Z" level=error msg="Failed to run trigger, trigger: on-rollout-aborted, destination: {slack argo-rollouts-notifications}, namespace config:  : trigger 'on-rollout-aborted' is not configured"

Which is weird, since this trigger is configured. These 'Failed to run trigger' messages are in the logs throughout the time this was going on.

The 'Removing abort condition...' and all the 'failed to reconcileCanaryReplicaSets in syncReplicasOnly: failed to scaleReplicaSetAndRecordEvent' messages do stop, when a new rollout for that service starts hours later.

However, the "Failed to run trigger, trigger: on-rollout-aborted, ... messages stop at 3:55 - exactly when the last message to the slack channel was sent. And looking through the logs, and comparing with timestamps in slack, these messages only show in logs when I have those bad notifs in slack about the cancelled rollouts.

Looking at when the notifs start, this coincides with the first 'false alarm' notif:

16:55:33.475 time="2024-02-27T21:55:33Z" level=error msg="Failed to run trigger, trigger: on-rollout-aborted, destination: {slack argo-rollouts-notifications}, namespace config:  : trigger 'on-rollout-aborted' is not configured"
16:55:33.479 time="2024-02-27T21:55:33Z" level=error msg="Failed to run trigger, trigger: on-rollout-aborted, destination: {slack pipelines-eng-alerts}, namespace config:  : trigger 'on-rollout-aborted' is not configured"
16:55:33.701 time="2024-02-27T21:55:33Z" level=warning msg="ReplicaSet \"service-x-64c6df4d6f\" has timed out progressing." event_reason=RolloutAborted namespace=default rollout=orb-service-compile-v1

So both those triggers it says aren't configured are, in fact, configured.

But beyond that, nothing is standing out.

@Laakso

Laakso commented Apr 30, 2024

I also stumbled upon this. Do you happen to have an HPA? The only way I was able to fix this was to restart the Argo Rollouts controller.

"level":"info","msg":"rollout syncHandler queue retries: 80 : key \"helloworld/helloworld\"","namespace":"helloworld","rollout":"helloworld","time":"2024-04-30T12:09:18Z"} time="2024-04-30T12:09:18Z" level=error msg="failed to scaleReplicaSetAndRecordEvent in reconcileCanaryStableReplicaSet:L failed to scaleReplicaSet in scaleReplicaSetAndRecordEvent: error updating replicaset helloworld-866f6f5d58: Operation cannot be fulfilled on replicasets.apps \"helloworld-866f6f5d58\": the object has been modified; please apply your changes to the latest version and try again\n" error="<nil>"

@meeech
Contributor Author

meeech commented Apr 30, 2024

@Laakso Yes, we had an HPA on this service.

@Laakso

Laakso commented Apr 30, 2024

I suspect it has something to do with this issue. It is hard to verify, though.
