
Sudden reconcile failures/progress failure #3418

Open
1 of 2 tasks
meeech opened this issue Mar 1, 2024 · 5 comments
Labels
bug Something isn't working

Comments

@meeech
Contributor

meeech commented Mar 1, 2024

So this is a fun one :D

  • Have a rollout that has been just fine for ~24 hours
    • the rollout was in a completed state (can see this based on notifications)
  • Suddenly Rollouts starts sending 2 cancelled notifications. Then, for the next 11 hours, it keeps randomly sending cancelled messages, usually in bursts (not just one)
  • In the logs, we start seeing a lot of these messages, at a rate of ~3 logs every 10 secs (a plain update conflict; see the sketch below):
time="2024-02-27T21:41:41Z" level=error msg="roCtx.reconcile err failed to reconcileCanaryReplicaSets in syncReplicasOnly: failed to scaleReplicaSetAndRecordEvent in reconcileNewReplicaSet: failed to scaleReplicaSet in scaleReplicaSetAndRecordEvent: error updating replicaset service-x-v1-64c6df4d6f: Operation cannot be fulfilled on replicasets.apps \"service-x-v1-64c6df4d6f\": the object has been modified; please apply your changes to the latest version and try again" generation=33083 namespace=default resourceVersion=2609627010 rollout=service-x-v1

And these (which I assume is why we start seeing a lot of the cancelled notifs):

- `time="2024-02-27T21:55:33Z" level=warning msg="ReplicaSet \"service-x-v1-==64c6df4d6f==\" has timed out progressing." event_reason=RolloutAborted namespace=default rollout=service-x-v1`
- `time="2024-02-27T21:55:44Z" level=warning msg="ReplicaSet \"service-x-v1-==64c6df4d6f==\" has timed out progressing." event_reason=RolloutAborted namespace=default rollout=service-x-v1`

This went on for hours. Here is a sample; no real pattern timing-wise:

...
Argo Rollouts Notifications
APP
7:46 PM
:red-build: service-x-v1 canary deployment was cancelled.
7:46
:red-build: service-x-v1 canary deployment was cancelled.
7:50
:red-build: service-x-v1 canary deployment was cancelled.
7:51
:red-build: service-x-v1 canary deployment was cancelled.
7:51
:red-build: service-x-v1 canary deployment was cancelled.

Argo Rollouts Notifications
APP
8:05 PM
:red-build: service-x-v1 canary deployment was cancelled.
8:05
:red-build: service-x-v1 canary deployment was cancelled.

Argo Rollouts Notifications
APP
8:16 PM
:red-build: service-x-v1 canary deployment was cancelled.
8:16
:red-build: service-x-v1 canary deployment was cancelled.
...

And then, 11 hours after the first message, it just stopped. It started 16:55 -> ended 03:55.
To be clear: there was no rollout in progress. We have more than 100 rollouts in the cluster, but this was the only one where this happened.

Can't reproduce. It just 'happened'.

Checklist:

  • [ ] I've included steps to reproduce the bug.
  • [x] I've included the version of argo rollouts.

Version

1.6.6

Logs

In the body; it was just those over and over.


Message from the maintainers:

Impacted by this bug? Give it a 👍. We prioritize the issues with the most 👍.

@meeech added the bug label Mar 1, 2024
@meeech changed the title from "Sudden reconcile failures" to "Sudden reconcile failures/progress failure" Mar 1, 2024
@zachaller
Collaborator

Did the controller restart around the time it was fixed?

@meeech
Contributor Author

meeech commented Mar 2, 2024

As far as I can tell, it didn't.

Digging more, here's a bit more info, which is weird:

These seem to come in clusters together, and are also happening throughout: before, during, and after the bad notifs.

time="2024-02-28T08:56:57Z" level=error msg="roCtx.reconcile err failed to reconcileCanaryReplicaSets in syncReplicasOnly: failed to scaleReplicaSetAndRecordEvent in reconcileNewReplicaSet: failed to scaleReplicaSet in scaleReplicaSetAndRecordEvent: error updating replicaset service-x-64c6df4d6f: Operation cannot be fulfilled on replicasets.apps \"service-x-64c6df4d6f\": the object has been modified; please apply your changes to the latest version and try again" 
time="2024-02-28T08:56:57Z" level=error msg="failed to reconcileCanaryReplicaSets in syncReplicasOnly: failed to scaleReplicaSetAndRecordEvent in reconcileNewReplicaSet: failed to scaleReplicaSet in scaleReplicaSetAndRecordEvent: error updating replicaset service-x-64c6df4d6f: Operation cannot be fulfilled on replicasets.apps \"service-x-64c6df4d6f\": the object has been modified; please apply your changes to the latest version and try again\n" error="<nil>"
time="2024-02-28T08:56:57Z" level=error msg="rollout syncHandler error: failed to reconcileCanaryReplicaSets in syncReplicasOnly: failed to scaleReplicaSetAndRecordEvent in reconcileNewReplicaSet: failed to scaleReplicaSet in scaleReplicaSetAndRecordEvent: error updating replicaset service-x-64c6df4d6f: Operation cannot be fulfilled on replicasets.apps \"service-x-64c6df4d6f\": the object has been modified; please apply your changes to the latest version and try again"

Also seeing these in the logs:

03:51:09.660	time="2024-02-28T08:51:09Z" level=warning msg="Removing abort condition from fully promoted rollout" namespace=default rollout=service-x

The first one lines up directly with the first 'wrong' notification received, and the last one in the logs lines up with the last 'wrong' notif received.

And the same is true for:

time="2024-02-28T08:55:43Z" level=error msg="Failed to run trigger, trigger: on-rollout-aborted, destination: {slack argo-rollouts-notifications}, namespace config:  : trigger 'on-rollout-aborted' is not configured"

Which is weird, since this trigger is configured. These 'Failed to run trigger' messages are in the logs throughout the time this was going on.

The 'Removing abort condition...' and all the 'failed to reconcileCanaryReplicaSets in syncReplicasOnly: failed to scaleReplicaSetAndRecordEvent' messages do stop, when a new rollout for that service starts hours later.

However, the "Failed to run trigger, trigger: on-rollout-aborted, ... messages stop at 3:55 - exactly when the last message to the slack channel was sent. And looking through the logs, and comparing with timestamps in slack, these messages only show in logs when I have those bad notifs in slack about the cancelled rollouts.

Looking at when the notifs start, this coincides with the first 'false alarm' notif:

16:55:33.475 time="2024-02-27T21:55:33Z" level=error msg="Failed to run trigger, trigger: on-rollout-aborted, destination: {slack argo-rollouts-notifications}, namespace config:  : trigger 'on-rollout-aborted' is not configured"
16:55:33.479 time="2024-02-27T21:55:33Z" level=error msg="Failed to run trigger, trigger: on-rollout-aborted, destination: {slack pipelines-eng-alerts}, namespace config:  : trigger 'on-rollout-aborted' is not configured"
16:55:33.701 time="2024-02-27T21:55:33Z" level=warning msg="ReplicaSet \"service-x-64c6df4d6f\" has timed out progressing." event_reason=RolloutAborted namespace=default rollout=orb-service-compile-v1

So both those triggers it says aren't configured are, in fact, configured.

But beyond that, nothing is standing out.

@Laakso

Laakso commented Apr 30, 2024

I also stumbled upon this. Do you happen to have an HPA? The only way I was able to fix this was to restart the Argo Rollouts controller.

"level":"info","msg":"rollout syncHandler queue retries: 80 : key \"helloworld/helloworld\"","namespace":"helloworld","rollout":"helloworld","time":"2024-04-30T12:09:18Z"} time="2024-04-30T12:09:18Z" level=error msg="failed to scaleReplicaSetAndRecordEvent in reconcileCanaryStableReplicaSet:L failed to scaleReplicaSet in scaleReplicaSetAndRecordEvent: error updating replicaset helloworld-866f6f5d58: Operation cannot be fulfilled on replicasets.apps \"helloworld-866f6f5d58\": the object has been modified; please apply your changes to the latest version and try again\n" error="<nil>"

@meeech
Contributor Author

meeech commented Apr 30, 2024

@Laakso Yes, we had an HPA on this service.

@Laakso

Laakso commented Apr 30, 2024

I suspect it has something to do with this issue. It is hard to verify, though.
