
fix(controller): stuck in scale event after canary service switch delay. Fixes #3412 #3413

Open · BrunoTarijon wants to merge 5 commits into master from fix-stock-rollout-scale-during-delay-service-switch
Conversation


@BrunoTarijon BrunoTarijon commented Feb 29, 2024

Start a service reconcile if a ReplicaSet is "still referenced", to fix this issue.

WIP - e2e test
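For readers following along, here is a minimal, self-contained Go sketch of the idea; every name in it is illustrative, not an actual argo-rollouts type or function. During a scaling event the controller normally syncs replica counts only; the change additionally reconciles the services when an old ReplicaSet is skipped because it is "still referenced", so the switch can complete and release the reference.

package main

import "fmt"

// Illustrative stand-ins for controller state; not argo-rollouts types.
type rollout struct {
	scalingEvent    bool // the rollout spec's replica count changed
	oldRSReferenced bool // an old ReplicaSet is still selected by a service
	serviceSwitched bool
}

// reconcile models a single pass of the controller loop.
func reconcile(ro *rollout) {
	if ro.scalingEvent {
		// Original behavior: sync replica counts only. When the old RS
		// is still referenced, its scale-down is skipped and nothing
		// else runs, so every pass repeats this branch forever.
		if ro.oldRSReferenced {
			fmt.Println("Skip scale down of older RS: still referenced")
			// The fix: also reconcile services in this one case, so
			// the reference is released and scale-down can proceed.
			reconcileServices(ro)
		}
		return
	}
	// Normal full reconciliation (services, traffic routing, analysis, ...).
	reconcileServices(ro)
}

func reconcileServices(ro *rollout) {
	ro.serviceSwitched = true
	ro.oldRSReferenced = false
	fmt.Println("switched service selector to the new ReplicaSet")
}

func main() {
	ro := &rollout{scalingEvent: true, oldRSReferenced: true}
	reconcile(ro)
	fmt.Printf("service switched: %v\n", ro.serviceSwitched)
}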

Checklist:

  • Either (a) I've created an enhancement proposal and discussed it with the community, (b) this is a bug fix, or (c) this is a chore.
  • The title of the PR is (a) conventional with a list of types and scopes found here, (b) states what changed, and (c) suffixes the related issues number. E.g. "fix(controller): Updates such and such. Fixes #1234".
  • I've signed my commits with DCO
  • I have written unit and/or e2e tests for my change. PRs without these are unlikely to be merged.
  • My builds are green. Try syncing with master if they are not.
  • My organization is added to USERS.md.

@BrunoTarijon BrunoTarijon force-pushed the fix-stock-rollout-scale-during-delay-service-switch branch from 8a35ef7 to 3d43872 on February 29, 2024 04:19

codecov bot commented Feb 29, 2024

Codecov Report

Attention: Patch coverage is 50.00000%, with 2 lines in your changes missing coverage. Please review.

Project coverage is 82.80%. Comparing base (8405f2e) to head (5acc1af).
Report is 53 commits behind head on master.

Files             | Patch % | Lines
rollout/canary.go | 50.00%  | 1 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #3413      +/-   ##
==========================================
+ Coverage   81.83%   82.80%   +0.96%     
==========================================
  Files         135      135              
  Lines       20688    17002    -3686     
==========================================
- Hits        16931    14079    -2852     
+ Misses       2883     2035     -848     
- Partials      874      888      +14     



github-actions bot commented Feb 29, 2024

Go Published Test Results

2 142 tests in 119 suites and 1 file: 2 142 ✅, 0 💤, 0 ❌ (2m 51s ⏱️)

Results for commit 5acc1af.

♻️ This comment has been updated with latest results.

@BrunoTarijon BrunoTarijon changed the title fix(controller): stuck in scale event after canary service switch delay fix(controller): stuck in scale event after canary service switch delay. Fixes #3412 Feb 29, 2024

github-actions bot commented Feb 29, 2024

E2E Tests Published Test Results

4 files, 4 suites (3h 52m 59s ⏱️)
109 tests: 98 ✅, 6 💤, 5 ❌
482 runs: 408 ✅, 24 💤, 50 ❌

For more details on these failures, see this check.

Results for commit 5acc1af.

♻️ This comment has been updated with latest results.

Signed-off-by: BrunoTarijon <[email protected]>
@BrunoTarijon BrunoTarijon force-pushed the fix-stock-rollout-scale-during-delay-service-switch branch from 9f8b822 to 7c105d5 on March 4, 2024 20:01
@BrunoTarijon BrunoTarijon marked this pull request as ready for review March 4, 2024 20:14
@zachaller zachaller added this to the v1.7 milestone Mar 4, 2024
@jessesuen (Member) commented

> Rollout controller in an infinite loop trying to reconcile a still-referenced ReplicaSet.
> All reconciles on this rollout take the isScalingEvent path, and will never make the service switch.

The approach this PR takes is to introduce reconciliation of the rollout's services even if the rollout is in an isScalingEvent. However, the original intent/design was to reconcile nothing but ReplicaSet counts during a scaling event, and to leave normal reconciliation to handle things like service reconciliation. So this PR would break that rule.

> and will never make the service switch.

I would like to confirm or deny whether this statement is accurate. My belief is that it eventually will, but only upon the next reconciliation of the Rollout. In your testing, if you were to do any of the following:

  1. wait 15 minutes (the default rollout resync period)
  2. restart the controller
  3. tickle the rollout (e.g. add an annotation)

I believe the rollout would get "unstuck". If what I say is true, then I think the fix should be done in a different way. That is, the scaling event could requeue the rollout so that it performs a normal reconciliation to unstick the service switch.

NOTE: I haven't tested my claims, so I could be completely off base. But I'd like to confirm if what I'm saying is true.
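To make his alternative concrete, here is a hedged sketch assuming a workqueue-style controller; enqueueAfter is a stand-in for a workqueue's AddAfter, and none of these names are the actual argo-rollouts API. The scaling-event path would not touch services itself; it would requeue the rollout so a later, normal reconciliation performs the switch.

package main

import (
	"fmt"
	"time"
)

// enqueueAfter stands in for a controller workqueue's AddAfter; a real
// controller would requeue the rollout's key, not a closure.
func enqueueAfter(d time.Duration, work func()) {
	time.AfterFunc(d, work)
}

func main() {
	done := make(chan struct{})

	// Scaling-event path: replica counts only. If an old RS is still
	// referenced, requeue for a full pass instead of switching services here.
	stillReferenced := true
	if stillReferenced {
		fmt.Println("still referenced: requeueing for normal reconciliation")
		enqueueAfter(10*time.Millisecond, func() {
			fmt.Println("normal reconciliation: switching service selector")
			close(done)
		})
	}
	<-done
}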

@BrunoTarijon BrunoTarijon (Author) commented

I did the test here to confirm.

I waited for 20 minutes. After that, I restarted make start-e2e (I think that is the same as restarting the controller). It is still stuck.

Logs
INFO[0060] Started syncing rollout                       generation=5 namespace=default resourceVersion=79977 rollout=alb-canary-with-delay
INFO[0060] Syncing replicas only due to scaling event    namespace=default rollout=alb-canary-with-delay
INFO[0060] Reconciling 1 old ReplicaSets (total pods: 1)  namespace=default rollout=alb-canary-with-delay
INFO[0060] Found 1 available pods in old RS default/alb-canary-with-delay-59ffd4945b  namespace=default rollout=alb-canary-with-delay
INFO[0060] Found 5 available pods, scaling down old RSes (minAvailable: 3, maxScaleDown: 2)  namespace=default rollout=alb-canary-with-delay
INFO[0060] Proceed with scaledown: true                  namespace=default rollout=alb-canary-with-delay
INFO[0060] Skip scale down of older RS 'alb-canary-with-delay-59ffd4945b': still referenced  namespace=default rollout=alb-canary-with-delay
INFO[0060] No status changes. Skipping patch             generation=5 namespace=default resourceVersion=79977 rollout=alb-canary-with-delay
INFO[0060] Queueing up Rollout for a progress check now  namespace=default rollout=alb-canary-with-delay
INFO[0060] Reconciliation completed                      generation=5 namespace=default resourceVersion=79977 rollout=alb-canary-with-delay time_ms=0.43729100000000004
INFO[0060] Started syncing rollout                       generation=5 namespace=default resourceVersion=79977 rollout=alb-canary-with-delay
INFO[0060] Syncing replicas only due to scaling event    namespace=default rollout=alb-canary-with-delay
INFO[0060] Reconciling 1 old ReplicaSets (total pods: 1)  namespace=default rollout=alb-canary-with-delay
INFO[0060] Found 1 available pods in old RS default/alb-canary-with-delay-59ffd4945b  namespace=default rollout=alb-canary-with-delay
INFO[0060] Found 5 available pods, scaling down old RSes (minAvailable: 3, maxScaleDown: 2)  namespace=default rollout=alb-canary-with-delay
INFO[0060] Proceed with scaledown: true                  namespace=default rollout=alb-canary-with-delay
INFO[0060] Skip scale down of older RS 'alb-canary-with-delay-59ffd4945b': still referenced  namespace=default rollout=alb-canary-with-delay
INFO[0060] No status changes. Skipping patch             generation=5 namespace=default resourceVersion=79977 rollout=alb-canary-with-delay
INFO[0060] Queueing up Rollout for a progress check now  namespace=default rollout=alb-canary-with-delay
INFO[0060] Reconciliation completed                      generation=5 namespace=default resourceVersion=79977 rollout=alb-canary-with-delay time_ms=0.468333
INFO[0060] Started syncing rollout                       generation=5 namespace=default resourceVersion=79977 rollout=alb-canary-with-delay
INFO[0060] Syncing replicas only due to scaling event    namespace=default rollout=alb-canary-with-delay
INFO[0060] Reconciling 1 old ReplicaSets (total pods: 1)  namespace=default rollout=alb-canary-with-delay
INFO[0060] Found 1 available pods in old RS default/alb-canary-with-delay-59ffd4945b  namespace=default rollout=alb-canary-with-delay
INFO[0060] Found 5 available pods, scaling down old RSes (minAvailable: 3, maxScaleDown: 2)  namespace=default rollout=alb-canary-with-delay
INFO[0060] Proceed with scaledown: true                  namespace=default rollout=alb-canary-with-delay
INFO[0060] Skip scale down of older RS 'alb-canary-with-delay-59ffd4945b': still referenced  namespace=default rollout=alb-canary-with-delay
INFO[0060] No status changes. Skipping patch             generation=5 namespace=default resourceVersion=79977 rollout=alb-canary-with-delay
INFO[0060] Queueing up Rollout for a progress check now  namespace=default rollout=alb-canary-with-delay
INFO[0060] Reconciliation completed                      generation=5 namespace=default resourceVersion=79977 rollout=alb-canary-with-delay time_ms=2.182709
INFO[0060] Started syncing rollout                       generation=5 namespace=default resourceVersion=79977 rollout=alb-canary-with-delay
INFO[0060] Syncing replicas only due to scaling event    namespace=default rollout=alb-canary-with-delay
INFO[0060] Reconciling 1 old ReplicaSets (total pods: 1)  namespace=default rollout=alb-canary-with-delay
INFO[0060] Found 1 available pods in old RS default/alb-canary-with-delay-59ffd4945b  namespace=default rollout=alb-canary-with-delay
INFO[0060] Found 5 available pods, scaling down old RSes (minAvailable: 3, maxScaleDown: 2)  namespace=default rollout=alb-canary-with-delay
INFO[0060] Proceed with scaledown: true                  namespace=default rollout=alb-canary-with-delay
INFO[0060] Skip scale down of older RS 'alb-canary-with-delay-59ffd4945b': still referenced  namespace=default rollout=alb-canary-with-delay
INFO[0060] No status changes. Skipping patch             generation=5 namespace=default resourceVersion=79977 rollout=alb-canary-with-delay
INFO[0060] Queueing up Rollout for a progress check now  namespace=default rollout=alb-canary-with-delay
INFO[0060] Reconciliation completed                      generation=5 namespace=default resourceVersion=79977 rollout=alb-canary-with-delay time_ms=0.5967089999999999

@BrunoTarijon BrunoTarijon (Author) commented

Maybe I can requeue the rollout and add some kind of skip-scale mechanism.
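Read one way (purely hypothetical; neither the reconcileHint type nor a skip-scale flag exists in the codebase as of this PR), the requeued reconciliation would carry a hint to bypass the replicas-only short-circuit a single time:

package main

import "fmt"

// reconcileHint is hypothetical: a one-shot signal for the next pass.
type reconcileHint struct {
	skipScalingShortCircuit bool
}

func reconcile(hint reconcileHint, scalingEvent bool) {
	if scalingEvent && !hint.skipScalingShortCircuit {
		fmt.Println("replicas-only pass; requeueing with a skip-scale hint")
		// Model the requeue as a direct call with the hint set.
		reconcile(reconcileHint{skipScalingShortCircuit: true}, scalingEvent)
		return
	}
	fmt.Println("full reconciliation: service switch happens here")
}

func main() {
	reconcile(reconcileHint{}, true)
}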

@BrunoTarijon BrunoTarijon force-pushed the fix-stock-rollout-scale-during-delay-service-switch branch from 444f57c to 5acc1af on March 23, 2024 19:49

sonarcloud bot commented Mar 23, 2024

Quality Gate passed

Issues
0 New issues
0 Accepted issues

Measures
0 Security Hotspots
No data about Coverage
9.0% Duplication on New Code

See analysis details on SonarCloud

Development

Successfully merging this pull request may close these issues.

Rollout stuck when a scale event happens during a service switch delay