Thundering herd causes canary to be stuck in finalising #1600

Open
shysank opened this issue Feb 21, 2024 · 2 comments

shysank commented Feb 21, 2024

Describe the bug

When a canary for a deployment serving high RPS is promoted to primary, the new primary fails because it doesn't have enough replicas to handle the load.

To Reproduce

  1. Apply a Canary object for a high-RPS deployment with stepWeight: 3, interval: 1m (3% every minute)
  2. Have a fairly large gap between the HPA min and max, e.g. minReplicas: "3", maxReplicas: "30"
  3. Observe the deployment

Expected behavior

The deployment should succeed

Actual behavior

  1. The canary traffic shift completes successfully in about half an hour, since it's 3% every minute.
  2. The canary deployment is scaled up by the HPA as more traffic is shifted to it.
  3. At the same time, the primary deployment is scaled down by the HPA for the same reason.
  4. The canary is promoted to primary.
  5. The new primary fails because it can't handle the load.
  6. The canary is stuck in the Finalising state.
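
The sequence above can be sketched with a small simulation. This is only an illustration under stated assumptions, not Flagger or HPA code: `simulate_shift` and its parameters are hypothetical names, the HPA is idealized as scaling replicas in proportion to traffic share, the full load is assumed to need 30 replicas, and the canary weight is assumed to be capped at 100.

```python
import math


def simulate_shift(step_weight=3, max_weight=100, hpa_min=3, hpa_max=30,
                   peak_replicas=30):
    """Return one (canary_weight, canary_replicas, primary_replicas) row per step.

    Assumption: an idealized HPA that sizes each deployment in proportion
    to its traffic share, clamped to [hpa_min, hpa_max].
    """
    def idealized_hpa(share_percent):
        wanted = math.ceil(peak_replicas * share_percent / 100)
        return max(hpa_min, min(hpa_max, wanted))

    rows = []
    weight = 0
    while weight < max_weight:
        # Shift another step of traffic to the canary, capped at max_weight.
        weight = min(max_weight, weight + step_weight)
        rows.append((weight, idealized_hpa(weight), idealized_hpa(100 - weight)))
    return rows


rows = simulate_shift()
final_weight, canary_replicas, primary_replicas = rows[-1]
# 34 steps; at 100% canary traffic the canary runs 30 replicas and the
# old primary has been scaled down to the HPA minimum of 3.
print(len(rows), final_weight, canary_replicas, primary_replicas)
```

In this model the cutover arrives with the canary at the HPA max and the primary at the HPA min, which is the state the promotion in step 4 starts from.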

This happens because the new primary deployment's replica count is only set when the HPA ref is nil. With an HPA configured, the new primary's replica count ends up at the HPA's min, and since that is a small value, the new primary cannot handle the load.
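
A simplified model of the behavior described here, written as a sketch rather than a copy of Flagger's promotion code. The function names are hypothetical, and the assumption that the canary's replica count is what gets copied when no HPA is referenced is mine, not confirmed by this report.

```python
from typing import Optional


def primary_replicas_after_promotion(canary_replicas: int,
                                     hpa_min: Optional[int]) -> int:
    """Model the reported behavior; hpa_min=None stands for a nil HPA ref."""
    if hpa_min is None:
        # Assumption: without an HPA ref the replica count is set explicitly
        # (modeled here as a copy of the canary's count).
        return canary_replicas
    # With an HPA ref the count is not set, so the new primary effectively
    # starts from the HPA minimum.
    return hpa_min


def capacity_ratio(primary_replicas: int, replicas_needed: int) -> float:
    """Fraction of the required capacity available right after promotion."""
    return primary_replicas / replicas_needed


with_hpa = primary_replicas_after_promotion(canary_replicas=30, hpa_min=3)
without_hpa = primary_replicas_after_promotion(canary_replicas=30, hpa_min=None)
print(with_hpa, capacity_ratio(with_hpa, 30))        # 3 0.1
print(without_hpa, capacity_ratio(without_hpa, 30))  # 30 1.0
```

With the numbers from this report (min 3, canary scaled to 30), the new primary would take 100% of the traffic with roughly a tenth of the capacity the canary had.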

Additional context

  • Flagger version: 1.34
  • Kubernetes version: 1.27
  • Service Mesh provider: istio
  • Ingress provider: istio

Workarounds

  1. Set stepWeightPromotion so that promotion does a partial traffic shift. Since a gradual shift was already done as part of the canary analysis, this seems redundant.
  2. Avoid a low HPA min. This isn't ideal for workloads whose off-peak traffic is low, since it wastes resources.
  3. If stepWeightPromotion: 100 (or via another variable like promotionReplicas), set the primary replicas to the canary replicas. This seems logical, but I'm not sure how the HPA will react.
@shysank shysank changed the title Thundering herd fails primary during finalising Thundering herd causes canary to be stuck finalising Feb 21, 2024
@shysank shysank changed the title Thundering herd causes canary to be stuck finalising Thundering herd causes canary to be stuck in finalising Feb 21, 2024
@stefanprodan
Member

stefanprodan commented Feb 21, 2024

Use stepWeightPromotion to progressively shift traffic back to the primary, thus avoiding the thundering herd.

Docs: https://docs.flagger.app/usage/deployment-strategies#canary-release

@shysank
Author

shysank commented Feb 21, 2024

@stefanprodan Thanks for the response. stepWeightPromotion is what we're planning to use. Unfortunately, a side effect is that deployment time doubles for the same strategy without much benefit: we already know the new build works, and even if it didn't, we cannot roll back to an older deployment at this point.
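
The doubling can be checked with a rough count of traffic-shifting intervals. This is a back-of-the-envelope sketch: the helper names are hypothetical, a max weight of 100 is assumed, and only the shifting steps are counted (any extra intervals Flagger spends in other phases are ignored).

```python
import math


def shift_intervals(step: int, max_weight: int = 100) -> int:
    """Intervals needed to move max_weight percent of traffic in `step` increments."""
    return math.ceil(max_weight / step)


def rollout_intervals(step_weight: int, step_weight_promotion: int,
                      max_weight: int = 100) -> int:
    """Analysis shift to the canary plus promotion shift back to the primary."""
    analysis = shift_intervals(step_weight, max_weight)
    promotion = shift_intervals(step_weight_promotion, max_weight)
    return analysis + promotion


one_shot = rollout_intervals(step_weight=3, step_weight_promotion=100)
gradual = rollout_intervals(step_weight=3, step_weight_promotion=3)
print(one_shot, gradual)  # 35 68
```

At one interval per minute, that is about 35 minutes with a single-step promotion versus about 68 minutes when promotion uses the same 3% step.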

Would it make sense to set the new primary replicas to match the canary's when stepWeightPromotion is 100? Or to add another variable like primaryReplicas? Happy to work on a patch if either of these makes sense.
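
To make the proposal concrete, here is a minimal sketch of the decision it implies. Nothing here exists in Flagger: `proposed_primary_replicas` and the `primary_replicas` override are hypothetical, and clamping the result to the HPA bounds is my assumption about how to keep the starting point within what the HPA allows.

```python
from typing import Optional


def proposed_primary_replicas(canary_replicas: int,
                              step_weight_promotion: int,
                              hpa_min: int,
                              hpa_max: int,
                              primary_replicas: Optional[int] = None) -> int:
    """Pick the new primary's starting replica count under the proposal."""
    if primary_replicas is not None:
        # Hypothetical explicit override (the "primaryReplicas" idea).
        wanted = primary_replicas
    elif step_weight_promotion >= 100:
        # All traffic moves in one step, so start from the canary's size.
        wanted = canary_replicas
    else:
        # Gradual promotion: keep the behavior reported in this issue.
        wanted = hpa_min
    # Assumption: never start outside the HPA's configured bounds.
    return max(hpa_min, min(hpa_max, wanted))


print(proposed_primary_replicas(30, 100, hpa_min=3, hpa_max=30))  # 30
print(proposed_primary_replicas(30, 3, hpa_min=3, hpa_max=30))    # 3
print(proposed_primary_replicas(30, 100, hpa_min=3, hpa_max=30,
                                primary_replicas=20))             # 20
```

The value would only be a starting point; how the HPA adjusts the primary afterwards is the open question raised above.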
