
Add Separate Handling for Connected and Disconnected Clusters in Round-Robin Sharding #21195

Open
RomyKess opened this issue Dec 16, 2024 · 3 comments
Labels
component:sharding enhancement New feature or request

Comments

RomyKess commented Dec 16, 2024

Summary

We use ArgoCD to deploy to over a hundred clusters. The tool works very well for us, except that it doesn't fully support our use case.
Assigning clusters to shards manually is not practical, especially because our clusters are turned on and off frequently, so adding a round-robin capability that takes cluster connection status into account would be a great and much-needed solution.

Motivation

We have around a hundred clusters, some of which are up and some of which are turned off, and their state changes frequently. Round robin isn't ideal here because it divides all clusters equally without differentiating by status (on or off). So one controller with 20 clusters can have 16 that are up, while another controller with 20 clusters has only 5 that are up. The first might experience 60 OOM kills in one day, while the other consumes only half of its memory limits.
This leads to a major waste of resources across all of our ArgoCD instances, a problem that only grows as the cluster count increases.

Proposal

The easiest way to see this is to compare the two metrics:

`sum(argocd_cluster_info) by (pod)`
vs.
`sum(argocd_cluster_connection_status) by (pod)`

For us, the values may be equal in the first but drastically different in the second.
We suggest adding a new mode to round robin that handles connected and disconnected clusters separately.
That way, connected clusters will be divided equally between the controllers, and so will disconnected clusters.
This should greatly improve the balance of the overall memory load. :)

The second metric also counts clusters with a connection status of Unknown, which we have found don't affect the controllers as much. So splitting the clusters by connection status, and treating only those with a Successful status as connected, is probably best.
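To make the idea concrete, here is a minimal Go sketch of status-aware round-robin. The names (`Cluster`, `assignShards`) are illustrative, not Argo CD's actual types: each group is round-robined independently, so every shard receives an equal share of connected clusters and an equal share of the rest.

```go
package main

import "fmt"

// Cluster is a minimal stand-in for an Argo CD cluster entry; Connected
// mirrors a connection status of "Successful" in the proposal above.
type Cluster struct {
	Name      string
	Connected bool
}

// assignShards round-robins connected and disconnected clusters
// independently, so each shard gets an equal share of each group.
func assignShards(clusters []Cluster, shards int) map[string]int {
	out := make(map[string]int)
	nextConnected, nextOther := 0, 0
	for _, c := range clusters {
		if c.Connected {
			out[c.Name] = nextConnected % shards
			nextConnected++
		} else {
			out[c.Name] = nextOther % shards
			nextOther++
		}
	}
	return out
}

func main() {
	clusters := []Cluster{
		{"a", true}, {"b", true}, {"c", false},
		{"d", true}, {"e", false}, {"f", true},
	}
	// Connected a,b,d,f alternate between shards 0 and 1;
	// disconnected c,e do the same independently.
	fmt.Println(assignShards(clusters, 2))
}
```

With plain round robin over the same input, shard 0 could end up with three connected clusters and shard 1 with one; the split above guarantees 2/2 for the connected group.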

@andrii-korotkov-verkada (Contributor) commented:

We'd need some form of consistent hashing, I guess, since we don't want a single cluster connect/disconnect to rehash all the other clusters.
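A minimal Go sketch of what such a consistent-hash ring could look like (this is an illustration under that suggestion, not Argo CD's actual sharding code): shards are placed on a ring via virtual nodes and each cluster hashes to the next shard clockwise, so one cluster appearing or disappearing never moves any other cluster.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

// Ring is a minimal consistent-hash ring. Each shard owns many virtual
// points on the ring; a cluster maps to the first point at or after its
// own hash (wrapping around).
type Ring struct {
	points []uint32
	owner  map[uint32]int
}

func hash32(s string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(s))
	return h.Sum32()
}

// NewRing places `vnodes` virtual points per shard on the ring.
func NewRing(shards, vnodes int) *Ring {
	r := &Ring{owner: make(map[uint32]int)}
	for s := 0; s < shards; s++ {
		for v := 0; v < vnodes; v++ {
			p := hash32(fmt.Sprintf("shard-%d-%d", s, v))
			r.points = append(r.points, p)
			r.owner[p] = s
		}
	}
	sort.Slice(r.points, func(i, j int) bool { return r.points[i] < r.points[j] })
	return r
}

// Shard returns the shard that owns the given cluster name.
func (r *Ring) Shard(cluster string) int {
	h := hash32(cluster)
	i := sort.Search(len(r.points), func(i int) bool { return r.points[i] >= h })
	if i == len(r.points) {
		i = 0 // wrap around the ring
	}
	return r.owner[r.points[i]]
}

func main() {
	ring := NewRing(3, 64)
	for _, c := range []string{"cluster-a", "cluster-b", "cluster-c"} {
		fmt.Println(c, "->", ring.Shard(c))
	}
}
```

Because cluster names hash to fixed positions, a cluster changing connection state only affects its own placement; adding a shard moves just the clusters that fall into the new shard's arcs.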


RomyKess commented Dec 24, 2024

Makes sense. Consistent hashing is already one of the provided sharding algorithms, so maybe that capability could be reused so that only a certain number of cluster connects/disconnects triggers a rehash?
We could maybe even ignore cluster disconnects altogether, since what we really care about is the number of connected clusters on each controller.

@RomyKess (Author) commented:

I have created a temporary workaround for our environment in the form of an Ansible playbook. On a schedule, it fetches the clusters from Argo, filters them by connection status, and locates their matching secrets. After dividing the connected-cluster count by the number of controllers, it loops over each shard that many times, then distributes the remainder. On each iteration, it patches the secret with the current loop's shard as the value.
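The divide-then-remainder step of the playbook can be sketched in Go (`splitEven` is a hypothetical helper, just to show the arithmetic): every shard gets `n/k` clusters, and the first `n%k` shards each get one extra.

```go
package main

import "fmt"

// splitEven returns how many clusters each of k shards should receive
// when n connected clusters are divided as evenly as possible: every
// shard gets n/k, and the first n%k shards get one extra (the remainder).
func splitEven(n, k int) []int {
	counts := make([]int, k)
	for i := range counts {
		counts[i] = n / k
		if i < n%k {
			counts[i]++
		}
	}
	return counts
}

func main() {
	// 17 connected clusters over 5 shards: 17/5 = 3 each, remainder 2.
	fmt.Println(splitEven(17, 5)) // [4 4 3 3 3]
}
```

The playbook then patches each matching cluster secret with the shard number for its slot, so connected clusters end up spread evenly regardless of how many clusters are currently off.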
