Summary
We use ArgoCD to deploy to over a hundred clusters. The tool works very well for us, except that it doesn't fully support our use case.
Assigning clusters to shards manually is not ideal, especially because our clusters are turned on and off frequently, so a round-robin capability that takes cluster connection status into account would be a great and much-needed addition.
Motivation
We have around a hundred clusters; some are up and some are turned off, and their state changes frequently. Round robin isn't a good fit because it divides all clusters equally without differentiating by status (on or off). As a result, one controller with 20 clusters can have 16 that are up, while another controller with 20 clusters has only 5 up. The first might suffer 60 OOM kills in a single day, while the second consumes only half of its memory limit.
This leads to a major waste of resources across all of our ArgoCD instances, and the problem only grows as the number of clusters increases.
Proposal
The easiest way to see the imbalance is to compare two metrics:
```
sum(argocd_cluster_info) by (pod)
```
vs.
```
sum(argocd_cluster_connection_status) by (pod)
```
For us, the per-pod numbers might be equal in the first query, but drastically different in the second.
We suggest adding a new mode to round robin that distinguishes between connected and disconnected clusters.
That way, connected clusters are divided equally among the controllers, and so are disconnected clusters.
This should greatly improve the balancing of the overall memory load. :)
The second metric also counts clusters with a connection status of Unknown, which we have found do not load the controllers nearly as much. So grouping clusters by connection status and treating only those with a Successful status as connected is probably best.
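Roughly, the idea in code (a minimal Go sketch, not ArgoCD's actual sharding implementation; the `Cluster` type and status values here are illustrative stand-ins for what the controller tracks):

```go
package main

import "fmt"

// ConnectionStatus mirrors the values reported by the
// argocd_cluster_connection_status metric.
type ConnectionStatus string

const (
	Successful ConnectionStatus = "Successful"
	Failed     ConnectionStatus = "Failed"
	Unknown    ConnectionStatus = "Unknown"
)

type Cluster struct {
	Name   string
	Status ConnectionStatus
}

// assignShards round-robins two groups independently: clusters whose
// status is Successful, and everything else. Each shard therefore gets
// an equal share of connected clusters and an equal share of the rest.
func assignShards(clusters []Cluster, shards int) map[string]int {
	assignment := make(map[string]int)
	next := map[bool]int{} // one round-robin counter per group
	for _, c := range clusters {
		connected := c.Status == Successful
		assignment[c.Name] = next[connected] % shards
		next[connected]++
	}
	return assignment
}

func main() {
	clusters := []Cluster{
		{"a", Successful}, {"b", Failed}, {"c", Successful},
		{"d", Unknown}, {"e", Successful}, {"f", Failed},
	}
	for name, shard := range assignShards(clusters, 2) {
		fmt.Printf("%s -> shard %d\n", name, shard)
	}
}
```

With 6 clusters (3 connected) and 2 shards, every shard ends up with at most 2 connected clusters, instead of one shard possibly holding all 3.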
Makes sense. Consistent hashing is already one of the provided sharding algorithms, so maybe that capability could be used so that only a certain number of cluster connects/disconnects triggers a rehash?
Maybe even ignore cluster disconnects altogether, since what we really care about is the number of connected clusters on each controller.
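To illustrate why consistent hashing limits rehashing, here is a toy hash ring in Go where only connected clusters would be placed; when a cluster appears or disappears, only it moves, and shards keep their other clusters. None of this is ArgoCD's API; shard names, replica counts, and cluster names are made up for the example:

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

// hashRing is a minimal consistent-hash ring: each shard is placed on the
// ring at several virtual points, and a cluster maps to the first shard
// point at or after its own hash.
type hashRing struct {
	points []uint32       // sorted hashes of virtual shard points
	owner  map[uint32]int // point hash -> shard number
}

func hashOf(s string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(s))
	return h.Sum32()
}

func newHashRing(shards, replicas int) *hashRing {
	r := &hashRing{owner: make(map[uint32]int)}
	for s := 0; s < shards; s++ {
		for v := 0; v < replicas; v++ {
			p := hashOf(fmt.Sprintf("shard-%d-%d", s, v))
			r.points = append(r.points, p)
			r.owner[p] = s
		}
	}
	sort.Slice(r.points, func(i, j int) bool { return r.points[i] < r.points[j] })
	return r
}

// shardFor maps a cluster name to a shard. Per the comment above, only
// connected clusters would be placed on the ring; disconnected ones can
// simply be skipped, since they barely load the controllers.
func (r *hashRing) shardFor(cluster string) int {
	h := hashOf(cluster)
	i := sort.Search(len(r.points), func(i int) bool { return r.points[i] >= h })
	if i == len(r.points) {
		i = 0 // wrap around the ring
	}
	return r.owner[r.points[i]]
}

func main() {
	ring := newHashRing(3, 50)
	for _, c := range []string{"prod-eu-1", "prod-us-2", "edge-7"} {
		fmt.Printf("%s -> shard %d\n", c, ring.shardFor(c))
	}
}
```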
I have created a temporary workaround for our environment in the form of an Ansible playbook. On a schedule, it fetches the clusters from Argo, filters them by connection status, and locates their matching secrets. After dividing the cluster count by the number of controllers, it loops for each shard the quotient number of times, then loops again for the remainder; on each iteration it patches a secret with the current loop's shard number as the value of its shard key.
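For reference, the assignment logic of that playbook looks roughly like the following Go sketch. The cluster names are hypothetical, and the emitted `kubectl patch` commands are a stand-in for the playbook's secret patching (ArgoCD cluster secrets accept a `shard` key in their data):

```go
package main

import "fmt"

func main() {
	// Hypothetical names; in the real playbook these come from the Argo
	// API, filtered to connection status Successful.
	connected := []string{"cluster-a", "cluster-b", "cluster-c", "cluster-d", "cluster-e"}
	shards := 3

	base := len(connected) / shards      // every shard gets this many
	remainder := len(connected) % shards // the first shards get one extra

	i := 0
	for shard := 0; shard < shards; shard++ {
		count := base
		if shard < remainder {
			count++
		}
		for n := 0; n < count; n++ {
			// Stand-in for the playbook's patch step: write the shard
			// number into the cluster secret's "shard" key.
			fmt.Printf("kubectl -n argocd patch secret %s -p '{\"stringData\":{\"shard\":\"%d\"}}'\n",
				connected[i], shard)
			i++
		}
	}
}
```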