Summary
We use ArgoCD to deploy to over a hundred clusters. The tool works very well for us, except that it doesn't fully support our use case.
Assigning clusters to shards manually is not ideal, especially because our clusters are turned on and off frequently, so a round-robin capability that takes cluster connection status into account would be a great and much-needed addition.
Motivation
We have around a hundred clusters; some are up and some are turned off, and their state changes frequently. Round robin isn't a good fit because it divides all clusters equally without differentiating by status (on or off). As a result, one controller with 20 clusters can have 16 that are up, while another controller with 20 clusters has only 5 up. The first might suffer 60 OOM kills in a single day, while the second consumes only half of its memory limit.
This leads to a major waste of resources across all of our ArgoCD instances, and the problem only grows as the number of clusters increases.
Proposal
The easiest way to see the imbalance is to compare two metrics:
```
sum(argocd_cluster_info) by (pod)
```
vs.
```
sum(argocd_cluster_connection_status) by (pod)
```
For us, the per-pod numbers might be equal in the first query, but drastically different in the second.
We suggest adding a new mode to round robin that distinguishes between connected and disconnected clusters.
That way, connected clusters are divided equally among the controllers, and so are disconnected clusters.
This should greatly improve the balancing of the overall memory load. :)
The second metric also counts clusters with a connection status of Unknown, which we have found do not load the controllers nearly as much. So grouping clusters by connection status and treating only those with a Successful status as connected is probably best.
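Roughly, the idea in code (a minimal Go sketch, not ArgoCD's actual sharding implementation; the `Cluster` type and status values here are illustrative stand-ins for what the controller tracks):

```go
package main

import "fmt"

// ConnectionStatus mirrors the values reported by the
// argocd_cluster_connection_status metric.
type ConnectionStatus string

const (
	Successful ConnectionStatus = "Successful"
	Failed     ConnectionStatus = "Failed"
	Unknown    ConnectionStatus = "Unknown"
)

type Cluster struct {
	Name   string
	Status ConnectionStatus
}

// assignShards round-robins two groups independently: clusters whose
// status is Successful, and everything else. Each shard therefore gets
// an equal share of connected clusters and an equal share of the rest.
func assignShards(clusters []Cluster, shards int) map[string]int {
	assignment := make(map[string]int)
	next := map[bool]int{} // one round-robin counter per group
	for _, c := range clusters {
		connected := c.Status == Successful
		assignment[c.Name] = next[connected] % shards
		next[connected]++
	}
	return assignment
}

func main() {
	clusters := []Cluster{
		{"a", Successful}, {"b", Failed}, {"c", Successful},
		{"d", Unknown}, {"e", Successful}, {"f", Failed},
	}
	for name, shard := range assignShards(clusters, 2) {
		fmt.Printf("%s -> shard %d\n", name, shard)
	}
}
```

With 6 clusters (3 connected) and 2 shards, every shard ends up with at most 2 connected clusters, instead of one shard possibly holding all 3.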
Makes sense. Consistent hashing is already one of the provided sharding algorithms, so maybe that capability could be used so that only a certain number of cluster connects/disconnects triggers a rehash?
Maybe even ignore cluster disconnects altogether, since what we really care about is the number of connected clusters on each controller.
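To illustrate why consistent hashing limits rehashing, here is a toy hash ring in Go where only connected clusters would be placed; when a cluster appears or disappears, only it moves, and shards keep their other clusters. None of this is ArgoCD's API; shard names, replica counts, and cluster names are made up for the example:

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

// hashRing is a minimal consistent-hash ring: each shard is placed on the
// ring at several virtual points, and a cluster maps to the first shard
// point at or after its own hash.
type hashRing struct {
	points []uint32       // sorted hashes of virtual shard points
	owner  map[uint32]int // point hash -> shard number
}

func hashOf(s string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(s))
	return h.Sum32()
}

func newHashRing(shards, replicas int) *hashRing {
	r := &hashRing{owner: make(map[uint32]int)}
	for s := 0; s < shards; s++ {
		for v := 0; v < replicas; v++ {
			p := hashOf(fmt.Sprintf("shard-%d-%d", s, v))
			r.points = append(r.points, p)
			r.owner[p] = s
		}
	}
	sort.Slice(r.points, func(i, j int) bool { return r.points[i] < r.points[j] })
	return r
}

// shardFor maps a cluster name to a shard. Per the comment above, only
// connected clusters would be placed on the ring; disconnected ones can
// simply be skipped, since they barely load the controllers.
func (r *hashRing) shardFor(cluster string) int {
	h := hashOf(cluster)
	i := sort.Search(len(r.points), func(i int) bool { return r.points[i] >= h })
	if i == len(r.points) {
		i = 0 // wrap around the ring
	}
	return r.owner[r.points[i]]
}

func main() {
	ring := newHashRing(3, 50)
	for _, c := range []string{"prod-eu-1", "prod-us-2", "edge-7"} {
		fmt.Printf("%s -> shard %d\n", c, ring.shardFor(c))
	}
}
```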
I have created a temporary workaround for our environment in the form of an Ansible playbook. On a schedule, it fetches the clusters from Argo, filters them by connection status, and locates their matching secrets. After dividing the cluster count by the number of controllers, it loops for each shard the quotient number of times, then loops again for the remainder; on each iteration it patches a secret with the current loop's shard number as the value of its shard key.
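For reference, the assignment logic of that playbook looks roughly like the following Go sketch. The cluster names are hypothetical, and the emitted `kubectl patch` commands are a stand-in for the playbook's secret patching (ArgoCD cluster secrets accept a `shard` key in their data):

```go
package main

import "fmt"

func main() {
	// Hypothetical names; in the real playbook these come from the Argo
	// API, filtered to connection status Successful.
	connected := []string{"cluster-a", "cluster-b", "cluster-c", "cluster-d", "cluster-e"}
	shards := 3

	base := len(connected) / shards      // every shard gets this many
	remainder := len(connected) % shards // the first shards get one extra

	i := 0
	for shard := 0; shard < shards; shard++ {
		count := base
		if shard < remainder {
			count++
		}
		for n := 0; n < count; n++ {
			// Stand-in for the playbook's patch step: write the shard
			// number into the cluster secret's "shard" key.
			fmt.Printf("kubectl -n argocd patch secret %s -p '{\"stringData\":{\"shard\":\"%d\"}}'\n",
				connected[i], shard)
			i++
		}
	}
}
```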