# argocd-autoscaler

This controller can automatically partition shards (destination kubernetes clusters) to ArgoCD Application Controllers,
determine how many App Controller replicas are needed for that partitioning, and scale App Controllers accordingly.

There are three levels of resolution at which I can explain how this works and how to use it.

## TL;DR

Level one, aka the TL;DR version: it will look at Prometheus metrics from App Controllers,
determine the load index of each destination cluster,
and partition clusters to replicas in the most efficient way.
And scale the App Controllers accordingly, too. You can install it using kustomize in three simple steps:

1. Grab CRDs from [./config/crd](./config/crd)
1. Grab autoscaler from [./config/default](./config/default)
```yaml
patches:
  - target:
      kind: Deployment
      name: argocd-autoscaler
    patch: |-
      - op: add
        path: /spec/template/spec/containers/0/resources
        # ... (values collapsed in this excerpt)
            memory: 128Mi
```
Note that it does only partitioning and horizontal scaling.
It is still on you to appropriately scale App Controllers' CPU/memory based on how the load gets distributed.
This autoscaler aims to keep all replicas utilized equally, so the usual scaling tools would work just fine.
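For the vertical part, any of the usual tools should do. As a purely illustrative sketch (not something this project ships), a VerticalPodAutoscaler pointed at the Application Controller workload could handle CPU/memory; the namespace, StatefulSet name, and floor values below are assumptions based on a stock ArgoCD install, and the VPA CRDs must already be present:

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: argocd-application-controller
  namespace: argocd                        # assumption: stock ArgoCD namespace
spec:
  targetRef:
    apiVersion: apps/v1
    kind: StatefulSet
    name: argocd-application-controller    # assumption: stock workload name
  updatePolicy:
    updateMode: "Auto"                     # apply recommendations automatically; use "Off" to only observe
  resourcePolicy:
    containerPolicies:
      - containerName: argocd-application-controller
        minAllowed:
          cpu: 250m                        # floor values are arbitrary examples
          memory: 256Mi
```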

## Advanced configuration

### Customize scaling strategy

The things you will likely want to customize are `poll.yaml` and `load-index.yaml`.

The `poll.yaml` uses `PrometheusPoll` and controls what queries are made to Prometheus.
Based on your preference, you may adjust what works for you.
Note that each query must return a single value.

The `load-index.yaml` uses `WeightedPNormLoadIndex` to calculate the load index from normalized polling results with weights.
Here, you may want to adjust how much each individual metric contributes to the result.
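If the name is indicative, the load index is presumably a weighted p-norm of the normalized metric values, along the lines of (the notation below is mine, not taken from the project):

$$\mathrm{loadIndex} = \left( \sum_i w_i \, x_i^{\,p} \right)^{1/p}$$

where $x_i$ are the normalized polling results and $w_i$ are the per-metric weights configured here; increasing a weight makes that metric dominate the index.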

The `evaluation.yaml` uses `MostWantedTwoPhaseHysteresisEvaluation` to observe and promote partitioning.
Here, you may want to customize how long it should be observing before electing a new partitioning.
Customization recommendations based on defaults are as follows.

Generally speaking, you may want to follow the default opinionated `quantile_over_time` sampling and maybe only modify
which quantiles are sampled and their weights.
Otherwise, keep in mind to sample at a high enough quantile and over longer ranges.
This helps to "remember" spikes for longer,
and have them reflected in the load index later.
You will receive a mostly flat load index, and that's what you are aiming for.
Otherwise, spikes will not inform overall partitioning decisions in the evaluation phase,
and (provided that at idle all shards are at about the same overall rate of utilization) you will most likely
end up with a partitioning of one shard per replica.
Which is totally an option if this is literally all you care about - to make it automatically schedule new replicas
for new clusters.
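To make the sampling advice concrete, a purely illustrative query (not one of the shipped defaults; the metric name and labels are assumptions to adapt to your setup) sampling a high quantile over a long range could look like:

```
quantile_over_time(
  0.95,
  workqueue_depth{name="app_reconciliation_queue"}[1h]
)
```

A high quantile over a long window keeps short spikes visible in the load index instead of averaging them away.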

It may be tempting to remove the evaluation piece and use the load index directly for scaling,
if you want to make scaling more reactive and instantaneous.
Which is fine if you are aiming for one shard = one replica,
but doing that otherwise may result in App Controller restarts every poll cycle and instability.

There is a middle ground, which is to still use evaluation,
use `max` over the minimal possible polling period to get the latest values in queries, …
There are also other dashboards generated by `kubebuilder` - runtime telemetry included.

### Secured Autoscaler with Prometheus

Generally speaking, securing communications between Prometheus and the `/metrics` endpoint typically means:

1. Network Policy that allows only Prometheus to access the `/metrics` endpoint.
1. Put the `/metrics` endpoint behind authentication.
1. Apply TLS encryption to the `/metrics` endpoint.

The Network Policy is already included in the previous example. For the rest, you do need additional RBAC and CertManager.
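For reference, such a policy would look roughly like the following sketch (namespace names, labels, and the port are assumptions to adjust to your environment):

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: argocd-autoscaler-metrics
  namespace: argocd                                  # assumption: autoscaler runs alongside ArgoCD
spec:
  podSelector:
    matchLabels:
      control-plane: argocd-autoscaler               # assumption: match your autoscaler pods
  policyTypes:
    - Ingress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: monitoring   # assumption: Prometheus namespace
          podSelector:
            matchLabels:
              app.kubernetes.io/name: prometheus
      ports:
        - protocol: TCP
          port: 8443                                 # assumption: the autoscaler's metrics port
```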

Once you make sure these prerequisites are deployed, you may want to use [./config/default-secured](./config/default-secured).
This will apply authentication and TLS to the `/metrics` endpoint,
and it will also modify the ServiceMonitor to inform Prometheus how to scrape it securely.
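The resulting ServiceMonitor would be along these lines (a hedged sketch, not the literal manifest from ./config/default-secured; the names, port, and certificate secret are assumptions):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: argocd-autoscaler
  namespace: argocd                        # assumption
spec:
  selector:
    matchLabels:
      control-plane: argocd-autoscaler     # assumption: labels of the metrics Service
  endpoints:
    - port: https                          # assumption: named port on the metrics Service
      scheme: https
      path: /metrics
      bearerTokenFile: /var/run/secrets/kubernetes.io/serviceaccount/token
      tlsConfig:
        serverName: argocd-autoscaler-metrics.argocd.svc   # assumption: certificate SAN
        ca:
          secret:
            name: argocd-autoscaler-metrics-cert           # assumption: CertManager-issued cert
            key: ca.crt
```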

## Hardcore

