Question about scaling with AlertManager #1271

Open
rrrrover opened this issue Jul 30, 2019 · 19 comments

@rrrrover

rrrrover commented Jul 30, 2019

My actions before raising this issue

OpenFaaS uses Prometheus to monitor function calls, and when a function's QPS rises above a threshold, autoscaling is triggered.

But after functions are scaled up, the QPS won't go down, so functions keep being scaled up until maxReplicas is reached.

In my opinion, when we scale up a function, the QPS for each function replica goes down, which means the load on each replica goes down.

So once we scale a function to X replicas such that QPS/X is relatively small, we can stop scaling up.
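
As a worked example (with hypothetical numbers): if the total load is 30 QPS and we consider 5 QPS per replica safe, then X = ceil(30 / 5) = 6 replicas is enough, and scaling beyond that only wastes resources.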

Also, when the alert resolves, replicas are set back to minReplicas, so the QPS per replica will rise again and probably end up higher than we'd expect.

Expected Behaviour

  1. When the APIHighInvocationRate alert fires, the function should only scale up to an appropriate level, not all the way to maxReplicas.

  2. When APIHighInvocationRate stops firing, we should scale the function down gracefully just as we scale it up, little by little, to finally reach a safe QPS per replica

Current Behaviour

  1. When the APIHighInvocationRate alert keeps firing (function QPS is high), the function's replicas will soon reach maxReplicas (default 20)

  2. When the APIHighInvocationRate alert stops, the function's replicas will drop to minReplicas (default 1)

Possible Solution

  1. To solve the scale-up issue, we could change the Prometheus alert rule to use QPS/replicas (a full rule sketch follows this list). In my local test I use:

sum by(function_name) (rate(gateway_function_invocation_total{code="200"}[10s]) / ignoring(code) gateway_service_count) > 5

  2. To solve the scale-down issue, we could add a new scale-down endpoint in the gateway and a new Prometheus rule that invokes the scale-down API when there are more replicas than we need
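
A minimal sketch of a complete alerting rule around the expression above (the rule name, duration and labels here are illustrative, not the stock OpenFaaS rule):

- alert: APIHighInvocationRatePerReplica
  expr: sum by(function_name) (rate(gateway_function_invocation_total{code="200"}[10s]) / ignoring(code) gateway_service_count) > 5
  for: 5s
  labels:
    service: gateway
    severity: major
  annotations:
    description: High invocation rate per replica on {{ $labels.function_name }}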

Steps to Reproduce (for bugs)

  1. Start minikube, deploy faas-netes, and deploy some functions for testing.
  2. Invoke the function 5+ times per second; I use hey to invoke the curl function 6 times per second.

hey -m POST -q 6 -c 1 -d http://some-test-service:8080/ -z 30m http://192.168.99.100:31112/function/curl

  3. Run kubectl logs -f deploy/gateway -c gateway -n openfaas | grep Scale to watch the scale up/down logs
@rrrrover
Author

Screenshot from 2019-07-24 12-32-21

@rrrrover
Author

In the picture I shared above, you can see the curl function being scaled up every 40 seconds, according to the default AlertManager settings.
And after I stopped the function calls, the function replicas dropped to 1 immediately.

@alexellis alexellis changed the title Autoscaling in openfaas may need improvement Question about scaling with AlertManager Jul 30, 2019
@alexellis
Member

Hi @rrrrover, thanks for your interest in the auto-scaling.

I think you've described how the AlertManager option works reasonably well. It's not the only option and this is customisable.

If you are not satisfied with the default auto-scaling for your use-case, you can edit it:

  1. OpenFaaS has an open REST API which you could use to implement your own autoscaling algorithm or controller

  2. You can use the HPAv2 rules in Kubernetes.

HPAv2 would allow you to use CPU, memory, or custom metrics i.e. QPS (see the metrics gathered from the watchdog / function for this option); a minimal sketch follows this list.

  3. You could edit the AlertManager rules for scaling up
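
A minimal HPAv2 sketch for option 2, assuming a per-pod custom metric such as requests_per_second exposed through a Prometheus metrics adapter (the metric name, function name and namespace here are illustrative):

apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
  name: curl
  namespace: openfaas-fn
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: curl
  minReplicas: 1
  maxReplicas: 20
  metrics:
    - type: Pods
      pods:
        metric:
          name: requests_per_second   # hypothetical custom metric served by the adapter
        target:
          type: AverageValue
          averageValue: "5"

This keeps the average QPS per pod around the target, which is essentially the behaviour described in this issue.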

As you identified, scaling down to min replicas corresponds to a resolved alert from AlertManager. I am not sure how much you can expect to edit that experience whilst retaining that semantic.

You can edit the AlertManager rules for scaling up, and that's something I've seen other users doing too. I would suggest you try out your sample PromQL and report back on how it compares for your use-case.

Looking forward to hearing from you soon,

Alex

@alexellis
Member

--
Join Slack to connect with the community
https://docs.openfaas.com/community

@rrrrover
Author

rrrrover commented Jul 31, 2019

Hi @alexellis , thanks for the reply and the patient guidance.

My use case was inspired by the HPAv2 rules in k8s.
An HPAv2 rule ensures that each function pod can only use a limited amount of the cluster's resources.
In my understanding, each function pod should likewise only handle a limited number of requests per second.

That's why I observe QPS per pod not QPS total in prometheus.

I've tried my new PromQL, which fires an alert when each pod handles over 5 requests per second:

sum by(function_name) (rate(gateway_function_invocation_total{code="200"}[10s]) / ignoring(code) gateway_service_count) > 5

I send 6 requests to the function pod every second, so it will scale up to 5 pods to resolve the alert.

Screenshot from 2019-07-31 09-27-37

And I found that when the replicas finally reach the desired number, the alert resolves and the pods are scaled down to 1. Then the alert fires again.

Screenshot from 2019-07-31 09-27-43

So my proposal to scale down via a new Prometheus alert is meant to solve this infinite loop.

We could still observe the QPS per pod, but this time we should pick the threshold carefully so that, after scaling down, the QPS per pod does not trigger scale-up again.

In the example above, we could scale down with a step of 4 pods (20% * maxReplicas) when the QPS per pod is less than 1. Then QPS(6) / replicas(5) > 1, so no scale-down is triggered and the replicas are stable.
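
For illustration, the scale-down alert I have in mind would reuse the per-replica expression with the lower threshold (a sketch of the idea, not a stock rule):

sum by(function_name) (rate(gateway_function_invocation_total{code="200"}[10s]) / ignoring(code) gateway_service_count) < 1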

@rrrrover
Author

rrrrover commented Jul 31, 2019

OpenFaaS has an open REST API which you could use to implement your own autoscaling algorithm or controller

By this, do you mean the /system/scale-function/{functionname} API? This API seems helpful; I could build my own controller that calls it to scale up/down.

My use case is not a real-world requirement; I was just studying OpenFaaS and thinking about the auto-scaling. If this is not OpenFaaS's main focus right now, I can close this issue.

BTW I joined the community a few days ago, and I'm very willing to contribute :D

@alexellis
Member

Hi @rrrrover,

I think you have a valid point and I'd like to see how far you can push AlertManager. It may require a separate Go process, similar to faas-idler, to make sure that scale-up and scale-down are coordinated rather than independent.

What's your name on Slack?

@rrrrover
Author

Hi @alexellis , my name is also rrrrover on slack

@alexellis
Member

@rrrrover would you also be interested in working on this issue? openfaas/faas-netes#483

@rrrrover
Author

rrrrover commented Aug 1, 2019

@alexellis thank you for your trust, I'd like to work on that issue too.

@rrrrover
Author

rrrrover commented Aug 7, 2019

Hi @alexellis , I've created a project, faas-autoscaler, to do autoscaling for OpenFaaS. Would you mind taking some time to have a look at it?
It has some problems with secret binding, but for autoscaling it works just fine; I'll keep improving it.

Currently I use two Prometheus rules, one for scale up and one for scale down.
Each time it scales up/down, the replica count increases/decreases by deltaReplica until it reaches the limit:

deltaReplica = maxReplicas * scalingFactor

Now faas-autoscaler can scale functions up/down normally. I'll do some math later to find proper QPS thresholds for scale up/down.
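
With the defaults mentioned earlier (maxReplicas = 20, scalingFactor = 20%), that works out to deltaReplica = 20 * 0.2 = 4, so each firing alert moves the replica count by 4 until minReplicas or maxReplicas is reached.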

@rrrrover
Author

Hi @alexellis , it's been a while since our last talk. I've updated my faas-autoscaler project.
Now faas-autoscaler can control replicas with only one Prometheus rule:

- alert: APIInvoke
  expr: rate(gateway_function_invocation_total[10s]) / ignoring(code) gateway_service_count >= 0
  for: 5s
  labels:
    service: gateway
    severity: major
    action: auto-scale
    target: 2
    value: "{{ $value }}"
  annotations:
    description: Function invoke on {{ $labels.function_name }}
    summary: Function invoke on {{ $labels.function_name }}

With this rule set, faas-autoscaler knows the desired metric value for each function replica, defined by the label target: 2. faas-autoscaler also knows the current metric value, i.e. value: "{{ $value }}".
Then faas-autoscaler will calculate the desired replicas:

desiredReplicas = ceil[currentReplicas * ( value / target )]

As the rule expr is always true, the alert keeps firing, so faas-autoscaler effectively checks the function replicas periodically (every 40 seconds).
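
As a worked example with this rule (target = 2) and the 6 requests-per-second test load from earlier: starting from 1 replica, value = 6/1 = 6, so desiredReplicas = ceil[1 * (6 / 2)] = 3; at 3 replicas, value = 6/3 = 2, so desiredReplicas = ceil[3 * (2 / 2)] = 3 and the replica count stays stable.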

@lmxia

lmxia commented Aug 15, 2019

How about simply scaling down from the current replicas to "currentReplicas - math.Ceil(currentReplicas * scalingFactor)" when the resolved event is received? Then we would need no scale-down endpoint.

@rrrrover
Author

rrrrover commented Aug 15, 2019

Hi @lmxia , thanks for the tip. I've improved faas-autoscaler a little; it now uses only one endpoint, /system/auto-scale. Because we know the desired metric for the function and the current value, we can easily calculate the desired replicas using:

desiredReplicas = ceil[currentReplicas * ( value / target )]

I'm still keeping the "old" faas-autoscaler endpoints /system/scale-up and /system/scale-down.
If anyone would rather use the old way, they should use both of them to make autoscaling work; I'll provide an example.

Let's assume we need to autoscale functions according to the RPS (requests per second) of each replica, and we want the RPS to stay in the range [50, 100].

When the system receives 1000 function calls per second, the optimal replica count is 10. With the old config set, we scale up step by step, bringing the replicas to 10. And when the system RPS drops to only 100, we should scale down to 1 replica, step by step, driven by the scale-down alert.

If we only scale down when the scale-up alert resolves, then we either scale down to minReplicas, which is what led me to open this issue, or, per your suggestion, scale down by a small step of math.Ceil(currentReplicas * scalingFactor), which leaves excess replicas running and wastes resources.
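
To make the [50, 100] band concrete with these numbers: at 1000 calls per second, the replica count is stable whenever 50 <= 1000/replicas <= 100, i.e. between ceil(1000/100) = 10 and 1000/50 = 20 replicas; the scale-up rule fires below 10 replicas and the scale-down rule fires above 20.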

@alexellis
Member

I think this would be a good topic for the next community call, would you be interested in presenting your scaler @rrrrover ?

@rrrrover
Author

Hi @alexellis , thanks for this opportunity. But first I'd like to know when the community call is. Because I'm in China, there is a 9-hour time difference, so I might not be able to join.

@alexellis
Member

Thank you for your work on this

@kevin-lindsay-1
Sponsor

kevin-lindsay-1 commented Oct 12, 2020

So when I invoke a long-running process, and it takes a few seconds to give a response (thereby using gateway_function_invocation_total), autoscaling currently increases my count of nodes, but only upon completion (and therefore lags behind the current queued workload).

Similarly, after a burst of invocations completes, the function scales up and then back down (by the looks of it, while the function is still running), because not enough invocations have completed in the last 5 seconds.

My initial thought is to alter the alert rule to take into account gateway_function_invocation_started, and then from there compare it to gateway_function_invocation_total.

That said, it might simply be more appropriate to calculate a new metric specifically for currently running invocations, and then provide (or calculate) the number of invocations that a particular pod should be able to handle concurrently.

As it stands, autoscaling doesn't really appear to work for longer-running functions (on the order of a minute or two per invocation), because it rubber-bands the scaling size based on recently completed invocations, not current invocations.

I'm currently experimenting with a slightly altered alert rule of something like:

sum by (function_name) (
  gateway_function_invocation_started - 
  ignoring (code) gateway_function_invocation_total{code="200"} -
  ignoring (code) gateway_function_invocation_total{code="500"}
)

Apologies for the less-than-optimal query; I'm not super experienced with PromQL.

I see there's some documentation about this being configurable via a ConfigMap, but I'm not really sure what that example should look like. Digging around for that.
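
My best guess at what such a rules file could look like (standard Prometheus rule-group format; the alert name, threshold, and the exact ConfigMap key the OpenFaaS chart expects are guesses I still need to verify):

groups:
  - name: openfaas.rules
    rules:
      - alert: APIHighInFlightInvocations
        expr: |
          sum by (function_name) (
            gateway_function_invocation_started
            - ignoring (code) gateway_function_invocation_total{code="200"}
            - ignoring (code) gateway_function_invocation_total{code="500"}
          ) > 5
        for: 5s
        labels:
          service: gateway
          severity: major
        annotations:
          description: High in-flight invocation count on {{ $labels.function_name }}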

@kevin-lindsay-1
Sponsor

Is this issue still active? There hasn't been any activity in a year.
