Question about scaling with AlertManager #1271

Open
rrrrover opened this issue Jul 30, 2019 · 19 comments

@rrrrover

rrrrover commented Jul 30, 2019

My actions before raising this issue

OpenFaaS uses Prometheus to monitor function calls, and when a function's QPS rises above a threshold, autoscaling is triggered.

But after functions are scaled up, the QPS won't go down, so functions keep being scaled up until maxReplicas is reached.

In my opinion, when we scale up a function, the QPS for each function replica goes down, which means the load on each replica goes down.

So once we scale a function to X replicas such that QPS/X is relatively small, we can stop scaling up.
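
As a worked example (with hypothetical numbers): if the total load is 30 QPS and we consider 5 QPS per replica safe, then X = ceil(30 / 5) = 6 replicas is enough, and scaling beyond that only wastes resources.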

Also, when the alert resolves, replicas are set back to minReplicas, so the QPS per replica will rise again and probably end up higher than we'd expect.

Expected Behaviour

  1. When the APIHighInvocationRate alert fires, the function should only scale up to an appropriate level, not all the way to maxReplicas.

  2. When APIHighInvocationRate stops firing, we should scale the function down gracefully just as we scale it up, little by little, to finally reach a safe QPS per replica

Current Behaviour

  1. When the APIHighInvocationRate alert keeps firing (function QPS is high), the function's replicas will soon reach maxReplicas (default 20)

  2. When the APIHighInvocationRate alert stops, the function's replicas will drop to minReplicas (default 1)

Possible Solution

  1. To solve the scale-up issue, we could change the Prometheus alert rule to use QPS/replicas (a full rule sketch follows this list). In my local test I use:

sum by(function_name) (rate(gateway_function_invocation_total{code="200"}[10s]) / ignoring(code) gateway_service_count) > 5

  2. To solve the scale-down issue, we could add a new scale-down endpoint in the gateway and a new Prometheus rule that invokes the scale-down API when there are more replicas than we need
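
A minimal sketch of a complete alerting rule around the expression above (the rule name, duration and labels here are illustrative, not the stock OpenFaaS rule):

- alert: APIHighInvocationRatePerReplica
  expr: sum by(function_name) (rate(gateway_function_invocation_total{code="200"}[10s]) / ignoring(code) gateway_service_count) > 5
  for: 5s
  labels:
    service: gateway
    severity: major
  annotations:
    description: High invocation rate per replica on {{ $labels.function_name }}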

Steps to Reproduce (for bugs)

  1. Start minikube, deploy faas-netes, and deploy some functions for testing.
  2. Invoke the function 5+ times per second; I use hey to invoke the curl function 6 times per second.

hey -m POST -q 6 -c 1 -d http://some-test-service:8080/ -z 30m http://192.168.99.100:31112/function/curl

  3. Run kubectl logs -f deploy/gateway -c gateway -n openfaas | grep Scale to watch the scale up/down logs
@rrrrover
Author

Screenshot from 2019-07-24 12-32-21

@rrrrover
Author

In the picture I shared above, you can see the curl function being scaled up every 40 seconds, according to the default AlertManager settings.
And after I stopped the function calls, the function replicas dropped to 1 immediately.

@alexellis alexellis changed the title Autoscaling in openfaas may need improvement Question about scaling with AlertManager Jul 30, 2019
@alexellis
Member

Hi @rrrrover, thanks for your interest in the auto-scaling.

I think you've described how the AlertManager option works reasonably well. It's not the only option and this is customisable.

If you are not satisfied with the default auto-scaling for your use-case, you can edit it:

  1. OpenFaaS has an open REST API which you could use to implement your own autoscaling algorithm or controller

  2. You can use the HPAv2 rules in Kubernetes.

HPAv2 would allow you to use CPU, memory, or custom metrics i.e. QPS (see the metrics gathered from the watchdog / function for this option); a minimal sketch follows this list.

  3. You could edit the AlertManager rules for scaling up
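
A minimal HPAv2 sketch for option 2, assuming a per-pod custom metric such as requests_per_second exposed through a Prometheus metrics adapter (the metric name, function name and namespace here are illustrative):

apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
  name: curl
  namespace: openfaas-fn
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: curl
  minReplicas: 1
  maxReplicas: 20
  metrics:
    - type: Pods
      pods:
        metric:
          name: requests_per_second   # hypothetical custom metric served by the adapter
        target:
          type: AverageValue
          averageValue: "5"

This keeps the average QPS per pod around the target, which is essentially the behaviour described in this issue.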

As you identified, scaling down to min replicas corresponds to a resolved alert from AlertManager. I am not sure how much you can expect to edit that experience whilst retaining that semantic.

You can edit the AlertManager rules for scaling up, and that's something I've seen other users doing too. I would suggest you try out your sample PromQL and report back on how it compares for your use-case.

Looking forward to hearing from you soon,

Alex

@alexellis
Member

--
Join Slack to connect with the community
https://docs.openfaas.com/community

@rrrrover
Author

rrrrover commented Jul 31, 2019

Hi @alexellis , thanks for the reply and the patient guidance.

My use case was inspired by the HPAv2 rules in k8s.
An HPAv2 rule ensures that each function pod can only use a limited amount of the cluster's resources.
In my understanding, each function pod should likewise only handle a limited number of requests per second.

That's why I observe QPS per pod not QPS total in prometheus.

I've tried my new PromQL, which fires an alert when each pod handles over 5 requests per second:

sum by(function_name) (rate(gateway_function_invocation_total{code="200"}[10s]) / ignoring(code) gateway_service_count) > 5

I send 6 requests to the function pod every second, so it will scale up to 5 pods to resolve the alert.

Screenshot from 2019-07-31 09-27-37

And I found that when the replicas finally reach the desired number, the alert resolves and the pods are scaled down to 1. Then the alert fires again.

Screenshot from 2019-07-31 09-27-43

So my proposal to scale down via a new Prometheus alert is meant to solve this infinite loop.

We could still observe the QPS per pod, but this time we should pick the threshold carefully so that, after scaling down, the QPS per pod does not trigger scale-up again.

In the example above, we could scale down with a step of 4 pods (20% * maxReplicas) when the QPS per pod is less than 1. Then QPS(6) / replicas(5) > 1, so no scale-down is triggered and the replicas are stable.
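
For illustration, the scale-down alert I have in mind would reuse the per-replica expression with the lower threshold (a sketch of the idea, not a stock rule):

sum by(function_name) (rate(gateway_function_invocation_total{code="200"}[10s]) / ignoring(code) gateway_service_count) < 1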

@rrrrover
Author

rrrrover commented Jul 31, 2019

OpenFaaS has an open REST API which you could use to implement your own autoscaling algorithm or controller

By this, do you mean the /system/scale-function/{functionname} API? This API seems helpful; I could build my own controller that calls it to scale up/down.

My use case is not a real-world requirement; I was just studying OpenFaaS and thinking about the auto-scaling. If this is not OpenFaaS's main focus right now, I can close this issue.

BTW I joined the community a few days ago, and I'm very willing to contribute :D

@alexellis
Member

Hi @rrrrover,

I think you have a valid point and I'd like to see how far you can push AlertManager. It may require a separate Go process, similar to faas-idler, to make sure that scale-up and scale-down are coordinated rather than independent.

What's your name on Slack?

@rrrrover
Author

Hi @alexellis , my name is also rrrrover on slack

@alexellis
Member

@rrrrover would you also be interested in working on this issue? openfaas/faas-netes#483

@rrrrover
Author

rrrrover commented Aug 1, 2019

@alexellis thank you for your trust, I'd like to work on that issue too.

@rrrrover
Author

rrrrover commented Aug 7, 2019

Hi @alexellis , I've created a project, faas-autoscaler, to do autoscaling for OpenFaaS. Would you mind taking some time to have a look at it?
It has some problems with secret binding, but for autoscaling it works just fine; I'll keep improving it.

Currently I use two Prometheus rules, one for scale up and one for scale down.
Each time it scales up/down, the replica count increases/decreases by deltaReplica until it reaches the limit:

deltaReplica = maxReplicas * scalingFactor

Now faas-autoscaler can scale functions up/down normally. I'll do some math later to find proper QPS thresholds for scale up/down.
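
With the defaults mentioned earlier (maxReplicas = 20, scalingFactor = 20%), that works out to deltaReplica = 20 * 0.2 = 4, so each firing alert moves the replica count by 4 until minReplicas or maxReplicas is reached.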

@rrrrover
Author

Hi @alexellis , it's been a while since our last talk. I've updated my faas-autoscaler project.
Now faas-autoscaler can control replicas with only one Prometheus rule:

- alert: APIInvoke
  expr: rate(gateway_function_invocation_total[10s]) / ignoring(code) gateway_service_count >= 0
  for: 5s
  labels:
    service: gateway
    severity: major
    action: auto-scale
    target: 2
    value: "{{ $value }}"
  annotations:
    description: Function invoke on {{ $labels.function_name }}
    summary: Function invoke on {{ $labels.function_name }}

With this rule set, faas-autoscaler knows the desired metric value for each function replica, defined by the label target: 2. faas-autoscaler also knows the current metric value, i.e. value: "{{ $value }}".
Then faas-autoscaler will calculate the desired replicas:

desiredReplicas = ceil[currentReplicas * ( value / target )]

As the rule expr is always true, the alert keeps firing, so faas-autoscaler effectively checks the function replicas periodically (every 40 seconds).
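
As a worked example with this rule (target = 2) and the 6 requests-per-second test load from earlier: starting from 1 replica, value = 6/1 = 6, so desiredReplicas = ceil[1 * (6 / 2)] = 3; at 3 replicas, value = 6/3 = 2, so desiredReplicas = ceil[3 * (2 / 2)] = 3 and the replica count stays stable.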

@lmxia

lmxia commented Aug 15, 2019

How about simply scaling down from the current replicas to "currentReplicas - math.Ceil(currentReplicas * scalingFactor)" when the resolved event is received? Then we would need no scale-down endpoint.

@rrrrover
Author

rrrrover commented Aug 15, 2019

Hi @lmxia , thanks for the tip. I've improved faas-autoscaler a little; it now uses only one endpoint, /system/auto-scale. Because we know the desired metric for the function and the current value, we can easily calculate the desired replicas using:

desiredReplicas = ceil[currentReplicas * ( value / target )]

I'm still keeping the "old" faas-autoscaler endpoints /system/scale-up and /system/scale-down.
If anyone would rather use the old way, they should use both of them to make autoscaling work; I'll provide an example.

Let's assume we need to autoscale functions according to the RPS (requests per second) of each replica, and we want the RPS to stay in the range [50, 100].

When the system receives 1000 function calls per second, the optimal replica count is 10. With the old config set, we scale up step by step, bringing the replicas to 10. And when the system RPS drops to only 100, we should scale down to 1 replica, step by step, driven by the scale-down alert.

If we only scale down when the scale-up alert resolves, then we either scale down to minReplicas, which is what led me to open this issue, or, per your suggestion, scale down by a small step of math.Ceil(currentReplicas * scalingFactor), which leaves excess replicas running and wastes resources.
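
To make the [50, 100] band concrete with these numbers: at 1000 calls per second, the replica count is stable whenever 50 <= 1000/replicas <= 100, i.e. between ceil(1000/100) = 10 and 1000/50 = 20 replicas; the scale-up rule fires below 10 replicas and the scale-down rule fires above 20.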

@alexellis
Member

I think this would be a good topic for the next community call, would you be interested in presenting your scaler @rrrrover ?

@rrrrover
Author

Hi @alexellis , thanks for this opportunity. But first I'd like to know when the community call is. Because I'm in China, there is a 9-hour time difference, so I might not be able to join.

@alexellis
Member

Thank you for your work on this

@kevin-lindsay-1
Sponsor

kevin-lindsay-1 commented Oct 12, 2020

So when I invoke a long-running process, and it takes a few seconds to give a response (thereby using gateway_function_invocation_total), autoscaling currently increases my count of nodes, but only upon completion (and therefore lags behind the current queued workload).

Similarly, after a burst of invocations completes, the function scales up and then back down (by the looks of it, while the function is still running), because not enough invocations have completed in the last 5 seconds.

My initial thought is to alter the alert rule to take into account gateway_function_invocation_started, and then from there compare it to gateway_function_invocation_total.

That said, it might simply be more appropriate to calculate a new metric specifically for currently running invocations, and then provide (or calculate) the number of invocations that a particular pod should be able to handle concurrently.

As it stands, autoscaling doesn't really appear to work for longer-running functions (on the order of a minute or two per invocation), because it rubber-bands the scaling size based on recently completed invocations, not current invocations.

I'm currently experimenting with a slightly altered alert rule of something like:

sum by (function_name) (
  gateway_function_invocation_started - 
  ignoring (code) gateway_function_invocation_total{code="200"} -
  ignoring (code) gateway_function_invocation_total{code="500"}
)

Apologies for the less-than-optimal query; I'm not super experienced with PromQL.

I see there's some documentation about this being configurable via a ConfigMap, but I'm not really sure what that example should look like. Digging around for that.
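
My best guess at what such a rules file could look like (standard Prometheus rule-group format; the alert name, threshold, and the exact ConfigMap key the OpenFaaS chart expects are guesses I still need to verify):

groups:
  - name: openfaas.rules
    rules:
      - alert: APIHighInFlightInvocations
        expr: |
          sum by (function_name) (
            gateway_function_invocation_started
            - ignoring (code) gateway_function_invocation_total{code="200"}
            - ignoring (code) gateway_function_invocation_total{code="500"}
          ) > 5
        for: 5s
        labels:
          service: gateway
          severity: major
        annotations:
          description: High in-flight invocation count on {{ $labels.function_name }}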

@kevin-lindsay-1
Sponsor

Is this issue still active? There hasn't been any activity in a year.
