Limit number of Elastic Agents running per cluster profile #247

Open
simcomp2003 opened this issue Apr 14, 2021 · 6 comments

@simcomp2003

I'm not sure if this is an issue or a feature request, but in the Elastic Cluster profile we have a "Maximum pending pods" setting. It is somewhat misleading and there is little documentation about it.

Anyway, I think it would be very useful to have an additional field that limits the number of agents that can be running in one cluster profile when one of its agent profiles is requested to run.

Very often K8s resources are limited, and we cannot overload the cluster with 50 running agents if we don't have capacity for them.

My question is: do we have something similar already?

If not, could this be added, with the default value set to unlimited? When a pipeline run is scheduled, the plugin would look at the number of already running agents; if the limit is set and has been reached, the job would stay scheduled and wait until an agent is free to run it.
This is similar to the behaviour we get when static agents are in place.

Thanks for any answer.

@wojtek-viirtue

+1. There should be a means to constrain the maximum number of concurrent elastic-agent pods. In its present state, the plugin can destabilize a cluster even with resource limits set.

@Evesy

Evesy commented Mar 2, 2022

A use case for this is when agent profiles are used to run jobs on different node types

In our case we have a number of different node pools with different autoscaling settings. Elastic agent profiles target these different node pools, so elastic agent x spins up on a specific node pool that could accommodate up to e.g. 12 pods, whereas elastic agent y spins up on a different node pool that can accommodate e.g. 50 pods.

It would be useful to limit the number of pods that can be launched for a given elastic profile, so excess jobs instead queue waiting to launch an agent, rather than producing an abundance of pending pods, which can trigger alerting.
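
(For reference, the per-pool targeting described above can be expressed with a nodeSelector in the agent pod spec. The sketch below is a plain Kubernetes pod manifest, not necessarily the exact format the plugin's pod-configuration field expects; the node-pool label, pod name and image are illustrative only.)

apiVersion: v1
kind: Pod
metadata:
  name: gocd-elastic-agent-example   # illustrative name only
spec:
  nodeSelector:
    # Illustrative node-pool label; use whatever label your pool actually carries.
    nodepool: ci-small
  containers:
    - name: gocd-agent
      image: gocd/gocd-agent-docker-dind   # image name illustrative
      resources:
        requests:
          cpu: "1"
          memory: 2Gi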

@chadlwilson
Member

Has anyone tried putting the elastic agents in their own namespace and imposing a pods ResourceQuota?

e.g.

apiVersion: v1
kind: ResourceQuota
metadata:
  name: elastic-agent-limit
spec:
  hard:
    pods: "10"

Perhaps the plugin should have this, but Kubernetes does have a lot of ways to limit resource consumption baked into its scheduler, which you generally have control over via the elastic profile. A raw pod count can be a bit of a blunt instrument.

While my above suggestion doesn't solve the "multiple node pools" problem unless you split these into different namespaces, I do separately wonder whether it really helps to move a queue from Kubernetes-land (pending pods) back internally to GoCD (where there is likely more limited alerting capacity), rather than tune K8S alerting to something appropriate for the usage? Besides, I thought "maximum pending pods" was supposed to limit that... I wonder what it is doing in that case, if it's doing neither.
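
(To make the above concrete: a ResourceQuota only applies within a namespace, so a sketch of the setup would be a dedicated namespace plus a quota scoped to it, with the cluster profile pointed at that namespace. The namespace name below is made up.)

apiVersion: v1
kind: Namespace
metadata:
  name: gocd-agents              # hypothetical namespace for elastic agent pods
---
apiVersion: v1
kind: ResourceQuota
metadata:
  name: elastic-agent-limit
  namespace: gocd-agents         # scope the quota to that namespace
spec:
  hard:
    pods: "10"                   # no more than 10 agent pods may exist here at once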

@wojtek-viirtue

@chadlwilson I did try this and, if I remember correctly, the pipelines just failed when the quota was reached.

@chadlwilson
Member

@wojtek-viirtue interesting, OK, thanks!

@Evesy

Evesy commented Mar 2, 2022

I do separately wonder whether it really helps to move a queue from Kubernetes-land (pending pods) back internally to GoCD (where there is likely more limited alerting capacity), rather than tune K8S alerting to something appropriate for the usage?

I suppose for us there's a separation of concerns between queued tasks and infrastructure problems, and both of those manifesting as pending pods blurs the lines a bit.

On the GoCD side we want to set an upper limit on the overall resources that a particular set of jobs can use, and we can achieve this by having an agent profile with an affinity to a specific node pool and setting upper autoscaling limits on that node pool. If that maximum were, for example, 5 nodes that can accommodate 3 agents each, that gives 15 of these jobs that can run in parallel.
If, for one reason or another, 20 of these jobs are triggered at the same time, 15 will get assigned and we would expect the other 5 to wait, which isn't a problem.
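
(As a concrete illustration of that sizing, with made-up numbers: if each node in the pool has roughly 6 allocatable CPUs and 12Gi of allocatable memory, agent pods requesting 2 CPUs / 4Gi pack three to a node, so a 5-node pool tops out at 15 concurrent agents.)

apiVersion: v1
kind: Pod
metadata:
  name: gocd-agent-sizing-example      # illustrative only
spec:
  containers:
    - name: gocd-agent
      image: gocd/gocd-agent-docker-dind   # illustrative image
      resources:
        requests:
          cpu: "2"       # 3 x 2 CPU fits a node with ~6 allocatable CPUs
          memory: 4Gi    # 3 x 4Gi fits a node with ~12Gi allocatable memory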

On the infrastructure side, pods stuck in a pending state for a prolonged period of time are usually indicative of a problem with the cluster itself; it could be an autoscaler problem, or something simple such as taints/tolerations being changed incorrectly.

In the first example we don't really need alerts: it's expected behaviour that the pipelines have a concurrency limit and that beyond it they queue. In the second example we do want alerts, as it indicates a problem that needs addressing.
Both of these currently manifest as pending pods, so it becomes hard to build alerts around that signal.

If instead the plugin exposed a max-pods option for a given agent profile, we could alert cleanly on pending pods, since we would never expect pods to be stuck in a pending state for a long period unless there is an actual problem.
When more jobs are launched than there is agent availability, the agent plugin would create only as many pods as fit on the available node pool, and any further jobs would wait until capacity is freed up.
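
(A sketch of the kind of alert that separation would allow, assuming Prometheus with kube-state-metrics; the group name, namespace label, threshold and wording below are all made up.)

groups:
  - name: gocd-elastic-agents
    rules:
      - alert: GoCDAgentPodStuckPending
        # Any agent pod pending for 15 minutes points at an infrastructure
        # problem rather than a normal queue, per the reasoning above.
        expr: sum(kube_pod_status_phase{namespace="gocd-agents", phase="Pending"}) > 0
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: GoCD elastic agent pods have been Pending for over 15 minutes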
