Add resources requests and limits management #127

jusabatier · 2025-01-21T08:20:54Z

This PR add the ability to set for each app requests and limits for CPU and RAM.

Related to this issue : #102

edevosc2c · 2025-01-21T10:14:18Z

Thank you Julien, my points at the end of my last comment in the issue #102 about java and memory limits still apply in order to merge this PR.

There is a real need to investigate, last thing I want is to get apps that restart every hour because the app doesn't limit its memory itself in order to avoid an OOMKill.

The tests are failing probably because you need to use helm lint in order to fix the linting issues: https://github.com/georchestra/helm-georchestra/actions/runs/12882804969/job/35915756058?pr=127

jusabatier · 2025-01-21T11:22:01Z

Thank you Julien, my points at the end of my last comment in the issue #102 about java and memory limits still apply in order to merge this PR.

Not sure to understand how to test this, does it imply to modify docker's images used ?
Is it a test you will perform on your instance ? Or I have to test it by myself ?

As explained in README update, the default values provided are provided for testing purposes and for prod, they have to be tuned depending on platform usage and traffic.

The tests are failing probably because you need to use helm lint in order to fix the linting issues

I tried but nothing is wrong :

$ helm lint
==> Linting .
[INFO] Chart.yaml: icon is recommended
[WARNING] /home/jusabatier/Workspace/git/helm-georchestra: chart directory is missing these dependencies: postgresql,rabbitmq

1 chart(s) linted, 0 chart(s) failed

edevosc2c · 2025-01-21T11:59:06Z

Not sure to understand how to test this, does it imply to modify docker's images used ? Is it a test you will perform on your instance ? Or I have to test it by myself ?

As explained in README update, the default values provided are provided for testing purposes and for prod, they have to be tuned depending on platform usage and traffic.

You have multiple ways to test that.

The major outcome is to see if java is going to limit its memory usage. So you can either set a low memory limit or/and stress test the application so that it uses as much memory as possible, then see if it reacts correctly without any OOMKill.

I tried but nothing is wrong :

Ha, yes indeed, it's not helm lint but ct lint. You have the output here: https://github.com/georchestra/helm-georchestra/actions/runs/12882804969/job/35915756058?pr=127

jusabatier · 2025-01-21T14:40:38Z

I investigated a little over JVM parameters to use with containers, here is what I found :

Java 10 introduced +UseContainerSupport (enabled by default) which makes the JVM use sane defaults in a container environment. This feature is backported to Java 8 since 8u191
-XX:+UseContainerSupport allows the JVM to read cgroup limits like available CPUs and RAM from the host machine and configure itself accordingly. Doing so allows the JVM to die with an OutOfMemoryError instead of the container being killed
The old (and somewhat broken) flags -XX:{Min|Max}RAMFraction are now deprecated. There is a new flag -XX:MaxRAMPercentage, that takes a value between 0.0 and 100.0 and defaults to 25.0
Setting -Xmx and -Xms disables the automatic heap sizing

Sources :

Summary :

We should remove -Xmx and -Xms and replace with +UseContainerSupport combined with -XX:MaxRAMPercentage=75.0

edevosc2c · 2025-01-21T15:36:33Z

Thank you for the details @jusabatier. I wasn't aware of Setting -Xmx and -Xms disables the automatic heap sizing which is probably what posed some issues with geo2france geoserver when I tried to limit the memory.

Actually, we have an open issue for removing -Xmx and -Xms here: georchestra/georchestra#4354

We should remove -Xmx and -Xms and replace with +UseContainerSupport combined with -XX:MaxRAMPercentage=75.0

Yes, very good idea, this will be a requirement for merging the PR. We can continue this discussion at georchestra/georchestra#4354

Is +UseContainerSupport really important to add if it's enabled by default?

jusabatier · 2025-01-21T15:48:17Z

Is +UseContainerSupport really important to add if it's enabled by default?

I don't think, but if for whatever reason the default change, it's more safe to add it.
Also it can suggest new devops to search what it implies and document themselves.

In the article, they also suggest it can be benefic to :

setup visible processor number to 2x the container limit (so it should be customisable with ENV var or entrypoint.sh)
=> -XX:ActiveProcessorCount=<2x yourCpuLimit>
setup the garbage collector with -XX:+UseParallelGC

Depending on how the app uses threading.

jusabatier · 2025-01-22T08:52:13Z

I read an article @pmauduit share to me on the IRC and they suggest it's better to set :

No CPU limit, only request : Pods always get the CPU requested by their CPU request. They can take advantage of excess CPU too if they have no limit.

Source : https://home.robusta.dev/blog/stop-using-cpu-limits

Do you think we should remove the CPU limits ?

edevosc2c · 2025-01-22T11:47:22Z

I don't think, but if for whatever reason the default change, it's more safe to add it.

If it says that it's default, we shouldn't try to override default with the same value. That would be confusing.

In the article, they also suggest it can be benefic to :

* setup visible processor number to 2x the container limit (so it should be customisable with ENV var or entrypoint.sh)
  => `-XX:ActiveProcessorCount=<2x yourCpuLimit>`

* setup the garbage collector with `-XX:+UseParallelGC`

After reading paketo-buildpacks/libjvm#136 (comment), we can add ActiveProcessorCount but only for the apps still on java 8/11.

No CPU limit, only request : Pods always get the CPU requested by their CPU request. They can take advantage of excess CPU too if they have no limit.

As of right now, we recommend the helm chart to only be deployed on a single node in a Kubernetes cluster.

Requests are useful when you have multiple nodes in the cluster so that Kubernetes understand how to spread all the different apps across all the nodes. So in our case, Requests are useless (even though we are still required to set them for limiting the resources).

I'm still in favor of having limits because there are some apps like geoserver that are very resource hungry and in all of our clients infra, if these apps consume too many resources then this leads to a slow-down of the other apps which is not ideal. So for users using the default settings, we set a good practice to restrict the apps from consuming all the resources and leading to a general slow-down of the geOrchestra platform.

setup the garbage collector with -XX:+UseParallelGC

No idea about this one if we should add it or not. Any ideas @pmauduit?

jusabatier · 2025-01-22T14:42:26Z

As of right now, we recommend the helm chart to only be deployed on a single node in a Kubernetes cluster.

I didn't know about that. Is there a specific reason that forces to deploy on a single node ?

Also, I wonder if current values aren't too high for testing purposes ?
I think we should set low default values (allow to use for test on small clusters like minikube or k3s), and may provide a table like the one above for different usage (medium prod, high prod, ...)
Because IMHO, a prod instance must have customized values apapted to the traffic it handle.

edevosc2c · 2025-01-22T19:03:13Z

Also, I wonder if current values aren't too high for testing purposes ?

Well if someone is using the helm chart in Kubernetes it's to deploy geOrchestra in production. If he wants to test georchestra in a test environment, we have the docker composition: https://github.com/georchestra/docker.

So I would be inclined to continue building this helm chart for production purpose. At least that's why Camptocamp develops this helm chart. In my opinion, the default values should reflect what an average production environment might be.

But we can for sure guide people to adapt these limits.

I didn't know about that. Is there a specific reason that forces to deploy on a single node ?

We just haven't tested it. At camptocamp we rely on the fact that it's a single node because it makes managing the data much easier, we have a central "SFTP" server in order to access all the various PVC created by the modules.

And usually in a multi-node scenario, you need to use NFS in order to have the data accessible from any server in the cluster. And NFS is slower than accessing the data from the drive itself because it's over the network. While you can use longhorn to be able to move the data from a server to another, there is still a performance hit.

Since all the modules weren't designed with Kubernetes in mind nor multiserver in mind, it's possible that you could get worse performance than using a single server. Finally, the majority of the modules can't be highly available so you won't benefit from multiple servers compared to a single server.

edevosc2c · 2025-02-04T10:25:53Z

There is going to be soon a requirement for this because we are going to do our first multiserver deployment of geOrchestra.

jusabatier · 2025-02-13T10:08:36Z

Summary of remaining tasks to validate this PR :

Remove the -Xms -Xmx JAVA options from the jetty docker images
=> Need a PR in georchestra's repository
Use extra_environment in values.yaml to define java options for each app
=> JAVA_OPTIONS="-XX:MaxRAMPercentage=75.0 -XX:+UseParallelGC"
=> For app that still on JAVA 8/11 we could also set -XX:ActiveProcessorCount=<2x appCpuLimit> and document it

Are you agree with this ?
Is there other tasks I forget ?

edevosc2c · 2025-02-13T14:45:21Z

Thanks, Julien, for the follow-up.

Remove the -Xms -Xmx JAVA options from the jetty docker images
=> Need a PR in georchestra's repository

Yes, that would be appreciated. We will need to limit the memory in the docker composition if we remove those: https://github.com/georchestra/docker

Use extra_environment in values.yaml to define java options for each app
=> JAVA_OPTIONS="-XX:MaxRAMPercentage=75.0 -XX:+UseParallelGC"
=> For app that still on JAVA 8/11 we could also set -XX:ActiveProcessorCount=<2x appCpuLimit> and document it

I think this can be included directly inside the kubernetes manifest (YAML files) and not in the values.yaml files.

It's some parameters that will fit everyone needs. There is no point in allowing to change those.

If one wants to bump up the limits, he can do it using "limits" feature in Kubernetes.

jusabatier · 2025-02-14T08:05:18Z

I think this can be included directly inside the kubernetes manifest (YAML files) and not in the values.yaml files.
It's some parameters that will fit everyone needs. There is no point in allowing to change those.
If one wants to bump up the limits, he can do it using "limits" feature in Kubernetes.

Agree for -XX:MaxRAMPercentage=75.0 -XX:+UseParallelGC, but for -XX:ActiveProcessorCount=<2x appCpuLimit> I think it should be set in values.yaml and ocumented as it depends on the cpu limit param.

For the MaxRAMPercentage, should I use 75 or 80 ?

edevosc2c · 2025-02-14T09:03:28Z

For the MaxRAMPercentage, should I use 75 or 80 ?

80 seems great.

but for -XX:ActiveProcessorCount=<2x appCpuLimit> I think it should be set in values.yaml and ocumented as it depends on the cpu limit param.

Can you use helm to calculate the number? "number of CPU in the limits" multiply by "2"

jusabatier · 2025-02-14T10:19:43Z

Can you use helm to calculate the number? "number of CPU in the limits" multiply by "2"

Will not be simple as the CPU limit can be defined as plain number or in m, I think it will complicate the chart :

1/ Define a function to handle milliCPU conversion :

{{- define "toCPU" -}}
  {{- $cpu := . -}}
  {{- if contains "m" $cpu -}}
    {{- /* If milliCPU, divide by 1000 */ -}}
    {{- $cpuInt := . | replace "m" "" | toInt -}}
    {{- $cpuFloat := float64 $cpuInt | mul 0.001 -}}
    {{- $cpuFloat | printf "%f" -}}
  {{- else -}}
    {{- /* Else, do nothing */ -}}
    {{- $cpu -}}
  {{- end -}}
{{- end -}}

2/ Define the value in deployment :

env:
  - name: JAVA_OPTIONS
    value: "-XX:ActiveProcessorCount={{ .Values.resources.limits.cpu | toCPU | multiply 2 | round | max 1 }}"

For the docker images update, I did the PR for :

georchestra
cas
mapstore

Remain :

GeoNetwork => Lot of modules, docker confs... I don't know enough about the structure to do it properly.
Gateway => No Dockerfile in project, look like it should be defined in JAVA_TOOL_OPTIONS env var but not sure, I let someone that know how it works modify this component
ogc_api_records ?

edevosc2c · 2025-02-17T11:38:21Z

Re-reading about the ActiveProcessorCount, this seems necessary ONLY for apps running java 8/11 and using spring-boot based docker image: paketo-buildpacks/libjvm#136 (comment)

Because if you do not set it, then the default logic in Java is fine, if you set 10 CPU in the Kubernetes limits then Java will set "Effective CPU Count" to 10. On java versions higher than 8.

I can't find any feedback about users using ActiveProcessorCount and with Kubernetes that favor to use this parameter. I can only find some articles and some of them do not even recommend, they just state that the parameter exist.

The only case where I found users talking about ActiveProcessorCount, is when java thinks it has all the CPU cores available but it doesn't because there are some cpu limits. Then people use "ActiveProcessorCount" to workaround this issue.

We can skip this parameter in the helm chart for now. We can however document it so that people are aware of it.

GeoNetwork => Lot of modules, docker confs... I don't know enough about the structure to do it properly.
Gateway => No Dockerfile in project, look like it should be defined in JAVA_TOOL_OPTIONS env var but not sure, I let someone that know how it works modify this component
ogc_api_records ?

Gateway Dockerfile is generated from maven, I think. You will have to find how to customize it, I don't remember how.

Geonetwork, it's there: https://github.com/georchestra/geonetwork/blob/georchestra-gn4.4.x/web/src/docker/Dockerfile

ogc_api_records, I don't know. Ask on matrix

jusabatier · 2025-02-18T11:05:45Z

I did all the mentioned PR :

CAS : Update resources management in Dockerfile georchestra-cas-server#33
Gateway : Update resources management in Docker georchestra-gateway#185
GeoNetwork : Update resources management in Dockerfile geonetwork#314
GeoNetwork (gn-cloud-ogc-api-records-service) : Update resources management in Docker geonetwork-microservices#11
GeOrchestra : Change resources management in Dockerfile georchestra#4411
Mapstore : Update resource management in Dockerfile mapstore2-georchestra#731

Let me know if there is anything else I can do for this PR.

edevosc2c · 2025-02-19T11:13:32Z

Thanks @jusabatier.

I'm going to wait for @f-necas to come back from leave next Monday in order to merge all the PR and discuss releasing a new geOrchestra docker images.

Add resources requests and limits management

e39b6f7

Lint corrections

b601273

This was referenced Feb 14, 2025

Update resources management in Dockerfile georchestra/georchestra-cas-server#33

Open

Change resources management in Dockerfile georchestra/georchestra#4411

Open

jusabatier mentioned this pull request Feb 14, 2025

Update resource management in Dockerfile georchestra/mapstore2-georchestra#731

Open

This was referenced Feb 18, 2025

Update resources management in Dockerfile georchestra/geonetwork#314

Open

Update resources management in Docker georchestra/geonetwork-microservices#11

Open

Update resources management in Docker georchestra/georchestra-gateway#185

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add resources requests and limits management #127

Add resources requests and limits management #127

jusabatier commented Jan 21, 2025

edevosc2c commented Jan 21, 2025 •

edited

Loading

jusabatier commented Jan 21, 2025

edevosc2c commented Jan 21, 2025 •

edited

Loading

jusabatier commented Jan 21, 2025

edevosc2c commented Jan 21, 2025

jusabatier commented Jan 21, 2025

jusabatier commented Jan 22, 2025

edevosc2c commented Jan 22, 2025

jusabatier commented Jan 22, 2025

edevosc2c commented Jan 22, 2025

edevosc2c commented Feb 4, 2025

jusabatier commented Feb 13, 2025

edevosc2c commented Feb 13, 2025

jusabatier commented Feb 14, 2025

edevosc2c commented Feb 14, 2025 •

edited

Loading

jusabatier commented Feb 14, 2025 •

edited

Loading

edevosc2c commented Feb 17, 2025 •

edited

Loading

jusabatier commented Feb 18, 2025

edevosc2c commented Feb 19, 2025

Add resources requests and limits management #127

Are you sure you want to change the base?

Add resources requests and limits management #127

Conversation

jusabatier commented Jan 21, 2025

edevosc2c commented Jan 21, 2025 • edited Loading

jusabatier commented Jan 21, 2025

edevosc2c commented Jan 21, 2025 • edited Loading

jusabatier commented Jan 21, 2025

edevosc2c commented Jan 21, 2025

jusabatier commented Jan 21, 2025

jusabatier commented Jan 22, 2025

edevosc2c commented Jan 22, 2025

jusabatier commented Jan 22, 2025

edevosc2c commented Jan 22, 2025

edevosc2c commented Feb 4, 2025

jusabatier commented Feb 13, 2025

edevosc2c commented Feb 13, 2025

jusabatier commented Feb 14, 2025

edevosc2c commented Feb 14, 2025 • edited Loading

jusabatier commented Feb 14, 2025 • edited Loading

edevosc2c commented Feb 17, 2025 • edited Loading

jusabatier commented Feb 18, 2025

edevosc2c commented Feb 19, 2025

edevosc2c commented Jan 21, 2025 •

edited

Loading

edevosc2c commented Jan 21, 2025 •

edited

Loading

edevosc2c commented Feb 14, 2025 •

edited

Loading

jusabatier commented Feb 14, 2025 •

edited

Loading

edevosc2c commented Feb 17, 2025 •

edited

Loading