More volumes are causing more IO on RKE2 server nodes #8543
Replies: 4 comments 12 replies
-
Can anyone confirm whether this is normal behaviour, or do I have to dig deeper to see what is causing the higher load? Thanks
-
Hello @serhiynovos, do you mean the number of write operations has increased since installing MinIO? If so, it is likely due to MinIO, and you can investigate why MinIO is writing data to the volumes so frequently. This increase is independent of the type of storage used.
-
We have https://longhorn.io/kb/kubernetes-resource-revision-frequency-expectations/ describing the expectations for API server PUT operations in a stable cluster. In that investigation, we focused on update frequency rather than I/O throughput, but it may help explain the behavior you are observing. There is an improvement planned in #8076 that should help lower the update frequency for engine and volume objects. If you have Prometheus active in the cluster, it may help to run queries like we did in #8114 (comment) to see whether your numbers are in line with the "worst case" scenario for a stable cluster in the knowledge base (i.e. 12 PUTs per minute per engine resource and 12 PUTs per minute per volume resource).
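As a concrete starting point, assuming the cluster exposes the standard `apiserver_request_total` metric (the label names below are the usual upstream ones; adjust them to match your scrape config), a query along these lines shows the PUT rate per Longhorn resource type. 12 PUTs per minute per object works out to about 0.2 PUTs per second per object:

```
sum by (resource) (
  rate(apiserver_request_total{verb="PUT", group="longhorn.io", resource=~"engines|volumes"}[5m])
)
```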
-
@ejweber BTW, I just noticed that on worker nodes where Longhorn has replicas, I see a constant write rate of about 600 kB/s. iostat and iotop show that the writing is done mostly by Longhorn processes to the disk and mount point scheduled for volumes, at about 20-30 kB/s for each PVC. As I now have about 30 replicas on this node, I suppose this is the expected behavior and write rate.
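A quick sanity check on the numbers reported above: if each of the ~30 replicas writes roughly 20-30 kB/s, the aggregate should bracket the observed node-level rate. A minimal sketch using those reported figures:

```python
# Figures taken from the comment above: ~30 replicas on the node,
# each writing roughly 20-30 kB/s per PVC according to iostat/iotop.
replicas = 30
per_replica_low_kbps = 20
per_replica_high_kbps = 30

total_low = replicas * per_replica_low_kbps    # aggregate lower bound, kB/s
total_high = replicas * per_replica_high_kbps  # aggregate upper bound, kB/s

print(f"expected aggregate write rate: {total_low}-{total_high} kB/s")
```

The observed ~600 kB/s sits at the low end of the 600-900 kB/s range, so the per-replica and node-level measurements are consistent with each other.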
-
I'm managing an RKE2 cluster with Longhorn 1.6.1 in an on-prem setup. Yesterday, I installed MinIO using 9 volumes, each configured without replication. I've noticed an increase in the write rate. Although the increase is modest compared to what we saw on the previous version, 1.5.4, our future plans involve deploying approximately 300 additional volumes, most with a replication factor of 3. I'm concerned about the potential I/O load on my server/etcd nodes. Could Longhorn handle this increased load without prematurely wearing out the SSDs on the server nodes? Is this behavior normal, or should I be looking into potential issues?
This is my current state with replicas:
And this is how the IO rate changed after installing MinIO with a total of 9 replicas:
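To get a feel for the scale of the planned deployment, a rough worst-case estimate of the API server PUT traffic can be derived from the figures in the Longhorn knowledge-base article cited elsewhere in this thread (12 PUTs per minute per engine object and 12 per volume object in a stable cluster). This sketch assumes one engine object per attached volume; replica count does not enter the calculation, since the KB figures are per engine/volume resource:

```python
# Worst-case stable-cluster estimate for ~300 additional volumes, using the
# per-object PUT rates from the Longhorn KB article (12/min each for the
# engine and volume resources; one engine per attached volume assumed).
volumes = 300
puts_per_min_per_volume = 12   # volume resource updates
puts_per_min_per_engine = 12   # engine resource updates

total_puts_per_min = volumes * (puts_per_min_per_volume + puts_per_min_per_engine)
print(total_puts_per_min, "PUTs/min =", total_puts_per_min / 60, "PUTs/s")
```

That works out to on the order of 120 PUTs/s against the API server in the worst case, which is the kind of sustained etcd write load worth measuring before the rollout; the improvement tracked in #8076 aims to reduce these per-object rates.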