-
Thanks for sharing the data. Would you mind sharing more details, such as RAM, CPU, the number of members in the cluster, etc.?
I am not surprised by this. When a Go process runs up against its memory quota, the runtime will try to reclaim memory by running GC. Refer to https://tip.golang.org/doc/gc-guide
It's recommended to distribute the 2-3k watchers across different watch streams instead of having them all share the same watch stream (see the sketch after this comment). Refer to
There are some existing watcher-related performance issues known in the community. What we can do is try to optimize as much as we can. You also need to run benchmark tests to understand the limits of your system/cluster.
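A minimal sketch of one way to do that distribution, assuming the Go clientv3 API: spread watch registrations over a small pool of clients, each of which maintains its own gRPC watch stream(s). The pool-of-clients approach and the hash-based key assignment are illustrative choices here, not an etcd recommendation.

```go
// Minimal sketch: spread watches over a small pool of etcd clients so that
// they do not all share a single gRPC watch stream. The pool size and the
// hash-based assignment are illustrative choices.
package watchpool

import (
	"context"
	"hash/fnv"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

type WatchPool struct {
	clients []*clientv3.Client
}

// NewWatchPool dials `size` independent clients against the same endpoints.
// Each client maintains its own watch stream(s), so watches placed through
// different clients do not contend on one shared stream.
func NewWatchPool(endpoints []string, size int) (*WatchPool, error) {
	p := &WatchPool{}
	for i := 0; i < size; i++ {
		c, err := clientv3.New(clientv3.Config{
			Endpoints:   endpoints,
			DialTimeout: 5 * time.Second,
		})
		if err != nil {
			return nil, err
		}
		p.clients = append(p.clients, c)
	}
	return p, nil
}

// Watch hashes the key to pick a client, so watches on different keys are
// spread across the pool instead of piling onto one shared stream.
func (p *WatchPool) Watch(ctx context.Context, key string, opts ...clientv3.OpOption) clientv3.WatchChan {
	h := fnv.New32a()
	h.Write([]byte(key))
	c := p.clients[int(h.Sum32())%len(p.clients)]
	return c.Watch(ctx, key, opts...)
}
```

Depending on the client version, a lighter-weight variant is to create several Watcher instances via clientv3.NewWatcher on a single client, since each Watcher keeps its own set of gRPC streams.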
-
It could be system-related, but it's quite unlikely to be something as simple as the open fd limit or anything like that. I've added code to pin watchers to streams (up to 100 per stream) and will have some results once it's rolled out in a day or two.
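For anyone who wants to try something similar, a rough sketch of that kind of pinning, assuming the Go clientv3 API; the cap of 100 and the rollover policy are guesses at the approach described above, not the actual code:

```go
// Rough sketch: pin watches to streams with a per-stream cap. Once the
// current Watcher has maxPerStream watches on it, the next watch rolls
// over to a fresh Watcher, and therefore onto a fresh gRPC watch stream.
package watchpin

import (
	"context"
	"sync"

	clientv3 "go.etcd.io/etcd/client/v3"
)

const maxPerStream = 100 // illustrative cap, mirroring the "up to 100 per stream" above

type PinnedWatcher struct {
	mu      sync.Mutex
	client  *clientv3.Client
	current clientv3.Watcher
	count   int
}

func NewPinnedWatcher(c *clientv3.Client) *PinnedWatcher {
	return &PinnedWatcher{client: c}
}

// Watch places the watch on the current stream, rolling over to a new
// Watcher once the cap is reached. Earlier Watchers are left open on
// purpose: closing them would cancel the watches already pinned to them.
func (p *PinnedWatcher) Watch(ctx context.Context, key string, opts ...clientv3.OpOption) clientv3.WatchChan {
	p.mu.Lock()
	if p.current == nil || p.count >= maxPerStream {
		p.current = clientv3.NewWatcher(p.client)
		p.count = 0
	}
	w := p.current
	p.count++
	p.mu.Unlock()
	return w.Watch(ctx, key, opts...)
}
```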
-
Hi!
We are hitting performance issues when running an etcd cluster with a relatively high number of watchers.
In a steady state the cluster serves 1M+ active watchers (roughly 10k clients; some have only a few watchers, some have up to 2-3k) without any observed issues.
Once any node in the cluster is restarted and clients need to re-establish their watchers on other nodes, cluster performance degrades.
CPU load spikes (interestingly, most of the time is spent in the Go GC) and read/write latencies degrade.
Any recommendations on how the cluster and/or clients could be reconfigured to defuse this kind of node-restart-triggered load spike?