Asymmetric Cluster Formation Problem #7190
Unanswered
ccampanale asked this question in Q&A
This won't work well at the moment. The cluster needs to be aware of the other servers; otherwise they could discover different ones via DNS resolution and end up as a partitioned cluster.
routes = [
nats://${this.cluster.auth.user}:${this.cluster.auth.password}@${this.cluster.name}:${this.cluster.port}
]
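One way to read the advice above is to render one route per known peer instead of a single discovery name. The following is only a sketch under assumptions: `buildRoutes` and `renderRoutesBlock` are hypothetical helpers, the peer addresses and credentials are made up, and how the peer list is obtained (for example, by enumerating every record behind the discovery name) is left to the deployment.

```javascript
// Hypothetical helper: build one explicit route URL per known peer.
// `peers` is assumed to be a list of peer hosts/IPs gathered elsewhere.
function buildRoutes(peers, auth, port) {
  // De-duplicate and sort so every node renders the same list in the same order.
  return [...new Set(peers)]
    .sort()
    .map((host) => `nats://${auth.user}:${auth.password}@${host}:${port}`);
}

// Render the list as a `routes = [ ... ]` fragment for the server config.
function renderRoutesBlock(routes) {
  return ["routes = [", ...routes.map((r) => `  ${r}`), "]"].join("\n");
}

// Example values only: addresses, credentials, and port are placeholders.
const routes = buildRoutes(
  ["10.0.1.12", "10.0.1.10", "10.0.1.11", "10.0.1.10"],
  { user: "route_user", password: "s3cret" },
  6222
);
console.log(renderRoutesBlock(routes));
```

With every peer listed explicitly, all nodes start from the same view of the cluster rather than whatever a single lookup happened to return.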
I run a NATS cluster - 3 nodes in AWS Fargate managed with a Step Functions State Machine for regular restarts and client credentials rotation - which provides the central transport for a Moleculer microservice system.
It's currently running NATS 2.11.7 in Alpine 3.11.1 containers with NodeJS 23.11.0.
The Fargate cluster uses a single AWS Service Discovery namespace with which each task registers after becoming healthy.
Cluster startup is orchestrated with Moleculer services within the container.
This is what the dynamic NATS server configuration looks like:
This has generally worked well, but we've noticed an issue develop in our QA environment that starts daily, where one or more nodes in the cluster end up with fewer routes than the other nodes. For example, our three-node cluster typically has a total of 24 routes: each node has 4 routes to each of the other two nodes, for a total of 8 per node. When the total number of routes is less than 24 (typically this surfaces as 8 + 6 + 6), some services become unreachable by others; a sort of split-brain. We recently saw this issue crop up for the first time in production, which has reignited my desire to understand the root cause.
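The route arithmetic above can be written down as a small check. This is a sketch, not something from our setup: `expectedRoutesPerNode` and `findRouteDeficits` are hypothetical names, and the per-node counts are assumed to have been collected already (for instance from the `num_routes` field of each server's `/routez` monitoring endpoint, if monitoring is enabled).

```javascript
// Expected routes per node in a full mesh: routesPerPeer connections to each other node.
function expectedRoutesPerNode(nodeCount, routesPerPeer) {
  return (nodeCount - 1) * routesPerPeer;
}

// Given observed per-node route counts, report the nodes below the expected figure.
function findRouteDeficits(observed, routesPerPeer) {
  const names = Object.keys(observed);
  const expected = expectedRoutesPerNode(names.length, routesPerPeer);
  return names
    .filter((name) => observed[name] < expected)
    .map((name) => ({
      node: name,
      routes: observed[name],
      missing: expected - observed[name],
    }));
}

// The 8 + 6 + 6 case described above; node names are placeholders.
const observed = { "node-a": 8, "node-b": 6, "node-c": 6 };
const deficits = findRouteDeficits(observed, 4);
console.log(deficits);
```

A check like this could run after each scheduled restart to flag an asymmetric start before services begin failing to reach each other.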
Resources (CPU/memory) look fine, with averages well below reservation and room to spike during periods of increased activity.
I'm stumped on what the issue is. While I know this setup is probably less than ideal in multiple ways (which I invite scrutiny on for my awareness), I have a feeling that this could just boil down to something simple I'm missing in the docs, or just to a limited understanding of NATS clustering.
The thing that stands out to me the most as a possible culprit is the single service discovery route. Unlike the example in the clustering documentation, where a single seed is created without any routes in its config and subsequent servers are then added with a single route to the seed server, our configuration results in a single route common to all nodes that resolves to multiple IP addresses. Perhaps this is what introduces the "randomness" of healthy starts on some days and unhealthy starts on others? 🤷🏻
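To make that suspicion concrete, here is a toy model, not a description of how NATS actually behaves. It assumes each node dials only the address(es) its own lookup returned and that nodes learn about each other only within a connected group. `seedGroups` and the node names are hypothetical, and the model does not attempt to reproduce the 8 + 6 + 6 pattern.

```javascript
// Toy model: group nodes that can reach each other through their initial dials.
// `dials` maps each node to the seed(s) its lookup happened to return.
function seedGroups(dials) {
  const parent = new Map();
  // Union-find lookup: walk up to the representative of x's group.
  const find = (x) => {
    if (!parent.has(x)) parent.set(x, x);
    while (parent.get(x) !== x) x = parent.get(x);
    return x;
  };
  const union = (a, b) => parent.set(find(a), find(b));

  for (const [node, seeds] of Object.entries(dials)) {
    find(node);
    for (const seed of seeds) union(node, seed);
  }

  // Collect nodes by representative.
  const groups = new Map();
  for (const node of parent.keys()) {
    const root = find(node);
    groups.set(root, [...(groups.get(root) ?? []), node]);
  }
  return [...groups.values()].map((g) => g.sort());
}

// One lucky start: the dials chain every node together.
const healthy = seedGroups({ a: ["b"], b: ["c"], c: ["a"] });
// One unlucky start: node c's lookup returns only its own address.
const split = seedGroups({ a: ["b"], b: ["a"], c: ["c"] });
console.log(healthy, split);
```

Under these assumptions, the same configuration yields one group on some starts and two on others, depending purely on which address each lookup returns.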
I did some searching through issues and other discussions and nothing jumped out to me as related, so I figured I would open up a discussion myself. I highly doubt this is an issue with NATS (unless there is a valid conversation to be had about supporting a single service discovery namespace in this manner) so I figured this was the most appropriate approach to soliciting some external perspectives.
I would definitely appreciate any pointers that could help me make some progress in understanding and resolving this strange problem. Thanks in advance!