Asymmetric Cluster Formation Problem #7190

ccampanale · 2025-08-20T22:12:05Z


I run a NATS cluster - 3 nodes in AWS Fargate managed with a Step Functions State Machine for regular restarts and client credentials rotation - which provides the central transport for a Moleculer microservice system.

It's currently running NATS 2.11.7 in Alpine 3.11.1 containers with NodeJS 23.11.0.

The Fargate cluster uses a single AWS Service Discovery namespace with which each task registers after becoming healthy.

resource aws_service_discovery_service nats_clustser_service_discovery_service {
  name = local.service_discovery_service_name

  dns_config {
    namespace_id = aws_service_discovery_private_dns_namespace.nats_cluster_service_discovery_namespace.id

    dns_records {
      ttl  = 10
      type = "A"
    }

    routing_policy = "MULTIVALUE"
  }

  health_check_custom_config {
    failure_threshold = 1
  }
}

Cluster startup is orchestrated with Moleculer services within the container.

1. The ECS Service is started with a desired count of one, which creates a single task and container.
2. A Moleculer broker is started, configured to connect to the local NATS server once it is running within the container.
3. A Moleculer microservice is created and started with the broker. It initializes the NATS server by creating the NATS configuration file (dynamically, because we pull client credentials in dynamically) and then starting the server process.
4. A cluster reconciliation function is called which detects the size of the cluster. If it's less than 3, the desired count is incremented.

This is what the dynamic NATS server configuration looks like:

`port: ${this.clientPort}

monitor_port: ${this.monitorPort}

max_payload: ${this.maxPayloadMB}mb

authorization {

  timeout:  ${this.authorization.timeout}

  users: [
    ${this.authorization.users.map(u => u.toString()).join(',')}
  ]
}

cluster {
    name: ${this.cluster.name}

    port: ${this.cluster.port}

    connect_retries: 5

    authorization {
        user: ${this.cluster.auth.user}
        password: ${this.cluster.auth.password}
        timeout: ${this.cluster.auth.timeout}
    }

    routes = [
        nats://${this.cluster.auth.user}:${this.cluster.auth.password}@${this.cluster.name}:${this.cluster.port}
    ]
}`

This has generally worked well, but we've noticed an issue develop in our QA environment, which starts daily, where one or more nodes in the cluster end up with fewer routes than the other nodes. For example, our three-node cluster typically has a total of 24 routes: each node has 4 routes to each of the other two nodes, for a total of 8 per node. When the total number of routes is less than 24 (typically this surfaces as 8 + 6 + 6), some services become unreachable by others; a sort of split-brain. We recently saw this issue crop up for the first time in production, which has reignited my desire to understand the root cause.
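To make the route arithmetic above concrete, here is a minimal sketch. The function names are hypothetical, and the default of 4 routes per peer is simply the number observed in this cluster, not a value taken from the NATS documentation.

```javascript
// Sketch only: expected route counts for a full mesh, using the numbers from the post.
// routesPerPeer = 4 is the value observed in this cluster (an assumption, not a NATS constant).
function expectedRoutes(nodeCount, routesPerPeer = 4) {
  const perNode = routesPerPeer * (nodeCount - 1);
  return { perNode, total: perNode * nodeCount };
}

// Compares observed per-node route counts against the full-mesh expectation.
function routeDeficit(observedCounts, routesPerPeer = 4) {
  const { perNode, total } = expectedRoutes(observedCounts.length, routesPerPeer);
  const observedTotal = observedCounts.reduce((sum, count) => sum + count, 0);
  return {
    expectedPerNode: perNode,
    expectedTotal: total,
    observedTotal,
    missing: total - observedTotal,
  };
}

console.log(expectedRoutes(3)); // { perNode: 8, total: 24 }
console.log(routeDeficit([8, 6, 6]).missing); // 4
```

For the 8 + 6 + 6 case described above, the observed total is 20, which is 4 short of the expected 24.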

Resource usage (CPU/memory) looks fine, with averages well below the reservation and room to spike during periods of increased activity.

I'm stumped on what the issue is. While I know this setup is probably less than ideal in multiple ways (which I invite scrutiny on for my awareness), I have a feeling that this could just boil down to something simple I'm missing in the docs or to a limited understanding of NATS clustering.

The thing that stands out to me the most as a possible culprit is the single service discovery route. In the clustering documentation example, a single seed server is created without any routes in its config, then subsequent servers are added with a single route to the seed server. Our configuration instead results in a single route common to all nodes, which resolves to multiple IP addresses. Perhaps this is what introduces the "randomness" of healthy starts on some days and unhealthy starts on others? 🤷🏻

I did some searching through issues and other discussions and nothing jumped out to me as related, so I figured I would open up a discussion myself. I highly doubt this is an issue with NATS (unless there is a valid conversation to be had about supporting a single service discovery namespace in this manner) so I figured this was the most appropriate approach to soliciting some external perspectives.

I would definitely appreciate any pointers that could help me make some progress in understanding and resolving this strange problem. Thanks in advance!

wallyqs · 2025-08-20T22:45:06Z

wallyqs
Aug 20, 2025
Maintainer

This won't work well at the moment. The cluster needs to be aware of the other servers; otherwise they could discover different ones via DNS resolution, resulting in a partitioned cluster.

    routes = [
        nats://${this.cluster.auth.user}:${this.cluster.auth.password}@${this.cluster.name}:${this.cluster.port}
    ]

2 replies

wallyqs Aug 20, 2025
Maintainer

One way to try to work around this is to have a way in Fargate to detect that the number of routes is as expected (via /routez). If it is not, restart the container until it gets a DNS resolution that makes it cluster with the rest of the nodes.
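A rough sketch of what such a check against the monitoring endpoint could look like in Node.js follows. The helper names, the `monitorUrl` parameter, and the expected counts are hypothetical. The sketch assumes the `/routez` response carries a `routes` array whose entries include a `remote_id` field, and that the runtime provides a global `fetch` (Node 18 or later).

```javascript
// Hypothetical helper: count routes per remote server in a /routez payload.
function summarizeRoutez(routez) {
  const perRemote = {};
  for (const route of routez.routes ?? []) {
    perRemote[route.remote_id] = (perRemote[route.remote_id] ?? 0) + 1;
  }
  return perRemote;
}

// True when this node has exactly routesPerPeer routes to each of expectedPeers remotes.
function hasExpectedRoutes(routez, expectedPeers, routesPerPeer) {
  const perRemote = summarizeRoutez(routez);
  const remotes = Object.keys(perRemote);
  return (
    remotes.length === expectedPeers &&
    remotes.every((id) => perRemote[id] === routesPerPeer)
  );
}

// Fetches /routez from the local monitoring port and evaluates it.
// monitorUrl is an assumption, e.g. "http://127.0.0.1:8222".
async function checkLocalRoutes(monitorUrl, expectedPeers, routesPerPeer) {
  const response = await fetch(`${monitorUrl}/routez`);
  if (!response.ok) {
    throw new Error(`routez returned HTTP ${response.status}`);
  }
  return hasExpectedRoutes(await response.json(), expectedPeers, routesPerPeer);
}
```

Grouping by remote server rather than comparing only the total makes it visible which peer a node is short on, which a bare route count would hide.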

ccampanale Aug 21, 2025
Author

Thanks for the reply @wallyqs!

> This won't work well at the moment. The cluster needs to be aware of the other servers; otherwise they could discover different ones via DNS resolution, resulting in a partitioned cluster.
>
>     routes = [
>         nats://${this.cluster.auth.user}:${this.cluster.auth.password}@${this.cluster.name}:${this.cluster.port}
>     ]

I would love to better understand your comment here. AFAIK from reading the docs, this published route could be only one node (i.e., the "seed node") in the cluster and each node should be able to become aware of the other nodes due to the gossip protocol. The docs say:

> Because the clustering protocol gossips members of the cluster, all servers are able to discover other server in the cluster. When a server is discovered, the discovering server will automatically attempt to connect to it in order to form a full mesh.

In my case, during startup, the FQDN will resolve to 0, 1, or 2 IP addresses as each node starts. Ultimately 3 IP addresses will be returned, but only after all three nodes are online. So it's never just one IP address, like in the examples, but it's always a node in the cluster. And if the cluster gossips about other nodes to form a quorum, why would this not suffice?

I considered your comment and tried to think of a scenario in which a startup could cause a partitioned cluster. One scenario would be if the first node's IP address was not yet resolving from DNS when the second node starts, causing nodes 1 and 2 to be isolated. When the third node starts, it would presumably connect to only one of the two nodes, causing the partition.

However, in all instances where we've seen this asymmetric cluster formation occur, one node not having routes to another node is never the problem. Our cluster seems to always form with routes to each other node; i.e., they are all aware of each other to some degree. But for some reason, some nodes have fewer routes than others, which correlates with issues with partitioned clients.

I've considered having only the seed node in the FQDN but I fear this would complicate rolling reboots for the cluster. On the other hand, I'm curious if there would be any potential in querying DNS for the service record and populating the configuration with each registered IP address instead of using the FQDN? 🤔
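As a sketch of the DNS idea above, the following resolves the service discovery name to its A records and builds one route URL per address, in the same URL shape as the `routes` entry in the config template. The helper names and the `auth` object shape are hypothetical, and this does not by itself address the timing problem of nodes that have not yet registered.

```javascript
const dns = require('node:dns').promises;

// Builds one route URL per unique address, sorted so the output is deterministic.
function buildRoutes(addresses, { user, password, port }) {
  return [...new Set(addresses)]
    .sort()
    .map((ip) => `nats://${user}:${password}@${ip}:${port}`);
}

// Resolves the service discovery FQDN and returns the route list.
// Returns an empty list when no records are registered yet (e.g. the first node).
async function resolveRoutes(fqdn, auth) {
  try {
    return buildRoutes(await dns.resolve4(fqdn), auth);
  } catch (err) {
    if (err.code === 'ENOTFOUND' || err.code === 'ENODATA') {
      return [];
    }
    throw err;
  }
}
```

The resulting array could then be joined into the `routes = [ ... ]` block of the generated configuration in place of the single FQDN entry.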

Any insight you could share on what I might be missing would be appreciated!

> One way to try to work around this is to have a way in Fargate to detect that the number of routes is as expected (via /routez). If it is not, restart the container until it gets a DNS resolution that makes it cluster with the rest of the nodes.

I've considered adding logic like this to the container healthcheck script as well, but could see multiple potential problems. First, the first node could be healthy for all intents and purposes and not yet have any routes, so the expected number of routes could only determine cluster health after all expected nodes are online. Second, if a node in the cluster is not considered healthy until more nodes are online, I believe this will cause containers to be prematurely destroyed and prevent the service discovery record from being populated.

If I were to take an approach like this, I think it would need to be tactically developed as more of a "cluster health check" rather than a per-node check; e.g., after all three nodes are online and healthy, check the routes for each node, and if any node has fewer than expected, kill it and allow another to start in its place. I'm considering this approach cautiously as I feel there could be some interesting failure modes.
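The decision logic for such a cluster-level check might look roughly like the sketch below. The node object shape (`id`, `healthy`, `numRoutes`) and the function name are hypothetical, and restarting at most one node per pass is a design choice made here to limit churn, not something prescribed in this thread.

```javascript
// Sketch of a cluster-level check: only acts once every expected node is healthy.
// nodes: [{ id, healthy, numRoutes }] (hypothetical shape gathered by the orchestrator).
function pickNodesToRestart(nodes, expectedNodes, routesPerPeer) {
  const healthy = nodes.filter((node) => node.healthy);

  // While the cluster is still forming, low route counts are expected, so do nothing.
  if (healthy.length < expectedNodes) {
    return [];
  }

  const expectedPerNode = routesPerPeer * (expectedNodes - 1);
  const short = healthy
    .filter((node) => node.numRoutes < expectedPerNode)
    .sort((a, b) => a.numRoutes - b.numRoutes);

  // Restart at most one node per pass, then re-evaluate on the next pass.
  return short.slice(0, 1).map((node) => node.id);
}

const decision = pickNodesToRestart(
  [
    { id: 'a', healthy: true, numRoutes: 8 },
    { id: 'b', healthy: true, numRoutes: 6 },
    { id: 'c', healthy: true, numRoutes: 6 },
  ],
  3,
  4
);
console.log(decision); // [ 'b' ]
```

Deferring any action until all expected nodes are healthy avoids the premature-destruction problem described above, at the cost of not catching asymmetry while the cluster is still scaling up.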
