subscription state not getting replicated during geo-replication #21612
I'm trying to understand why my subscription state is not being replicated from my primary cluster to a backup cluster. I can see a replication-related subscription being created on the backup, but during a fail-over event my consumers don't continue consuming from the point where they left off on the primary cluster. When my producer finishes before a fail-over, the replicated subscription's cursor is always set to the start and never progresses on the backup.

What is meant to happen here? My understanding is that the cursor of the replicated subscription should move along as the consumers ack messages on the primary (bar the 1 second [default] delay between snapshots, plus any network RTT and the time it takes for the backup to progress the cursor).

I can prove that messages are being replicated to the backup cluster because I can point my producers to the primary and consume only from the backup. However, during a fail-over I seem to lose messages. The behavior varies based on whether a producer is producing during the fail-over or not.

Some basic config:
Note: full config: https://github.com/wojtekkedzior/wp-automation/tree/master/k8/pulsar3/charts/pulsar

The long story: I'm trying to set up replication across two Pulsar clusters running in Kubernetes on QEMU VMs (full Kubernetes and Pulsar config: https://github.com/wojtekkedzior/wp-automation/blob/master/k8/restoreClusterRoot.sh). I installed the standalone ZooKeeper using Helm (repository: bitnami/zookeeper). I then create the cluster metadata for each cluster:
Set up the primary:
and the backup:
set up replication on the primary:
set up replication on the backup:
Then I create a single topic on the primary.
At this point I have the following cursors on the primary:
while the backup has these:
Now I will start a consumer on the primary (from now on omitting schemaLedgers and compactedLedger from the output of 'stats-internal')
on the primary:
on the backup:
Now I will start a producer that runs against the primary and will continue trickling in messages throughout the remaining steps:
on the backup:
Now I will simulate a fail-over for the consumer client. This will cause the consumer to change over to the backup brokers. Here's the output:
note: prior to the fail over the last message processed by the primary was 27.
While on the backup:
At this point the 'sub' subscription appears on the backup (because the "allowAutoSubscriptionCreation" option is set to true, as per the default) and the consumer "appears" to go on processing.

Now I will remove the fail-over and run the producer again. It will once again run against the primary.

Note: the last processed message ID on the backup was: 4:208:-1

Now that we are back on the primary, the IDs jump to: 4:868:-1
backup:
This is a scenario where the producer keeps running throughout the fail-over, which means that as soon as the producer client switches, the 'sub' subscription on the backup starts getting messages. But the backup starts processing from ID 136, so what happened to the messages between 27 and 135? My understanding is that during a fail-over the consumers should continue roughly where they finished on the 'other' cluster.

In the scenario where the producer writes all of its data to the primary before a fail-over, the consumers get no messages once they switch over. Does this mean that messages are only replicated while both a producer and a consumer are connected and processing data? Again, my understanding is that if the producer finishes sending its data and there is a fail-over during consumption, the consumers should be able to switch over to the backup and continue roughly from where they stopped on the primary (give or take a second or so of data). Once the primary is back up, they should flick back over to it and, again, continue from the point where they left off on the backup (give or take a second or so).

Of course, all this assumes that both clusters can communicate with each other during the fail-over event, which is exactly my case, as I'm only disconnecting the client at the proxy level, outside of the Kubernetes cluster itself.

From what I observe, subscriptions with the same name are created on both clusters, but they are not in sync. They are two separate subscriptions that happen to share a name while running on different clusters.

I would appreciate any ideas as to what I could be missing here and any pointers to where I could troubleshoot this further.
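To make the expectation above concrete, here is a rough back-of-the-envelope sketch in Python. It only does the arithmetic implied by "snapshot delay plus RTT plus the time to progress the cursor"; it is not how the broker computes anything. The function name, the parameter names, and the example rate and RTT values are all made up for illustration, and only the 1 second default snapshot delay comes from the description above.

```python
def failover_overlap(rate_msgs_per_s, snapshot_interval_s=1.0,
                     rtt_s=0.0, apply_delay_s=0.0):
    """Estimate how far the backup cursor may trail the primary.

    Hypothetical helper: returns (lag in seconds, messages that could be
    seen again after a fail-over) under the simple model
    lag = snapshot interval + network RTT + time to apply the update.
    """
    lag_s = snapshot_interval_s + rtt_s + apply_delay_s
    return lag_s, rate_msgs_per_s * lag_s


# Example values are assumptions: 10 msg/s, 50 ms RTT, default 1 s snapshots.
lag, msgs = failover_overlap(rate_msgs_per_s=10, rtt_s=0.05)
print(f"cursor may trail by ~{lag:.2f}s, i.e. ~{msgs:.1f} messages")
```

Under that model a gap of roughly ten messages would be plausible at this rate, whereas a gap of more than a hundred messages (27 to 135) would not be explained by snapshot delay alone.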
Replies: 1 comment 7 replies
Message IDs aren't preserved across clusters in replication, so they are not a way to compare the subscriptions. There are a few known limitations in subscription replication which could result in duplicate messages. For example, replication happens only up to the mark-delete position, and batch message index positions aren't replicated.
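The mark-delete limitation can be illustrated with a small, self-contained Python model. This is a simplified sketch, not broker code: positions are plain integers here (real positions are ledger/entry pairs that differ between clusters), and the function names are hypothetical. It assumes the mark-delete position is the last position up to which every message has been acknowledged, so individual acks beyond a gap are not carried over.

```python
def mark_delete_position(acked, first_position=0):
    """Return the last position p such that first_position..p are all acked.

    Returns first_position - 1 when even the first message is unacked.
    """
    pos = first_position - 1
    while pos + 1 in acked:
        pos += 1
    return pos


def duplicates_after_failover(acked, first_position=0):
    """Positions acked on the primary that the backup would deliver again.

    In this model only the mark-delete position is replicated, so anything
    acked past the first gap is redelivered after a fail-over.
    """
    md = mark_delete_position(acked, first_position)
    return sorted(p for p in acked if p > md)


# Message 3 is still unacked, so acks for 4 and 5 sit beyond the gap.
acked_on_primary = {0, 1, 2, 4, 5}
print(mark_delete_position(acked_on_primary))       # 2
print(duplicates_after_failover(acked_on_primary))  # [4, 5]
```

In this example the backup would resume after position 2, so messages 4 and 5 are consumed twice even though they were already acknowledged on the primary.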