
Watch response revision can be lower than the required starting revision #19487

Open · Congyu-Liu opened this issue Feb 26, 2025 · 3 comments

@Congyu-Liu

What happened?

Hi developers! When testing etcd, we found a bug that can cause a watch response revision to be less than the required starting revision (e.g., watch key --rev 4 returns a response with revision=3).

What did you expect to happen?

The watch response revision should not be less than the starting revision.

How can we reproduce it (as minimally and precisely as possible)?

The following steps trigger this bug on a follower node (a request-level sketch for step 1 follows the list):

  1. The node receives a watch request with a starting revision in the future (i.e., greater than its current revision).
  2. It restores a snapshot from the leader node ((s *watchableStore) Restore(b backend.Backend) is called), e.g., after a network partition; at this point, the watch revision is still greater than the current revision.
  3. syncWatchers() is called.
  4. Some put()s are executed.
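
For step 1, here is a minimal sketch at the gRPC API level, assuming the go.etcd.io/etcd/api/v3/etcdserverpb types; the key and revision values are illustrative, not taken from an actual run:

    package main

    import (
        "fmt"

        pb "go.etcd.io/etcd/api/v3/etcdserverpb"
    )

    func main() {
        // Step 1: a watch create request whose StartRevision is assumed to be
        // greater than the store's current revision.
        req := &pb.WatchRequest{
            RequestUnion: &pb.WatchRequest_CreateRequest{
                CreateRequest: &pb.WatchCreateRequest{
                    Key:           []byte("key"),
                    StartRevision: 1000, // far ahead of the current revision
                },
            },
        }
        fmt.Println(req.GetCreateRequest().StartRevision)
    }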

It seems that when restoring the snapshot, the node will move the synced watchers to the unsynced group:

for wa := range s.synced.watchers {
    wa.restore = true
    s.unsynced.add(wa)
}

Later, when syncWatchers() is called, it lowers the watcher's minRev to curRev + 1 and moves the watcher back to the synced group; subsequent put()s are then notified to this watcher even though their revisions are less than the original watch revision:

for w := range wg.watchers {
    if w.minRev < compactionRev {
        // Skip the watcher that failed to send compacted watch response due to w.ch is full.
        // Next retry of syncWatchers would try to resend the compacted watch response to w.ch
        continue
    }
    w.minRev = curRev + 1
    eb, ok := wb[w]
    if !ok {
        // bring un-notified watcher to synced
        s.synced.add(w)
        s.unsynced.delete(w)
        continue
    }
    // ...
}
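
One possible direction for a fix, sketched below as a hypothetical helper rather than an actual etcd patch, would be to avoid lowering a minRev that still points into the future:

    package main

    import "fmt"

    // nextMinRev is a hypothetical helper (an assumption, not etcd code): it
    // computes a watcher's new minRev without ever lowering a start revision
    // that is still ahead of the store's current revision.
    func nextMinRev(minRev, curRev int64) int64 {
        if minRev > curRev+1 {
            // The watcher asked to start at a future revision; keeping it
            // means events below that revision are never delivered.
            return minRev
        }
        return curRev + 1
    }

    func main() {
        // A restored watcher with start revision 1000 while the store is at 3:
        fmt.Println(nextMinRev(1000, 3)) // 1000, instead of the unconditional 4
    }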

Hope this is helpful.

Anything else we need to know?

No response

Etcd version (please run commands below)

3.5.18

Etcd configuration (command line flags or environment variables)

No response

Etcd debug information (please run commands below, feel free to obfuscate the IP address or FQDN in the output)

No response

Relevant log output

No response

@serathius
Member

Don't think we provide any guarantees about the RV in watch responses, as long as it's not a progress notification. It is just the local etcd RV at the time the response was created. As responses are queued up to 100 deep, they can be arbitrarily stale.

I don't think the current semantics of the RV in watch responses are useful, and fixing the issue you described would not make them any more useful.

@Congyu-Liu
Author

Hi @serathius! Thank you for the response.

What we are observing is that the watch request does not respect the given start revision: it returns responses with a lower revision than the start revision. I think the watch start-revision semantics are clearly defined in the API documentation:

Start_Revision - An optional revision for where to inclusively begin watching. If not given, it will stream events following the revision of the watch creation response header revision. The entire available event history can be watched starting from the last compaction revision.

and the robustness model:

if e.Revision < watch.Revision || !e.Match(watch) {
    continue
}
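
To restate the invariant that the snippet encodes, here is a small self-contained sketch; the Event type and function name are hypothetical stand-ins, not the actual robustness-model code:

    package main

    import "fmt"

    // Event is a hypothetical stand-in for the robustness model's event type.
    type Event struct {
        Revision int64
    }

    // violatesStartRevision reports whether any delivered event sits below
    // the watch's start revision, which the model treats as a violation.
    func violatesStartRevision(events []Event, startRev int64) bool {
        for _, e := range events {
            if e.Revision < startRev {
                return true
            }
        }
        return false
    }

    func main() {
        // The reported case: a response with revision 3 on a watch started at 4.
        fmt.Println(violatesStartRevision([]Event{{Revision: 3}}, 4)) // true
    }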

I am not sure this is related to staleness (or maybe I am missing something).

@Congyu-Liu
Author

Congyu-Liu commented Feb 27, 2025

@serathius Below are the steps to reproduce (a client-side sketch follows the list). In summary, we first send a watch request with a future starting revision to a follower, then trigger the leader to send a snapshot to this node, and finally run some puts.

  1. Set up a 3-node cluster, say node A (leader), B (follower), and C (follower); --snapshot-count and --experimental-snapshot-catchup-entries are set to 1 to increase the chance of transferring a snapshot.
  2. Send watch k --prefix --rev 1000 to B.
  3. Partition A <-> B; then run some put requests and move the leader to C. This should effectively trigger C to send snapshots to B, which is lagging behind.
  4. Run put requests with keys under the prefix k. We then see responses from the previous watch request, even though the current revision is less than 1000 (the expected behavior is that we see nothing before the revision reaches 1000).
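
A minimal sketch of steps 2 and 4 with clientv3; the endpoint, key names, and values are illustrative assumptions, not from an actual run:

    package main

    import (
        "context"
        "fmt"

        clientv3 "go.etcd.io/etcd/client/v3"
    )

    func main() {
        // Step 2: connect to follower B (endpoint is illustrative).
        cli, err := clientv3.New(clientv3.Config{Endpoints: []string{"127.0.0.1:2379"}})
        if err != nil {
            panic(err)
        }
        defer cli.Close()

        const startRev = 1000
        wch := cli.Watch(context.Background(), "k", clientv3.WithPrefix(), clientv3.WithRev(startRev))

        // Step 4 (run after the partition and snapshot transfer of step 3):
        // a few puts under the watched prefix, all far below revision 1000.
        go func() {
            for i := 0; i < 5; i++ {
                if _, err := cli.Put(context.Background(), fmt.Sprintf("k%d", i), "v"); err != nil {
                    panic(err)
                }
            }
        }()

        for resp := range wch {
            for _, ev := range resp.Events {
                if ev.Kv.ModRevision < startRev {
                    // With the bug, this fires; the expected behavior is
                    // silence until the store revision reaches 1000.
                    fmt.Printf("unexpected event at revision %d (< %d)\n", ev.Kv.ModRevision, startRev)
                }
            }
        }
    }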

Hope it helps.
