[BUG] During CPM of hibernated clusters, two snapshots can be taken at the same time causing issues when `etcd-main` is restored #763

plkokanov · 2024-08-21T06:38:46Z

Describe the bug:

During the deletion of a hibernated cluster for which the control plane was previously migrated to a different seed, the backup-restore container of the etcd-main-0 pod could not be started successfully with the following error:

time="2024-08-20T18:30:10Z" level=error msg="Failed initialization: error while restoring corrupt data: failed to restore snapshot: mismatched event revision while applying delta snapshot, expected 210 but applied 419 " actor=backup-restore-server

This issue can occur if two snapshots are taken at the same time. One possibility for this occurring during control plane migration is that in the migrate phase, gardenlet instructs etcd-main to take a special final full snapshot. It is possible that during this time (or shortly before that) an incremental (delta) snapshot is created in-memory and the process of pushing it to the bucket had already started.

So far we have only observed this behaviour for hibernated shoots. This means that during the migrate phase the etcd-main-0 pod will be created (woken up), gardenlet will take a final full snapshot, and then the etcd-main-0 pod will be deleted.

Expected behavior:

An incremental and full (final) snapshot should not be pushed to the backup bucket with the same timestamp and overlapping revisions so that etcd-main-0 can be successfully restored from backup during the restore phase of control plane migration.

How To Reproduce (as minimally and precisely as possible):

This issue occurs rarely and is hard to reproduce

Logs:
Here are logs from an occurrence of this issue in one of our testmachinery tests:

time="2024-08-20T18:30:06Z" level=info msg="Restoring from base snapshot: Full-00000000-00000210-1719664221.gz" actor=restorer
time="2024-08-20T18:30:04Z" level=info msg="Restoring from base snapshot: Full-00000000-00000210-1719664221.gz" actor=restorer
time="2024-08-20T18:30:04Z" level=info msg="start decompressing the snapshot with gzip compressionPolicy" actor=de-compressor
time="2024-08-20T18:30:04Z" level=info msg="successfully fetched data of base snapshot in 0.0054225 seconds [CompressionPolicy:gzip]" actor=restorer
{"level":"info","ts":1724178604.1535509,"caller":"membership/cluster.go:392","msg":"added member","cluster-id":"cdf818194e3a8c32","local-member-id":"0","added-peer-id":"8e9e05c52164694d","added-peer-peer-urls":["http://localhost:2380"]}
time="2024-08-20T18:30:04Z" level=info msg="Attempting to apply 1 delta snapshots for restoration." actor=restorer
time="2024-08-20T18:30:04Z" level=info msg="Creating temporary directory /var/etcd/data/restoration.temp for persisting delta snapshots locally." actor=restorer
time="2024-08-20T18:30:04Z" level=info msg="Starting an embedded etcd server..." actor=restorer
...
time="2024-08-20T18:30:06Z" level=info msg="Applying delta snapshots..." actor=restorer
time="2024-08-20T18:30:06Z" level=info msg="Applying first delta snapshot Incr-00000002-00000210-1719664221.gz" actor=restorer
time="2024-08-20T18:30:06Z" level=info msg="start decompressing the snapshot with gzip compressionPolicy" actor=de-compressor
time="2024-08-20T18:30:06Z" level=info msg="successfully decompressed data of delta snapshot in 0.002845945 seconds [CompressionPolicy:gzip]" actor=restorer
time="2024-08-20T18:30:06Z" level=info msg="Applying first delta snapshot Incr-00000002-00000210-1719664221.gz" actor=restorer
time="2024-08-20T18:30:07Z" level=info msg="Responding to status request with: Progress" actor=backup-restore-server
{"level":"info","ts":"2024-08-20T18:30:07.101Z","caller":"etcdserver/server.go:1485","msg":"skipped leadership transfer for single voting member cluster","local-member-id":"8e9e05c52164694d","current-leader-member-id":"8e9e05c52164694d"}
{"level":"info","ts":"2024-08-20T18:30:07.105Z","caller":"embed/etcd.go:363","msg":"closing etcd server","name":"default","data-dir":"/var/etcd/data/new.etcd.part","advertise-peer-urls":["http://localhost:0"],"advertise-client-urls":["http://localhost:0"]}
{"level":"info","ts":"2024-08-20T18:30:07.105Z","caller":"etcdserver/server.go:1485","msg":"skipped leadership transfer for single voting member cluster","local-member-id":"8e9e05c52164694d","current-leader-member-id":"8e9e05c52164694d"}
{"level":"info","ts":"2024-08-20T18:30:07.105Z","caller":"embed/etcd.go:558","msg":"stopping serving peer traffic","address":"127.0.0.1:43937"}
{"level":"info","ts":"2024-08-20T18:30:07.105Z","caller":"embed/etcd.go:565","msg":"stopped serving peer traffic","address":"127.0.0.1:43937"}
{"level":"info","ts":"2024-08-20T18:30:07.105Z","caller":"embed/etcd.go:367","msg":"closed etcd server","name":"default","data-dir":"/var/etcd/data/new.etcd.part","advertise-peer-urls":["http://localhost:0"],"advertise-client-urls":["http://localhost:0"]}
time="2024-08-20T18:30:07Z" level=error msg="Failed initialization: error while restoring corrupt data: failed to restore snapshot: mismatched event revision while applying delta snapshot, expected 210 but applied 419 " actor=backup-restore-server

The snapshots that were present in the bucket were:

Sadly logs from the migrate phase were no-longer present

Environment (please complete the following information):

Etcd-backup-restore version/commit ID: [email protected]`

Anything else we need to know?:

The text was updated successfully, but these errors were encountered:

plkokanov added the kind/bug Bug label Aug 21, 2024

renormalize self-assigned this Oct 21, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[BUG] During CPM of hibernated clusters, two snapshots can be taken at the same time causing issues when `etcd-main` is restored #763

[BUG] During CPM of hibernated clusters, two snapshots can be taken at the same time causing issues when `etcd-main` is restored #763

plkokanov commented Aug 21, 2024

[BUG] During CPM of hibernated clusters, two snapshots can be taken at the same time causing issues when etcd-main is restored #763

[BUG] During CPM of hibernated clusters, two snapshots can be taken at the same time causing issues when etcd-main is restored #763

Comments

plkokanov commented Aug 21, 2024

[BUG] During CPM of hibernated clusters, two snapshots can be taken at the same time causing issues when `etcd-main` is restored #763

[BUG] During CPM of hibernated clusters, two snapshots can be taken at the same time causing issues when `etcd-main` is restored #763