[BUG] During CPM of hibernated clusters, two snapshots can be taken at the same time, causing issues when etcd-main is restored
#763
Labels
kind/bug
Bug
Describe the bug:
During the deletion of a hibernated cluster whose control plane had previously been migrated to a different seed, the backup-restore container of the etcd-main-0 pod could not be started successfully with the following error:
This issue can occur if two snapshots are taken at the same time. One way this can happen during control plane migration is that, in the migrate phase, gardenlet instructs etcd-main to take a special final full snapshot. It is possible that at this time (or shortly before it) an incremental (delta) snapshot has been created in memory and the process of pushing it to the bucket has already started.
So far we have only observed this behaviour for hibernated shoots. This means that during the migrate phase the etcd-main-0 pod will be created (woken up), gardenlet will take a final full snapshot, and then the etcd-main-0 pod will be deleted.
Expected behavior:
An incremental snapshot and a full (final) snapshot should not be pushed to the backup bucket with the same timestamp and overlapping revisions, so that etcd-main-0 can be successfully restored from backup during the restore phase of control plane migration.
How To Reproduce (as minimally and precisely as possible):
This issue occurs rarely and is hard to reproduce.
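The condition described under "Expected behavior" (same timestamp and overlapping revisions) can be written down as a small check. The following is only a sketch: the Snapshot type, its field names, and the sample values are hypothetical and chosen for illustration; they are not taken from etcd-backup-restore or from the affected bucket.

```go
package main

import "fmt"

// Snapshot is a hypothetical, simplified description of a snapshot object
// in the backup bucket. It is NOT the type used by etcd-backup-restore.
type Snapshot struct {
	Kind          string // "Full" or "Incr" (assumed labels)
	StartRevision int64  // first etcd revision covered by the snapshot
	LastRevision  int64  // last etcd revision covered by the snapshot
	CreatedOn     int64  // creation time as a Unix timestamp in seconds
}

// revisionsOverlap reports whether the closed revision ranges of a and b
// share at least one revision.
func revisionsOverlap(a, b Snapshot) bool {
	return a.StartRevision <= b.LastRevision && b.StartRevision <= a.LastRevision
}

// conflicts reports whether a and b form the problematic pair described in
// this issue: snapshots of different kinds with the same timestamp and
// overlapping revisions.
func conflicts(a, b Snapshot) bool {
	return a.Kind != b.Kind && a.CreatedOn == b.CreatedOn && revisionsOverlap(a, b)
}

func main() {
	// Illustrative values only.
	full := Snapshot{Kind: "Full", StartRevision: 0, LastRevision: 120, CreatedOn: 1700000000}
	incr := Snapshot{Kind: "Incr", StartRevision: 101, LastRevision: 120, CreatedOn: 1700000000}
	later := Snapshot{Kind: "Incr", StartRevision: 121, LastRevision: 130, CreatedOn: 1700000300}

	fmt.Println(conflicts(full, incr))  // same timestamp, overlapping revisions
	fmt.Println(conflicts(full, later)) // later timestamp, disjoint revisions
}
```

A check of this kind could be run over a bucket listing to confirm whether a failed restore matches this issue; it does not prevent the race itself.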
Logs:
Here are logs from an occurrence of this issue in one of our testmachinery tests:
The snapshots that were present in the bucket were:
Sadly, logs from the migrate phase were no longer present.
Environment (please complete the following information):
Anything else we need to know?: