Slow refresh of informer cache results in delayed processing of Etcd resources #898
Labels: area/control-plane, kind/bug, status/accepted
How to categorize this issue?
/area control-plane
/kind bug
What happened:
A new Etcd resource is created. Since etcd-reconciler is watching for Etcd events, it gets a Create event. This event is allowed in. During the reconciliation loop an attempt is made to get the resource:
etcd-druid/internal/controller/etcd/reconciler.go
Lines 135 to 137 in df3ff21
It is possible that the informer caches are not yet updated, in which case client.Get returns a NotFound error. This results in the following:
etcd-druid/internal/controller/utils/reconciler.go
Lines 42 to 44 in df3ff21
The reconciler is short-circuited and no further processing is done.
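The race described above can be modeled without a cluster. The following is a minimal sketch, not etcd-druid code: staleCache, errNotFound, and reconcile are hypothetical stand-ins for the informer-backed client, the NotFound error, and the reconciler. It assumes, as this issue describes, that a NotFound on the initial get ends the reconcile without a requeue.

```go
package main

import (
	"errors"
	"fmt"
)

// errNotFound stands in for the NotFound error returned by a cache-backed client.
var errNotFound = errors.New("not found")

// staleCache is a hypothetical model of an informer cache: it only knows
// about objects that have already been synced into it.
type staleCache struct {
	synced map[string]bool
}

// get mimics a cached read: an object that is not yet synced looks missing.
func (c *staleCache) get(name string) error {
	if !c.synced[name] {
		return errNotFound
	}
	return nil
}

// reconcile models the short circuit described in this issue: on NotFound it
// returns without requeueing, so the Create event is effectively dropped.
func reconcile(c *staleCache, name string) (processed bool, requeue bool) {
	if err := c.get(name); err != nil {
		if errors.Is(err, errNotFound) {
			return false, false
		}
		return false, true
	}
	return true, false
}

func main() {
	cache := &staleCache{synced: map[string]bool{}}

	// The Create event arrives before the cache has caught up.
	processed, requeue := reconcile(cache, "etcd-main")
	fmt.Println(processed, requeue) // false false

	// Once the cache is synced the same call succeeds, but only if
	// some later event is allowed to trigger it.
	cache.synced["etcd-main"] = true
	processed, requeue = reconcile(cache, "etcd-main")
	fmt.Println(processed, requeue) // true false
}
```

The point of the model is that the first call neither processes the resource nor schedules a retry, so progress depends entirely on a later event passing the predicates.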
The default cache resync period is 10 hours, but in the case of gardener, it reconciles again, and with every reconcile it adds the following:
See here.
This generates another event much sooner than the default cache resync period of 10 hours, giving etcd-druid another chance to reconcile the resource. However, this event gets filtered out and is not processed. See:
etcd-druid/internal/controller/etcd/register.go
Lines 53 to 75 in df3ff21
r.hasReconcileAnnotation() is true since gardener adds the reconcile annotation.
specUpdated() is false as there is no change to the spec in this event.
lastReconcileHasFinished() is false since the event was not processed the first time around, so no status is present yet.
r.autoReconcileEnabled() is false as the resource is not auto-reconciled.
As a consequence, the onReconcileAnnotationSetPredicate predicate evaluates to false and the autoReconcileOnSpecChangePredicate predicate evaluates to false, so the event is rejected.
The result is that for a long time after the Etcd resource is created, it does not get reconciled. This is time-sensitive: it depends on how fast the informer cache is updated, how late the create event arrives, and whether the first create event gets processed.
What you expected to happen:
The predicate should be improved to allow subsequent update events even if the spec has not changed, especially when there is no status (indicating that the resource never got reconciled). For the gardener use case an update event will be received much sooner, but we also need to solve this for non-gardener use cases, where we depend on cache.SyncPeriod, which defaults to 10 hours.
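To make the rejected event and the requested change concrete, here is a minimal sketch using plain booleans instead of controller-runtime predicates. It is not the code from register.go: the event struct, the hasStatus field, and proposedPredicate are hypothetical, and the way currentPredicate combines the four checks is an assumption chosen to be consistent with the values listed in this issue.

```go
package main

import "fmt"

// event is a hypothetical flattening of the facts the predicates inspect.
type event struct {
	hasReconcileAnnotation   bool
	specUpdated              bool
	lastReconcileHasFinished bool
	autoReconcileEnabled     bool
	hasStatus                bool // assumption: false means "never reconciled"
}

// currentPredicate models an ASSUMED composition of the existing predicates;
// the real composition lives in register.go and is not reproduced here.
func currentPredicate(e event) bool {
	onReconcileAnnotationSet := e.hasReconcileAnnotation &&
		(e.lastReconcileHasFinished || e.specUpdated)
	autoReconcileOnSpecChange := e.autoReconcileEnabled && e.specUpdated
	return onReconcileAnnotationSet || autoReconcileOnSpecChange
}

// proposedPredicate is one possible reading of the request in this issue:
// additionally admit events for resources that have no status yet.
func proposedPredicate(e event) bool {
	neverReconciled := !e.hasStatus
	return currentPredicate(e) || neverReconciled
}

func main() {
	// The update event from this issue: annotation set, nothing else true.
	stuck := event{hasReconcileAnnotation: true}
	fmt.Println(currentPredicate(stuck))  // false
	fmt.Println(proposedPredicate(stuck)) // true
}
```

Keying the extra condition on the absence of status keeps the existing filtering for resources that have already been reconciled at least once. Whether "no status" is the right signal for "never reconciled" is a design choice left to the actual fix.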
How to reproduce it (as minimally and precisely as possible):
This cannot always be reproduced. Create multiple etcd clusters via a local gardener setup; for one or more of them you will see that the Etcd resource is not reconciled until a long time has passed.