Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Slow refresh of informer cache results in delayed processing of Etcd resources #898

Open
unmarshall opened this issue Oct 28, 2024 · 1 comment · Fixed by #900
Open

Slow refresh of informer cache results in delayed processing of Etcd resources #898

unmarshall opened this issue Oct 28, 2024 · 1 comment · Fixed by #900
Assignees
Labels
area/control-plane Control plane related kind/bug Bug status/accepted Issue was accepted as something we need to work on

Comments

@unmarshall
Copy link
Contributor

unmarshall commented Oct 28, 2024

How to categorize this issue?

/area control-plane
/kind bug

What happened:

A new Etcd resource is created. Since etcd-reconciler is watching for Etcd events, it gets a Create event. This even is allowed in. During the reconciliation loop an attempt is made to get the resource:

if result := ctrlutils.GetLatestEtcdPartialObjectMeta(ctx, r.client, etcdObjectKey, etcdPartialObjMetadata); ctrlutils.ShortCircuitReconcileFlow(result) {
return result
}

It is possible that the informer caches are not yet updated. client.Get returns NotFound error. This results in the following:
if apierrors.IsNotFound(err) {
return DoNotRequeue()
}

The reconciler is short circuited and the no further processing is done.

The default cache resync is 10hrs, but in case of gardener, it reconciles again and with every reconcile it adds the following:

metav1.SetMetaDataAnnotation(&e.etcd.ObjectMeta, v1beta1constants.GardenerOperation, v1beta1constants.GardenerOperationReconcile)
metav1.SetMetaDataAnnotation(&e.etcd.ObjectMeta, v1beta1constants.GardenerTimestamp, TimeNow().UTC().Format(time.RFC3339Nano))

See here.

This will generate another event much sooner than the default cache resync period of 10hrs giving etcd-druid another chance to reconcile the event. However this event gets filtered-out and is not processed. See:

func (r *Reconciler) buildPredicate() predicate.Predicate {
// If the reconcile annotation is set then only allow reconciliation if one of the conditions is true:
// 1. There has been a spec update.
// 2. The last reconcile operation has finished.
// It is possible that during the previous reconcile one of the steps errored out. This gets captured in etcd.Status.LastOperation.
// Update of status will generate an event. This event should not trigger a reconcile especially when the reconcile annotation has still
// not been removed (since the last reconcile is not yet successfully completed).
onReconcileAnnotationSetPredicate := predicate.And(
r.hasReconcileAnnotation(),
predicate.Or(lastReconcileHasFinished(), specUpdated()),
)
// If auto-reconcile has been enabled then it should allow reconciliation only on spec change.
autoReconcileOnSpecChangePredicate := predicate.And(
r.autoReconcileEnabled(),
specUpdated(),
)
return predicate.Or(
onReconcileAnnotationSetPredicate,
autoReconcileOnSpecChangePredicate,
)
}

  • r.hasReconcileAnnotation() is true since gardener adds the reconcile annotation.
  • specUpdated() is false as there is no change to the spec in this event.
  • lastReconcileHasFinished() is false since the first time around the event was not processed so no status is present yet.
  • r.autoReconcileEnabled() is false as its not auto reconciled.

As a consequence onReconcileAnnotationSetPredicate predicate will evaluate to false and autoReconcileOnSpecChangePredicate predicate will evaluate to false thus rejecting the event.

The result is that for a long time after the Etcd resource is created, it does not get reconciled. This is time sensitive and it all depends upon how fast the informer cache is updated or how late the create event arrives and if the first create event gets processed.

What you expected to happen:

The predicate should be improved to allow subsequence update events even if no spec has changed especially when there is no status (indicating that it never got reconciled). For gardener use case an update event will be received much sooner but we need to also solve this for non-gardener use cases where we are depending on cache.SyncPeriod which is by default set to 10hr.

How to reproduce it (as minimally and precisely as possible):

It is not always possible to recreated. Create multiple etcd clusters via local gardener and for one or more etcd clusters you will see that it does not get reconciled and only after a long time it gets reconciled.

  • etcd-druid version: v0.23.x

Thanks to @shafeeqes, we were able to find this reason for the following behavior of v0.23.x version of etcd-druid. We also saw this during g/g e2e tests runs but always thought as a flake test since in subsequent attempts the tests passed. It got masked because tests intermittently also failed due to non-etcd reasons making the e2e tests quite flaky.

@unmarshall
Copy link
Contributor Author

This issues was first observed in g/g e2e tests. See issue: gardener/gardener#10739

@unmarshall unmarshall assigned unmarshall and unassigned unmarshall Oct 28, 2024
unmarshall added a commit to unmarshall/etcd-druid that referenced this issue Oct 29, 2024
@unmarshall unmarshall self-assigned this Oct 29, 2024
unmarshall added a commit that referenced this issue Oct 29, 2024
@gardener-robot gardener-robot added the status/closed Issue is closed (either delivered or triaged) label Oct 29, 2024
anveshreddy18 pushed a commit to anveshreddy18/etcd-druid that referenced this issue Oct 29, 2024
unmarshall added a commit that referenced this issue Oct 29, 2024
@renormalize renormalize reopened this Oct 29, 2024
@gardener-robot gardener-robot added status/accepted Issue was accepted as something we need to work on and removed status/closed Issue is closed (either delivered or triaged) labels Oct 29, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/control-plane Control plane related kind/bug Bug status/accepted Issue was accepted as something we need to work on
Projects
None yet
3 participants