Reconciliation of Network for a hibernated Shoot takes more than 30s #191

Open
ialidzhikov opened this issue Jun 1, 2023 · 0 comments
Labels
area/networking Networking related kind/bug Bug lifecycle/rotten Nobody worked on this for 12 months (final aging stage)

Comments

@ialidzhikov
Member

How to categorize this issue?

/area networking
/kind bug

What happened:
Recently we had the following regression in gardener/gardener - ref gardener/gardener#8005. In short, several steps had timeouts that were too low. For example, the wait step for the Network to be reconciled had a timeout of 30s.
While investigating that issue, we noticed that for hibernated cilium Shoots the Network cannot be reconciled within 30s.

Explanation:

  1. In Reconcile func, networking-cilium calls getIPAMMode:

    ipamMode, err := getIPAMMode(ctx, a.client, cluster)
    if err != nil {
    	return err
    }

  2. Then getCiliumConfigMap func is being called:

    configmap, err := getCiliumConfigMap(ctx, cl, cluster)
    if err != nil {
    	return "", err
    }

  3. getCiliumConfigMap tries to build a Shoot client and get a ConfigMap from the Shoot. When there is no kube-apiserver Pod, this operation will time out:

    func getCiliumConfigMap(ctx context.Context, cl client.Client, cluster *extensionscontroller.Cluster) (*corev1.ConfigMap, error) {
    	_, shootClient, err := util.NewClientForShoot(ctx, cl, cluster.ObjectMeta.Name, client.Options{}, extensionsconfig.RESTOptions{})
    	if err != nil {
    		return nil, fmt.Errorf("could not create shoot client: %w", err)
    	}
    	configmap := &corev1.ConfigMap{}
    	_ = shootClient.Get(ctx, client.ObjectKey{Namespace: "kube-system", Name: "cilium-config"}, configmap)
    	return configmap, nil
    }

I locally measured the execution time of the getCiliumConfigMap func with this diff:

diff --git a/pkg/controller/actuator_reconcile.go b/pkg/controller/actuator_reconcile.go
index 9bd3214c..e5c1210d 100644
--- a/pkg/controller/actuator_reconcile.go
+++ b/pkg/controller/actuator_reconcile.go
@@ -17,6 +17,7 @@ package controller
 import (
 	"context"
 	"fmt"
+	"time"
 
 	extensionsconfig "github.com/gardener/gardener/extensions/pkg/apis/config"
 	extensionscontroller "github.com/gardener/gardener/extensions/pkg/controller"
@@ -188,10 +189,14 @@ func getCiliumConfigMap(ctx context.Context, cl client.Client, cluster *extensio
 }
 
 func getIPAMMode(ctx context.Context, cl client.Client, cluster *extensionscontroller.Cluster) (string, error) {
+	start := time.Now()
 	configmap, err := getCiliumConfigMap(ctx, cl, cluster)
 	if err != nil {
 		return "", err
 	}
+	elapsed := time.Since(start)
+	fmt.Printf("elapsed time in getCiliumConfigMap - %+v\n", elapsed)
+
 	if configmap != nil {
 		if ipamMode, ok := configmap.Data["ipam"]; ok {
 			return ipamMode, nil

Logs for non-hibernated Shoot:

{"level":"info","ts":"2023-06-01T08:05:19.365Z","msg":"Starting the reconciliation of network","controller":"network","object":{"name":"aws-local4","namespace":"shoot--i331370--aws-local4"},"namespace":"shoot--i331370--aws-local4","name":"aws-local4","reconcileID":"2e8cbabd-4b35-46dd-9ded-12295ed08774"}
elapsed time in getCiliumConfigMap - 168.544861ms
{"level":"info","ts":"2023-06-01T08:05:19.783Z","msg":"Successfully reconciled Network","controller":"network","object":{"name":"aws-local4","namespace":"shoot--i331370--aws-local4"},"namespace":"shoot--i331370--aws-local4","name":"aws-local4","reconcileID":"2e8cbabd-4b35-46dd-9ded-12295ed08774"}

Logs for hibernated Shoot:

{"level":"info","ts":"2023-06-01T08:52:05.651Z","msg":"Starting the reconciliation of network","controller":"network","object":{"name":"aws-local4","namespace":"shoot--i331370--aws-local4"},"namespace":"shoot--i331370--aws-local4","name":"aws-local4","reconcileID":"86a8726a-0bad-4a1b-8837-109cabbb225b"}
elapsed time in getCiliumConfigMap - 30.002591085s
{"level":"info","ts":"2023-06-01T08:52:35.785Z","msg":"Successfully reconciled Network","controller":"network","object":{"name":"aws-local4","namespace":"shoot--i331370--aws-local4"},"namespace":"shoot--i331370--aws-local4","name":"aws-local4","reconcileID":"86a8726a-0bad-4a1b-8837-109cabbb225b"}

You can see that the getCiliumConfigMap func itself takes more than 30s when the Shoot is hibernated (no kube-apiserver Pod running).

What you expected to happen:
The Network reconciliation to be fast when the Shoot is hibernated.

How to reproduce it (as minimally and precisely as possible):
See above.

Anything else we need to know?:

Environment:

  • Gardener version (if relevant):
  • Extension version: a2715a5
  • Kubernetes version (use kubectl version): Shoot K8s version is v1.25.9
  • Cloud provider or hardware configuration:
  • Others:
@gardener-robot gardener-robot added area/networking Networking related kind/bug Bug labels Jun 1, 2023
@gardener-robot gardener-robot added the lifecycle/stale Nobody worked on this for 6 months (will further age) label Feb 8, 2024
@gardener-robot gardener-robot added lifecycle/rotten Nobody worked on this for 12 months (final aging stage) and removed lifecycle/stale Nobody worked on this for 6 months (will further age) labels Oct 18, 2024