Upgrade doesn't work on EKS #2197

Open
brb opened this issue Dec 19, 2023 · 1 comment
Comments


brb commented Dec 19, 2023

Installed Cilium with CLI:

 cilium install  --chart-directory=./install/kubernetes/cilium/ --helm-set=debug.enabled=true --helm-set=bpf.monitorAggregation=none --helm-set=image.repository=quay.io/cilium/cilium-ci --helm-set=image.useDigest=false --helm-set=image.tag=v1.15 --helm-set=operator.image.repository=quay.io/cilium/operator --helm-set=operator.image.suffix=-ci --helm-set=operator.image.tag=v1.15 --helm-set=operator.image.useDigest=false --helm-set=clustermesh.apiserver.image.repository=quay.io/cilium/clustermesh-apiserver-ci --helm-set=clustermesh.apiserver.image.tag=v1.15 --helm-set=clustermesh.apiserver.image.useDigest=false --helm-set=hubble.relay.image.repository=quay.io/cilium/hubble-relay-ci --helm-set=hubble.relay.image.tag=v1.15 --helm-set=hubble.relay.image.useDigest=false --cluster-name=cilium-cilium-7249674943-1 --helm-set=hubble.relay.enabled=true --helm-set loadBalancer.l7.backend=envoy --helm-set tls.secretsBackend=k8s --helm-set=bpf.monitorAggregation=none --wait=false

It auto-detected that installing on EKS:

🔮 Auto-detected Kubernetes kind: EKS
ℹ️  Using Cilium version 1.16.0
ℹ️  Using cluster name "cilium-cilium-7249674943-1"
🔮 Auto-detected kube-proxy has been installed

Then decided to upgrade with:

 cilium upgrade  --chart-directory=./install/kubernetes/cilium --helm-set=debug.enabled=true --helm-set=bpf.monitorAggregation=none --helm-set=image.repository=quay.io/cilium/cilium-ci --helm-set=image.useDigest=false --helm-set=image.tag=74c8b3fae3f23111561f7868b07e137a54eb721c --helm-set=operator.image.repository=quay.io/cilium/operator --helm-set=operator.image.suffix=-ci --helm-set=operator.image.tag=74c8b3fae3f23111561f7868b07e137a54eb721c --helm-set=operator.image.useDigest=false --helm-set=clustermesh.apiserver.image.repository=quay.io/cilium/clustermesh-apiserver-ci --helm-set=clustermesh.apiserver.image.tag=74c8b3fae3f23111561f7868b07e137a54eb721c --helm-set=clustermesh.apiserver.image.useDigest=false --helm-set=hubble.relay.image.repository=quay.io/cilium/hubble-relay-ci --helm-set=hubble.relay.image.tag=74c8b3fae3f23111561f7868b07e137a54eb721c --helm-set=hubble.relay.image.useDigest=false --cluster-name=cilium-cilium-7249674943-1 --helm-set=hubble.relay.enabled=true --helm-set loadBalancer.l7.backend=envoy --helm-set tls.secretsBackend=k8s --helm-set=bpf.monitorAggregation=none
... 
Flag --cluster-name has been deprecated, This can now be overridden via `--set` (Helm value: `cluster.name`).
🔮 Auto-detected Kubernetes kind: EKS
ℹ️  Using Cilium version 1.16.0
ℹ️  Using cluster name "cilium-cilium-7249674943-1"
🔮 Auto-detected kube-proxy has been installed

However, after the upgrade, the ipam / endpoint-routes / egress-masquerade-interfaces / routing-mode / etc. values were reset to their defaults (i.e., cluster-pool, disabled, nil, tunnel, etc.), which broke the cluster.
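For reference, one way to confirm what the agents actually ended up running with is to inspect the rendered cilium-config ConfigMap; this is only a sketch, and the key names grepped for below are assumptions based on the options mentioned above, so adjust them to your chart version:

  # Show the EKS-relevant keys in the rendered agent config
  kubectl -n kube-system get configmap cilium-config -o yaml \
    | grep -E 'ipam|routing-mode|tunnel|egress-masquerade-interfaces|endpoint-routes'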

brb added the kind/bug label on Dec 19, 2023
brb changed the title from "Upgrade doesn't work properly on EKS" to "Upgrade doesn't work on EKS" on Dec 19, 2023

dqsully commented Dec 21, 2023

I just had this happen to me in production yesterday, breaking everything for 1h30m until I finished switching back to AWS VPC CNI. I even tried to revert the changes to the cilium-config ConfigMap, but they kept getting overwritten.

When I list the Cilium Helm values for the affected production k8s cluster (using helm status cilium -n kube-system -o json | jq .config), I get:

{
  "affinity": {
    "nodeAffinity": {
      "requiredDuringSchedulingIgnoredDuringExecution": {
        "nodeSelectorTerms": [
          {
            "matchExpressions": [
              {
                "key": "io.cilium/aws-node-enabled",
                "operator": "NotIn",
                "values": [
                  "true"
                ]
              }
            ]
          }
        ]
      }
    }
  },
  "eni": {
    "awsEnablePrefixDelegation": true
  },
  "updateStrategy": {
    "type": "OnDelete"
  }
}

But when I list the values for an unaffected staging cluster created with nearly identical Helm values, I get:

{
  "affinity": {
    "nodeAffinity": {
      "requiredDuringSchedulingIgnoredDuringExecution": {
        "nodeSelectorTerms": [
          {
            "matchExpressions": [
              {
                "key": "io.cilium/aws-node-enabled",
                "operator": "NotIn",
                "values": [
                  "true"
                ]
              }
            ]
          }
        ]
      }
    }
  },
  "cluster": {
    "name": "<cluster name>"
  },
  "egressMasqueradeInterfaces": "eth0",
  "eni": {
    "awsEnablePrefixDelegation": true,
    "enabled": true
  },
  "hubble": {
    "relay": {
      "enabled": true
    },
    "ui": {
      "enabled": true
    }
  },
  "ipam": {
    "mode": "eni"
  },
  "operator": {
    "replicas": 1
  },
  "routingMode": "native",
  "serviceAccounts": {
    "cilium": {
      "name": "cilium"
    },
    "operator": {
      "name": "cilium-operator"
    }
  }
}

In both clusters, I've run the commands:

  • cilium install -f values.yaml --version 1.14.4
  • cilium upgrade --version 1.14.5

But this is the command that broke my production cluster while I was trying to enable LocalRedirectPolicy:

  • cilium upgrade -f values.yaml --version 1.14.5

I ran this a few different times with slightly different settings, which resulted in the addition of updateStrategy.type: OnDelete in my production cluster, but the actual changes in my values.yaml didn't seem to be the issue at all.

Just for the sake of completeness, here's the final version of that values.yaml that I tried to apply to my production cluster:

eni:
  awsEnablePrefixDelegation: true
affinity:
  nodeAffinity: # added to prevent conflicts with aws-node
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: io.cilium/aws-node-enabled
          operator: NotIn
          values:
          - 'true'
# localRedirectPolicy: true
# rollOutCiliumPods: true
updateStrategy:
  type: OnDelete

I've never set most of the Helm values that ended up in the staging cluster; they were applied automatically by cilium install -f values.yaml. But for some reason, cilium upgrade -f values.yaml doesn't apply those same default Helm values, at least on EKS.
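As a stopgap until this is fixed, one possible approach (just a sketch, assuming the default release name and namespace used by cilium install, and a hypothetical merged-values.yaml you build by hand) is to export the values Helm already tracks for the release and re-apply everything explicitly on upgrade, so the EKS-specific defaults can't be silently dropped:

  # Dump the values Helm currently tracks for the cilium release
  helm get values cilium -n kube-system -o yaml > current-values.yaml

  # Merge current-values.yaml with your own values.yaml by hand into merged-values.yaml,
  # then pass the combined file so the upgrade carries every value explicitly
  cilium upgrade -f merged-values.yaml --version 1.14.5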
