Upgrade doesn't work on EKS #2197

Open
brb opened this issue Dec 19, 2023 · 1 comment
Comments


brb commented Dec 19, 2023

Installed Cilium with CLI:

 cilium install  --chart-directory=./install/kubernetes/cilium/ --helm-set=debug.enabled=true --helm-set=bpf.monitorAggregation=none --helm-set=image.repository=quay.io/cilium/cilium-ci --helm-set=image.useDigest=false --helm-set=image.tag=v1.15 --helm-set=operator.image.repository=quay.io/cilium/operator --helm-set=operator.image.suffix=-ci --helm-set=operator.image.tag=v1.15 --helm-set=operator.image.useDigest=false --helm-set=clustermesh.apiserver.image.repository=quay.io/cilium/clustermesh-apiserver-ci --helm-set=clustermesh.apiserver.image.tag=v1.15 --helm-set=clustermesh.apiserver.image.useDigest=false --helm-set=hubble.relay.image.repository=quay.io/cilium/hubble-relay-ci --helm-set=hubble.relay.image.tag=v1.15 --helm-set=hubble.relay.image.useDigest=false --cluster-name=cilium-cilium-7249674943-1 --helm-set=hubble.relay.enabled=true --helm-set loadBalancer.l7.backend=envoy --helm-set tls.secretsBackend=k8s --helm-set=bpf.monitorAggregation=none --wait=false

It auto-detected that installing on EKS:

🔮 Auto-detected Kubernetes kind: EKS
ℹ️  Using Cilium version 1.16.0
ℹ️  Using cluster name "cilium-cilium-7249674943-1"
🔮 Auto-detected kube-proxy has been installed

Then decided to upgrade with:

 cilium upgrade  --chart-directory=./install/kubernetes/cilium --helm-set=debug.enabled=true --helm-set=bpf.monitorAggregation=none --helm-set=image.repository=quay.io/cilium/cilium-ci --helm-set=image.useDigest=false --helm-set=image.tag=74c8b3fae3f23111561f7868b07e137a54eb721c --helm-set=operator.image.repository=quay.io/cilium/operator --helm-set=operator.image.suffix=-ci --helm-set=operator.image.tag=74c8b3fae3f23111561f7868b07e137a54eb721c --helm-set=operator.image.useDigest=false --helm-set=clustermesh.apiserver.image.repository=quay.io/cilium/clustermesh-apiserver-ci --helm-set=clustermesh.apiserver.image.tag=74c8b3fae3f23111561f7868b07e137a54eb721c --helm-set=clustermesh.apiserver.image.useDigest=false --helm-set=hubble.relay.image.repository=quay.io/cilium/hubble-relay-ci --helm-set=hubble.relay.image.tag=74c8b3fae3f23111561f7868b07e137a54eb721c --helm-set=hubble.relay.image.useDigest=false --cluster-name=cilium-cilium-7249674943-1 --helm-set=hubble.relay.enabled=true --helm-set loadBalancer.l7.backend=envoy --helm-set tls.secretsBackend=k8s --helm-set=bpf.monitorAggregation=none
... 
Flag --cluster-name has been deprecated, This can now be overridden via `--set` (Helm value: `cluster.name`).
🔮 Auto-detected Kubernetes kind: EKS
ℹ️  Using Cilium version 1.16.0
ℹ️  Using cluster name "cilium-cilium-7249674943-1"
🔮 Auto-detected kube-proxy has been installed

However, after the upgrade, the ipam / endpoint-routes / egress-masquerade-interfaces / routing-mode / etc. values were reset to their defaults (i.e., cluster-pool, disabled, nil, tunnel, etc.), which broke the cluster.
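For reference, one way to confirm what the agents actually ended up running with is to inspect the rendered cilium-config ConfigMap; this is only a sketch, and the key names grepped for below are assumptions based on the options mentioned above, so adjust them to your chart version:

  # Show the EKS-relevant keys in the rendered agent config
  kubectl -n kube-system get configmap cilium-config -o yaml \
    | grep -E 'ipam|routing-mode|tunnel|egress-masquerade-interfaces|endpoint-routes'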

brb added the kind/bug label on Dec 19, 2023
brb changed the title from "Upgrade doesn't work properly on EKS" to "Upgrade doesn't work on EKS" on Dec 19, 2023

dqsully commented Dec 21, 2023

I just had this happen to me in production yesterday, breaking everything for 1h30m until I finished switching back to AWS VPC CNI. I even tried to revert the changes to the cilium-config ConfigMap, but they kept getting overwritten.

When I list the Cilium Helm values for the affected production k8s cluster (using helm status cilium -n kube-system -o json | jq .config), I get:

{
  "affinity": {
    "nodeAffinity": {
      "requiredDuringSchedulingIgnoredDuringExecution": {
        "nodeSelectorTerms": [
          {
            "matchExpressions": [
              {
                "key": "io.cilium/aws-node-enabled",
                "operator": "NotIn",
                "values": [
                  "true"
                ]
              }
            ]
          }
        ]
      }
    }
  },
  "eni": {
    "awsEnablePrefixDelegation": true
  },
  "updateStrategy": {
    "type": "OnDelete"
  }
}

But when I list the values for an unaffected staging cluster created with nearly identical Helm values, I get:

{
  "affinity": {
    "nodeAffinity": {
      "requiredDuringSchedulingIgnoredDuringExecution": {
        "nodeSelectorTerms": [
          {
            "matchExpressions": [
              {
                "key": "io.cilium/aws-node-enabled",
                "operator": "NotIn",
                "values": [
                  "true"
                ]
              }
            ]
          }
        ]
      }
    }
  },
  "cluster": {
    "name": "<cluster name>"
  },
  "egressMasqueradeInterfaces": "eth0",
  "eni": {
    "awsEnablePrefixDelegation": true,
    "enabled": true
  },
  "hubble": {
    "relay": {
      "enabled": true
    },
    "ui": {
      "enabled": true
    }
  },
  "ipam": {
    "mode": "eni"
  },
  "operator": {
    "replicas": 1
  },
  "routingMode": "native",
  "serviceAccounts": {
    "cilium": {
      "name": "cilium"
    },
    "operator": {
      "name": "cilium-operator"
    }
  }
}

In both clusters, I've run the commands:

  • cilium install -f values.yaml --version 1.14.4
  • cilium upgrade --version 1.14.5

But this is the command that broke my production cluster while I was trying to enable LocalRedirectPolicy:

  • cilium upgrade -f values.yaml --version 1.14.5

I ran this a few different times with slightly different settings, which resulted in the addition of updateStrategy.type: OnDelete in my production cluster, but the actual changes in my values.yaml didn't seem to be the issue at all.

Just for the sake of completeness, here's the final version of that values.yaml that I tried to apply to my production cluster:

eni:
  awsEnablePrefixDelegation: true
affinity:
  nodeAffinity: # added to prevent conflicts with aws-node
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: io.cilium/aws-node-enabled
          operator: NotIn
          values:
          - 'true'
# localRedirectPolicy: true
# rollOutCiliumPods: true
updateStrategy:
  type: OnDelete

I've never set most of the Helm values that ended up in the staging cluster; they were applied automatically by cilium install -f values.yaml. But for some reason, cilium upgrade -f values.yaml doesn't apply those same default Helm values, at least on EKS.
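As a stopgap until this is fixed, one possible approach (just a sketch, assuming the default release name and namespace used by cilium install, and a hypothetical merged-values.yaml you build by hand) is to export the values Helm already tracks for the release and re-apply everything explicitly on upgrade, so the EKS-specific defaults can't be silently dropped:

  # Dump the values Helm currently tracks for the cilium release
  helm get values cilium -n kube-system -o yaml > current-values.yaml

  # Merge current-values.yaml with your own values.yaml by hand into merged-values.yaml,
  # then pass the combined file so the upgrade carries every value explicitly
  cilium upgrade -f merged-values.yaml --version 1.14.5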
