CoreDNS timeout on vSphere cluster when resolving a service #8144

Open
ygao-armada opened this issue May 11, 2024 · 2 comments


ygao-armada commented May 11, 2024

What happened:
In an EKS Anywhere (EKSA) cluster on vSphere, we see a strange error: on a worker node, if we replace /etc/resolv.conf with the one from the pod argocd-server-xxx:

search argocd.svc.cluster.local svc.cluster.local cluster.local
nameserver 10.96.192.10
options ndots:5
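
With options ndots:5, a name like argocd-redis has fewer than 5 dots, so the resolver walks the search list (argocd.svc.cluster.local, svc.cluster.local, cluster.local) and may also try the literal name. A quick way to rule search-list expansion in or out (a sketch, assuming the same resolv.conf is in place) is to query the fully qualified name with a trailing dot, which skips the search domains entirely:

nslookup argocd-redis.argocd.svc.cluster.local. 10.96.192.10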

The nslookup command resolves the IP (10.96.221.1) first, then waits about 10 seconds until it times out:

root@mgmt20-md-0-7k7hk-vcnh2:/home/ec2-user# nslookup argocd-redis
Server:     10.96.192.10
Address:    10.96.192.10#53

Name:  argocd-redis.argocd.svc.cluster.local
Address: 10.96.221.1
;; connection timed out; no servers could be reached


root@mgmt20-md-0-7k7hk-vcnh2:/home/ec2-user# exit
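
This "answer first, then timeout" pattern usually means the first query succeeds and a follow-up query (for example the AAAA lookup, or a retry against another search domain) gets no reply. A sketch for narrowing that down, assuming dig is available on the node:

# Query A and AAAA separately against CoreDNS; if A answers but AAAA hangs,
# the timeout comes from the follow-up query, not the name itself
dig @10.96.192.10 argocd-redis.argocd.svc.cluster.local A +time=2 +tries=1
dig @10.96.192.10 argocd-redis.argocd.svc.cluster.local AAAA +time=2 +tries=1
# Repeat over TCP to rule out dropped UDP responses on the node
dig @10.96.192.10 argocd-redis.argocd.svc.cluster.local A +tcp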

We can see the IP (10.96.221.1) is correct as follows:

ubuntu@ubuntuguest:~$ kubectl get svc -A -o wide | grep 10.96.221.1
argocd               argocd-redis                         ClusterIP  10.96.221.1   <none>    6379/TCP            135m  app.kubernetes.io/name=argocd-redis

And 10.96.192.10 is the coredns IP:

ubuntu@ubuntuguest:~$ kubectl get svc -A -o wide | grep 10.96.192.10
kube-system             kube-dns                           ClusterIP  10.96.192.10  <none>    53/UDP,53/TCP,9153/TCP     103d  k8s-app=kube-dns
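
Assuming kubectl access to the cluster, the CoreDNS pods behind 10.96.192.10 can also be checked directly (a sketch; the k8s-app=kube-dns label matches the service selector shown above):

kubectl -n kube-system get endpoints kube-dns
kubectl -n kube-system get pods -l k8s-app=kube-dns -o wide
kubectl -n kube-system logs -l k8s-app=kube-dns --tail=50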

Am I missing something?

What you expected to happen:
No timeout should happen for the command "nslookup argocd-redis".

How to reproduce it (as minimally and precisely as possible):
Install Argo CD on an EKSA vSphere cluster and follow the steps in the description above.

Anything else we need to know?:

Environment:

  • EKS Anywhere Release:
  • EKS Distro Release:

sp1999 commented May 12, 2024

Thanks for reporting @ygao-armada. We are looking into this issue and will get back with any information we find.

ygao-armada (Author) commented

@sp1999 Some updates: I found it is related to gpu-operator. It looks like there is no such issue if we install Argo CD before gpu-operator.
I install Argo CD with:

kubectl create namespace argocd
kubectl apply -n argocd -f https://raw.githubusercontent.com/argoproj/argo-cd/stable/manifests/install.yaml

And I install gpu-operator following the instructions from: https://github.com/NVIDIA/gpu-operator/blob/release-23.9/scripts/install-gpu-operator-nvaie.sh
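
One way to narrow down what gpu-operator changes, assuming the timeout only appears after it is installed, is to compare the CoreDNS configuration and the node-level connection tracking before and after the install (a diagnostic sketch, not a confirmed root cause):

# CoreDNS config and pod state after installing gpu-operator
kubectl -n kube-system get configmap coredns -o yaml
kubectl -n kube-system get pods -l k8s-app=kube-dns
# On the affected worker node: failed conntrack inserts are a common cause
# of intermittent in-cluster DNS timeouts (requires the conntrack tool)
conntrack -S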
