
EOF Error from AWS api while validating cluster which was in running state #16548

Closed
teocrispy91 opened this issue May 9, 2024 · 18 comments

@teocrispy91

We have a kOps cluster on version 1.15.2 and everything was working fine until I did a helm upgrade of a deployment in one of the namespaces in the cluster. After that I can't run kubectl commands; when I run one it shows "unable to connect to the server: EOF". I also have a Kubernetes dashboard hosted at example.com/dashboard, and that page is showing a 502 nginx error. When I checked the ELB in AWS, it shows the instance as OutOfService, but my master node is running. Since we are not able to connect to the cluster, we couldn't identify the issue.

When I run kops validate cluster I get the error below:
unexpected error during validation: error listing nodes: Get https://MY_LOAD_BALANCER_DNS_NAME.us-west-2.elb.amazonaws.com/api/v1/nodes: EOF
(with MY_LOAD_BALANCER_DNS_NAME replaced by the value under the "DNS name" field in the AWS console)

I can also still browse the applications hosted in the kops cluster, so I'm not sure if the apiserver is down or there is some issue with the master.

It would be a great help if someone could look into this.

@hakman
Member

hakman commented May 10, 2024

@teocrispy91 Could you share why you are using kOps v1.15.2, which is 4-5 years old, when creating new clusters?
Please try to go through https://kops.sigs.k8s.io/operations/troubleshoot/. It should help you understand where the problem comes from.

@teocrispy91
Author

@hakman The cluster was created 4 years ago and has been running without many problems.

@hakman
Member

hakman commented May 10, 2024

@hakman The cluster was created 4 years ago and has been running without many problems.

The title says "while validating new cluster" 😄

@teocrispy91 teocrispy91 changed the title from "EOF Error from AWS api while validating new cluster" to "EOF Error from AWS api while validating cluster which was in running state" on May 10, 2024
@teocrispy91
Author

@hakman Sorry for the typo, I have edited it.

@hakman
Member

hakman commented May 10, 2024

No worries. The suggestion still stands: you need to look at the logs on the master nodes.
Generally speaking, certs expire. Nodes have to be rotated once in a while, at the very least.
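
As a rough first check on the master, openssl can show whether the serving certs have expired. This is a sketch; the on-disk cert path is an assumption for a kOps cluster of this era, not something confirmed in this issue:

echo | openssl s_client -connect localhost:443 2>/dev/null | openssl x509 -noout -enddate   # expiry of the cert the API server presents
sudo openssl x509 -enddate -noout -in /srv/kubernetes/server.crt   # assumed on-disk path for the serving cert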

@teocrispy91
Author

teocrispy91 commented May 10, 2024

@hakman Since I am new to kOps, will a restart of the master node cause any issues? Also, are you talking about the API server cert?

@hakman
Member

hakman commented May 10, 2024

I don't think that restarting the master node will do any damage, but it probably will not help much either.
Unless you SSH to the node and look for the issue in the logs, this is just guesswork.
You should read the troubleshooting guide and check what happened.
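
For a control plane of this vintage (Docker-based static pods), the usual places to look are roughly these; the grep pattern is an assumption about the container names, and the container id is a placeholder:

sudo journalctl -u kubelet --since "1 hour ago"      # kubelet service logs
sudo docker ps -a | grep -E 'apiserver|etcd'         # state of the control-plane containers
sudo docker logs <container-id> 2>&1 | tail -n 50    # last lines of a failing container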

@teocrispy91
Author

teocrispy91 commented May 16, 2024

@hakman I just logged into my master node, and when running kubectl get ns or kubectl get pods it shows "the connection to the server localhost:8080 was refused". When I run netstat I can see that neither 443 nor 8080 is open on my master node; could it be because of that? When running docker logs I can see my api-server pod restarting and going to the Exited state continuously.

This is a log line I can see inside the api-server pod. I have checked the certs; they are valid:

W0515 13:17:42.270070 1 clientconn.go:1251] grpc: addrConn.createTransport failed to connect to {127.0.0.1:4001 0 }. Err :connection error: desc = "transport: authentication handshake failed: x509: certificate has expired or is not yet valid". Reconnecting...
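
That message is the API server failing the TLS handshake with etcd on 127.0.0.1:4001, the etcd client port shown in the log. A quick sketch, run on the master, to see which certificate etcd actually presents there and when it expires (the server cert is sent during the handshake even if client auth would fail later):

echo | openssl s_client -connect 127.0.0.1:4001 2>/dev/null | openssl x509 -noout -subject -enddate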

@teocrispy91 teocrispy91 closed this as not planned on May 20, 2024
@hakman
Member

hakman commented May 20, 2024

Most likely the etcd certs expired and the API server cannot connect to it anymore.
This might have helped: https://github.com/kubernetes/kops/blob/master/docs/advisories/etcd-manager-certificate-expiration.md.

@teocrispy91
Author

teocrispy91 commented May 20, 2024

@hakman But the etcd container seems to be running; would it run if the cert had expired?

When I run the command below, I can see validity up to March 28th, 2024. But the image version seems to be kopeio/etcd-manager:3.0.20200429, which is newer than the one mentioned in the advisory.

find /mnt/ -type f -name me.crt -print -exec openssl x509 -enddate -noout -in {} \;

@hakman
Member

hakman commented May 20, 2024

Seems so, but you have to do rolling updates on the cluster from time to time.
There is no mechanism that deals with cert rotation automatically.

@teocrispy91
Author

When I ran the command find /mnt/ -type f -name me.crt -print -exec openssl x509 -enddate -noout -in {} \; I could see that the certs have expired. This is the result I get:

find /mnt/ -type f -name me.crt -print -exec openssl x509 -enddate -noout -in {} \;
/mnt/master-vol-01399aaec42e241cd/pki/etcd-cluster-token-etcd-events/peers/me.crt
notAfter=Mar 28 06:06:18 2024 GMT
/mnt/master-vol-0e73c716447126a30/pki/etcd-cluster-token-etcd/peers/me.crt
notAfter=Mar 28 06:07:30 2024 GMT

So how can I renew these? What would be the next steps?

@hakman
Member

hakman commented May 20, 2024

This may work:

kops rolling-update cluster --instance-group-roles=Master --force --cloudonly
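
For reference, a sketch of how this would typically be run from a workstation; the state store bucket and cluster name below are placeholders, not values from this issue:

export KOPS_STATE_STORE=s3://my-kops-state-store   # placeholder bucket
export KOPS_CLUSTER_NAME=my-cluster.example.com    # placeholder cluster name

# Preview first, then add --yes to actually replace the master(s):
kops rolling-update cluster --instance-group-roles=Master --force --cloudonly
kops rolling-update cluster --instance-group-roles=Master --force --cloudonly --yes

# Once the new master is up, check that the cluster comes back:
kops validate cluster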

@teocrispy91
Author

@hakman This will recreate a new master node, right? It doesn't upgrade the cluster? Also, where do I need to run this from? If it's on the master, I doubt the kops command will work.

@hakman
Member

hakman commented May 21, 2024

@hakman This will recreate a new master node, right? It doesn't upgrade the cluster? Also, where do I need to run this from? If it's on the master, I doubt the kops command will work.

You need to run it from your computer that has admin permissions on the AWS account that hosts the cluster, using the kOps v1.15.2 binary. It will destroy and re-create the master.
Similarly, you can terminate the master instance and it will be re-created.

@teocrispy91
Author

@hakman So terminating the control-plane EC2 instance means it will be recreated automatically by the Auto Scaling group, right?

@hakman
Member

hakman commented May 21, 2024

@hakman So terminating the control-plane EC2 instance means it will be recreated automatically by the Auto Scaling group, right?

yes

@teocrispy91
Author

@hakman Thanks a ton. After running the command you mentioned, the cluster seems to be up now. Also, in kube-system my aws-iam-authenticator pod is in ImagePullBackOff (do I need to update it to a newer image?) and the metrics pod is in CrashLoopBackOff; any idea why?
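
(For reference, the standard first checks for pods in those states would be something like the following; the pod names are placeholders:)

kubectl -n kube-system describe pod <aws-iam-authenticator-pod>   # Events section shows why the image pull fails
kubectl -n kube-system logs <metrics-pod> --previous              # logs from the last crashed container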
