
cluster-autoscaler has a panic in NodeDeletionBatcher.AddNode #5891

Closed
com6056 opened this issue Jun 26, 2023 · 4 comments
Labels
area/cluster-autoscaler kind/bug

Comments

com6056 (Contributor) commented Jun 26, 2023

Which component are you using?:
cluster-autoscaler

What version of the component are you using?:

Component version: v1.27.1

What k8s version are you using (kubectl version)?:

kubectl version Output
$ kubectl version
Client Version: version.Info{Major:"1", Minor:"23", GitVersion:"v1.23.1", GitCommit:"86ec240af8cbd1b60bcc4c03c20da9b98005b92e", GitTreeState:"clean", BuildDate:"2021-12-16T11:41:01Z", GoVersion:"go1.17.5", Compiler:"gc", Platform:"darwin/arm64"}
Server Version: version.Info{Major:"1", Minor:"23", GitVersion:"v1.23.4", GitCommit:"e6c093d87ea4cbb530a7b2ae91e54c0842d8308a", GitTreeState:"clean", BuildDate:"2022-02-16T12:32:02Z", GoVersion:"go1.17.7", Compiler:"gc", Platform:"linux/amd64"}

What environment is this in?:
AWS via cluster-api, cluster-api-provider-aws (MachinePool), and the aws cluster-autoscaler provider

What did you expect to happen?:
No panic

What happened instead?:
Panic:

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x58 pc=0x46d3b08]

goroutine 3346 [running]:
k8s.io/autoscaler/cluster-autoscaler/core/scaledown/actuation.(*NodeDeletionBatcher).AddNode(0xc0074eb900, 0xc0047c7340, 0xaf?)
	/gopath/src/k8s.io/autoscaler/cluster-autoscaler/core/scaledown/actuation/delete_in_batch.go:77 +0xa8
k8s.io/autoscaler/cluster-autoscaler/core/scaledown/actuation.(*Actuator).scheduleDeletion(0xc0061a4320, 0xc000a247d0?, {0xc007493a70, 0xf}, 0x88?)
	/gopath/src/k8s.io/autoscaler/cluster-autoscaler/core/scaledown/actuation/actuator.go:363 +0x85
created by k8s.io/autoscaler/cluster-autoscaler/core/scaledown/actuation.(*Actuator).deleteNodesAsync
	/gopath/src/k8s.io/autoscaler/cluster-autoscaler/core/scaledown/actuation/actuator.go:296 +0xc95

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know?:
We have 2 different cluster-autoscaler instances running: one to manage the MachineDeployment resources using the cluster-api provider and one to manage the MachinePool resources using the aws provider. This is a workaround until kubernetes-sigs/cluster-api-provider-aws#4184 is complete and the cluster-api provider can be used for MachinePool resources.

It looks like it might be due to calling nodeGroup.Id() in

CleanUpAndRecordFailedScaleDownEvent(d.ctx, node, nodeGroup.Id(), drain, d.nodeDeletionTracker, "", result)
when nodeGroup might be nil, but I haven't confirmed.
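
For illustration, here is a minimal sketch of the kind of nil guard that would avoid the dereference (a hypothetical sketch, not the actual patch; variable names are borrowed from the call above, and the real code around delete_in_batch.go:77 may differ):

```go
// Hypothetical sketch, not the actual patch: guard the error path so Id()
// is never called on a nil node group. nodeGroup, node, drain, result and d
// are assumed to exist as in the call shown above.
nodeGroupId := ""
if nodeGroup != nil { // nodeGroup can be nil when the lookup itself failed
	nodeGroupId = nodeGroup.Id()
}
CleanUpAndRecordFailedScaleDownEvent(d.ctx, node, nodeGroupId, drain, d.nodeDeletionTracker, "", result)
```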

com6056 added the kind/bug label Jun 26, 2023
com6056 (Contributor, Author) commented Jun 26, 2023

Made a pass at fixing it; I'm still not 100% sure, but I think it should hopefully be as simple as this: #5892

vadasambar (Member) commented Jun 28, 2023

Thanks for the issue and the PR.

Can you reproduce the issue consistently?

I don't think CA supports running multiple instances in the same cluster. Maybe things are working because you are using two different cloud providers? I haven't seen a case where someone is running multiple CA instances. If the issue is hard to reproduce, I wonder if it's because of a race condition (though I would expect some degree of separation in execution between the two instances).

com6056 (Contributor, Author) commented Jun 28, 2023

Yep, our CA container is restarting pretty consistently with this panic. I'm not sure that us running 2 CA instances has much to do with it; if you look at my PR with the proposed fix, it just looks like we don't have a nil check before trying to call nodeGroup.Id(), so instead I opted to pass through the known nodeGroupId that we already had earlier.
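
Roughly the shape of that change (a sketch of the idea only, not the actual diff in #5892):

```go
// Sketch only: use the nodeGroupId string the caller already resolved instead
// of re-deriving it from a possibly-nil nodeGroup in the error path.
CleanUpAndRecordFailedScaleDownEvent(d.ctx, node, nodeGroupId, drain, d.nodeDeletionTracker, "", result)
```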

com6056 (Contributor, Author) commented Jun 28, 2023

deleteNodesFromCloudProvider calls NodeGroupForNode, which in the case of AWS is https://github.com/kubernetes/autoscaler/blob/8f83f7ec434eda955f176fcb8ad7153d90000477/cluster-autoscaler/cloudprovider/aws/aws_cloud_provider.go#L108C1-L128 and can sometimes return nil along with an error, so we shouldn't be calling a method on nodeGroup in the error path since it could be nil.
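
A simplified sketch of that failure mode (record is a hypothetical stand-in for the real error handling in deleteNodesFromCloudProvider):

```go
// Simplified sketch, not the actual autoscaler code.
nodeGroup, err := cloudProvider.NodeGroupForNode(node)
if err != nil {
	// BUG: the AWS provider can return (nil, err), so nodeGroup may be nil
	// here and nodeGroup.Id() is a nil pointer dereference, matching the
	// panic in delete_in_batch.go above.
	record(node, nodeGroup.Id(), err) // record is a hypothetical helper
	return err
}
```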
