NicReservedForAnotherVM Error leaves nics in MC Resource Group #92
Related to #67, this is the remaining edge case for NIC cleanup.
Notes and thought dump of recent discussions:
- Can we leverage the recently added Retryable Error to delete these NICs after the ~180s it takes before the operation can be retried?
- Does it make sense to have the nodeclaim stick around? It might make more sense to have a background GC queue that retries any failed resource deletion attempts. That way we don't keep an empty nodeclaim with no VM representation around. Note that the finalizer isn't removed from the node until we have properly deleted the nodeclaim.
- A background queue that collects entries from failed deletion attempts and retries them later makes sense for the NIC case, rather than blocking nodeclaim removal for the 180 seconds it takes before we can retry the NIC deletion.
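The background GC queue idea above can be sketched roughly as follows. This is a minimal, hypothetical sketch (not Karpenter's actual implementation): `gcQueue`, `deleteNic`, and `errNicReserved` are stand-ins for the real controller plumbing and ARM delete call.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical error mirroring Azure's NicReservedForAnotherVM response.
var errNicReserved = errors.New("NicReservedForAnotherVM")

// gcQueue is a minimal sketch of a background garbage-collection queue:
// failed NIC deletions are enqueued and retried later, so nodeclaim
// removal does not have to block for the ~180s reservation window.
type gcQueue struct {
	pending chan string
}

func newGCQueue(size int) *gcQueue {
	return &gcQueue{pending: make(chan string, size)}
}

// enqueue records a NIC whose deletion failed so it can be retried later.
func (q *gcQueue) enqueue(nicName string) {
	q.pending <- nicName
}

// drain retries each currently queued deletion once, re-queuing NICs that
// are still reserved and returning the names that were deleted. deleteNic
// stands in for the real ARM delete call.
func (q *gcQueue) drain(deleteNic func(string) error) []string {
	var deleted []string
	n := len(q.pending) // snapshot so re-queued items wait for the next pass
	for i := 0; i < n; i++ {
		nic := <-q.pending
		if err := deleteNic(nic); err != nil {
			q.enqueue(nic) // still reserved; retry on a later pass
		} else {
			deleted = append(deleted, nic)
		}
	}
	return deleted
}

func main() {
	q := newGCQueue(8)
	q.enqueue("aks-default-bzzjc-nic")

	// Simulate the reservation expiring between the first and second pass.
	attempts := 0
	deleteNic := func(name string) error {
		attempts++
		if attempts == 1 {
			return errNicReserved
		}
		return nil
	}

	fmt.Println("first pass deleted:", q.drain(deleteNic))  // []
	fmt.Println("second pass deleted:", q.drain(deleteNic)) // [aks-default-bzzjc-nic]
}
```

In a real controller this drain loop would run periodically (or via a rate-limited workqueue), so the nodeclaim finalizer can be released as soon as the VM is gone while NIC cleanup continues in the background.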
Version
v0.2.0
Expected Behavior
When Karpenter sends a VM POST request to create a VM and that request fails, Karpenter should always clean up the leftover resources from the failed attempt.
Actual Behavior
When Karpenter sends a VM POST request to create a VM and that request fails (due to quota or various other issues), there is a race condition between deletion of the ARM representation of the VM and deletion of the NIC. When we attempt to issue a delete call for the network interface, we get a
NicReservedForAnotherVM
error. This occurs because the ARM representation of the VM we tried to create hasn't been deleted yet when we issue the network interface deletion call, so the delete call fails saying the NIC is reserved for another VM.
One can attempt to fix this by
Steps to Reproduce the Problem
Run `make az-perftest-300`; essentially you just need to scale up to a high volume.
Resource Specs and Logs
```
{"level":"ERROR","time":"2024-01-10T23:27:06.884Z","logger":"controller.nodeclaim.lifecycle","message":"Creating virtual machine \"aks-default-bzzjc\" failed: PUT https://management.azure.com/subscriptions//resourceGroups/MC_blah/providers/Microsoft.Compute/virtualMachines/aks-default-bzzjc\n--------------------------------------------------------------------------------\nRESPONSE 409: 409 Conflict\nERROR CODE: OperationNotAllowed\n--------------------------------------------------------------------------------\n{\n \"error\": {\n \"code\": \"OperationNotAllowed\",\n \"message\": \"Operation could not be completed as it results in exceeding approved Total Regional Cores quota. Additional details - Deployment Model: Resource Manager, Location: uksouth, Current Limit: 100, Current Usage: 60, Additional Required: 48, (Minimum) New Limit Required: 108. Submit a request for Quota increase at by specifying parameters listed in the ‘Details’ section for deployment to succeed. Please read more about quota limits at https://docs.microsoft.com/en-us/azure/azure-supportability/regional-quota-requests\"\n }\n}\n--------------------------------------------------------------------------------\n","commit":"832597b-dirty","nodeclaim":"default-bzzjc","nodepool":"default"}
```