NicReservedForAnotherVM Error leaves nics in MC Resource Group #92

Bryce-Soghigian · 2024-01-17T09:14:45Z

Version

v0.2.0

Expected Behavior

When Karpenter sends a VM POST request to create a VM and that request fails, Karpenter should always clean up the leftover resources from the failed attempt.

Actual Behavior

When Karpenter sends a VM POST request to create a VM and that request fails (due to quota or various other issues), there is a race condition between the deletion of the ARM representation of the VM and the deletion of the nic.

When we attempt to issue a delete call for the network interface, we get a NicReservedForAnotherVm error. This often occurs because the ARM representation of the VM we tried to create hasn't been deleted yet when we issue the network interface deletion call.

As a result, the delete call ultimately fails with an error saying that the nic is reserved for another VM, and the nic is left behind.

One can attempt to fix this by:

Disassociating the nic and retrying deletion (preferred)
Adding controller retries for network interface deletion (this blocks the nodeclaim from failing fast, so not preferred)
Adding another garbage collection controller to deal with ghost nics

Steps to Reproduce the Problem

Run make az-perftest-300. Essentially, you just need to scale up to a high volume.

Resource Specs and Logs

VM Create Fails
{"level":"ERROR","time":"2024-01-10T23:27:06.884Z","logger":"controller.nodeclaim.lifecycle","message":"Creating virtual machine "aks-default-bzzjc" failed: PUT https://management.azure.com/subscriptions//resourceGroups/MC_blah/providers/Microsoft.Compute/virtualMachines/aks-default-bzzjc\n--------------------------------------------------------------------------------\nRESPONSE 409: 409 Conflict\nERROR CODE: OperationNotAllowed\n--------------------------------------------------------------------------------\n{\n "error": {\n "code": "OperationNotAllowed",\n "message": "Operation could not be completed as it results in exceeding approved Total Regional Cores quota. Additional details - Deployment Model: Resource Manager, Location: uksouth, Current Limit: 100, Current Usage: 60, Additional Required: 48, (Minimum) New Limit Required: 108. Submit a request for Quota increase at by specifying parameters listed in the ‘Details’ section for deployment to succeed. Please read more about quota limits at https://docs.microsoft.com/en-us/azure/azure-supportability/regional-quota-requests\"\n }\n}\n--------------------------------------------------------------------------------\n","commit":"832597b-dirty","nodeclaim":"default-bzzjc","nodepool":"default"}
Deletion of network interface fails due to nic reserved
{"level":"ERROR","time":"2024-01-10T23:28:07.848Z","logger":"controller.nodeclaim.lifecycle","message":"networkInterface.Delete for aks-default-bzzjc failed: DELETE https://management.azure.com/subscriptions/redacted/resourceGroups/redacted/providers/Microsoft.Network/networkInterfaces/aks-default-bzzjc\n--------------------------------------------------------------------------------\nRESPONSE 400: 400 Bad Request\nERROR CODE: NicReservedForAnotherVm\n--------------------------------------------------------------------------------\n{\n "error": {\n "code": "NicReservedForAnotherVm",\n "message": "Nic(s) in request is reserved for another Virtual Machine for 180 seconds. Please provide another nic(s) or retry after 180 seconds. Reserved VM: /subscriptions/redacted/resourceGroups/M/providers/Microsoft.Compute/virtualMachines/aks-default-bzzjc",\n "details": []\n }\n}\n--------------------------------------------------------------------------------\n","commit":"832597b-dirty","nodeclaim":"default-bzzjc","nodepool":"default"}

Community Note

Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
If you are interested in working on this issue or have submitted a pull request, please leave a comment


Bryce-Soghigian · 2024-01-17T09:28:33Z

Related to #67, this is the remaining edge case to nic cleanup

Bryce-Soghigian · 2024-04-18T07:34:21Z

Notes and thought dump of recent discussions

Can we leverage the recently added Retryable Error to delete these nics once the 180s reservation window has passed and the operation can be retried?

Does it make sense to have the nodeclaim stick around? It might make more sense to have a background gc queue to retry any failed resource deletion attempts. This way we don't keep an empty nodeclaim with no vm representation around. The finalizer isn't removed from the node until we have properly deleted the nodeclaim.

A background queue that collects entries from failed deletion attempts and retries them later might make sense for the nic case, rather than blocking the nodeclaim from being removed for the 180 seconds we have to wait before the nic deletion can be retried.

Bryce-Soghigian added the area/networking and needs-triage labels Jan 17, 2024

tallaxes added the triage/accepted and kind/bug labels and removed the needs-triage label Jan 17, 2024

Bryce-Soghigian self-assigned this May 10, 2024
