NicReservedForAnotherVM Error leaves nics in MC Resource Group #92

Open
Bryce-Soghigian opened this issue Jan 17, 2024 · 2 comments
Labels: area/networking, kind/bug, triage/accepted

Comments

@Bryce-Soghigian
Contributor

Version

v0.2.0

Expected Behavior

When Karpenter sends a VM POST request to create a VM and that request fails, Karpenter should always clean up the resources left over from the failed attempt.

Actual Behavior

When Karpenter sends a VM POST request to create a VM and that request fails (due to quota or various other issues), there is a race condition between deletion of the ARM representation of the VM and deletion of the NIC.

When we attempt to issue a delete call for the network interface, we get a NicReservedForAnotherVm error. This often occurs because the ARM representation of the VM we tried to create has not yet been deleted when we issue the network interface deletion call.

As a result, the final delete call fails, reporting that the NIC is reserved for another VM.

One can attempt to fix this by:

  1. Disassociating the NIC and retrying deletion (preferred)
  2. Having the controller retry network interface deletion (this blocks the nodeclaim from failing fast, so it is not preferred)
  3. Having another garbage collection controller for dealing with ghost NICs

Steps to Reproduce the Problem

Run make az-perftest-300. Essentially, you just need to scale up to a high volume.

Resource Specs and Logs

  1. VM Create Fails
    {"level":"ERROR","time":"2024-01-10T23:27:06.884Z","logger":"controller.nodeclaim.lifecycle","message":"Creating virtual machine "aks-default-bzzjc" failed: PUT https://management.azure.com/subscriptions//resourceGroups/MC_blah/providers/Microsoft.Compute/virtualMachines/aks-default-bzzjc\n--------------------------------------------------------------------------------\nRESPONSE 409: 409 Conflict\nERROR CODE: OperationNotAllowed\n--------------------------------------------------------------------------------\n{\n "error": {\n "code": "OperationNotAllowed",\n "message": "Operation could not be completed as it results in exceeding approved Total Regional Cores quota. Additional details - Deployment Model: Resource Manager, Location: uksouth, Current Limit: 100, Current Usage: 60, Additional Required: 48, (Minimum) New Limit Required: 108. Submit a request for Quota increase at by specifying parameters listed in the ‘Details’ section for deployment to succeed. Please read more about quota limits at https://docs.microsoft.com/en-us/azure/azure-supportability/regional-quota-requests\"\n }\n}\n--------------------------------------------------------------------------------\n","commit":"832597b-dirty","nodeclaim":"default-bzzjc","nodepool":"default"}
  2. Deletion of network interface fails due to NIC reserved
    {"level":"ERROR","time":"2024-01-10T23:28:07.848Z","logger":"controller.nodeclaim.lifecycle","message":"networkInterface.Delete for aks-default-bzzjc failed: DELETE https://management.azure.com/subscriptions/redacted/resourceGroups/redacted/providers/Microsoft.Network/networkInterfaces/aks-default-bzzjc\n--------------------------------------------------------------------------------\nRESPONSE 400: 400 Bad Request\nERROR CODE: NicReservedForAnotherVm\n--------------------------------------------------------------------------------\n{\n "error": {\n "code": "NicReservedForAnotherVm",\n "message": "Nic(s) in request is reserved for another Virtual Machine for 180 seconds. Please provide another nic(s) or retry after 180 seconds. Reserved VM: /subscriptions/redacted/resourceGroups/M/providers/Microsoft.Compute/virtualMachines/aks-default-bzzjc",\n "details": []\n }\n}\n--------------------------------------------------------------------------------\n","commit":"832597b-dirty","nodeclaim":"default-bzzjc","nodepool":"default"}

@Bryce-Soghigian added the area/networking and needs-triage labels on Jan 17, 2024
@Bryce-Soghigian
Contributor Author

Related to #67. This is the remaining edge case for NIC cleanup.

@tallaxes added the triage/accepted and kind/bug labels and removed the needs-triage label on Jan 17, 2024
@Bryce-Soghigian
Contributor Author

Notes and thought dump of recent discussions

Can we leverage the recently added Retryable Error to delete these NICs after the 180 seconds we have to wait before retrying the operation?

Does it make sense to have the nodeclaim stick around? It might make more sense to have a background GC queue that retries any failed resource deletion attempts. That way we don't keep around an empty nodeclaim with no VM representation. The finalizer isn't removed from the node until we have properly deleted the nodeclaim.

A background queue that collects entries from failed deletion attempts and retries them later might make sense for the NIC case, rather than blocking the nodeclaim from being removed for the 180 seconds we have to wait before retrying the NIC deletion.

@Bryce-Soghigian self-assigned this on May 10, 2024