Intermittent Azure API fault results in zombie NatGateway and persistent shoot creation failure #678
Labels
area/robustness
Robustness, reliability, resilience related
kind/bug
Bug
lifecycle/rotten
Nobody worked on this for 12 months (final aging stage)
platform/azure
Microsoft Azure platform/infrastructure
How to categorize this issue?
/area robustness
/kind bug
/platform azure
This ticket tracks an issue for which a short-term technical solution is not possible. It has, however, caused substantial pain for one or more customers and a perception of poor Gardener robustness. A customer experiences multiple persistent shoot creation failures and is forced to manually clean up infrastructure objects created by Gardener. The goal of this ticket is to communicate the customer impact and potentially drive or inform a longer-term change.
What happened:
During a shoot creation workflow, Azure reported a NatGateway creation failure due to throttling, yet still created a NatGateway object in a failed state. Terraform did not adopt the newly created gateway. The gateway object was abandoned as a zombie: Gardener will not delete it, and its clashing name disrupts further attempts by Gardener to create the NatGateway required for shoot creation. The outcome is a shoot whose creation fails persistently, plus an infrastructure object that requires manual cleanup.
The presumed Azure throttling restriction is subscription-specific, so a single occurrence affects only one Gardener customer. In an automated scenario, however, it is likely to result in multiple failed shoots for that customer.
The problem cannot be immediately resolved in Gardener, because the underlying cause, as currently understood, is a conflict between Azure's failure mode in that specific scenario, and Terraform. TBD: A more precise description of these underlying mechanics is to be added to this ticket shortly.
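To make the manual cleanup step more concrete, here is a minimal sketch of how the zombie NatGateway described above might be identified. It is not part of Gardener or Terraform. It assumes each gateway is a dictionary with `id`, `name` and `provisioningState` keys (the shape assumed here for a JSON listing of NAT gateways from the Azure CLI), and that the resource IDs Terraform tracks are available as a set. The function name `find_zombie_nat_gateways` and the parameter `tracked_ids` are hypothetical.

```python
from typing import Dict, Iterable, List, Set


def find_zombie_nat_gateways(
    gateways: Iterable[Dict[str, str]],
    tracked_ids: Set[str],
) -> List[Dict[str, str]]:
    """Return gateways that are in a failed state and not tracked by Terraform.

    Assumption: a gateway is a "zombie" if its provisioningState is "Failed"
    and its resource ID does not appear in the Terraform state.
    """
    # Azure resource IDs are treated as case-insensitive here.
    tracked = {rid.lower() for rid in tracked_ids}
    zombies = []
    for gw in gateways:
        failed = gw.get("provisioningState") == "Failed"
        untracked = gw.get("id", "").lower() not in tracked
        if failed and untracked:
            zombies.append(gw)
    return zombies


if __name__ == "__main__":
    # Hypothetical sample data; real IDs would be full Azure resource IDs.
    sample = [
        {"id": "/rg/shoot-a/natGateways/gw-1", "name": "gw-1",
         "provisioningState": "Failed"},
        {"id": "/rg/shoot-b/natGateways/gw-2", "name": "gw-2",
         "provisioningState": "Succeeded"},
    ]
    for gw in find_zombie_nat_gateways(sample, tracked_ids=set()):
        print(gw["name"])
```

A check like this only lists candidates. Whether a flagged gateway is safe to delete still needs to be confirmed manually.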
Anything else we need to know?:
Environment: