[investigate] Rollback Flow Failure Points Not Properly Handled #3892

Meg raised this, and I want to make sure the metald side has the facilities in place to support it.

During the rollback operation, the following failure points need to be addressed to ensure robust error handling:

1. **Step 1 (DB Check)**: If checking the partition DB for VMs fails due to connectivity issues, the RPC should return a failure response immediately, and no changes should be made to the system state.

2. **Step 3 (VM Provisioning)**: If VM provisioning fails (e.g., insufficient capacity), the RPC should return a failure response immediately, preventing hostname switching and maintaining the current active deployment.

3. **Step 4 (Hostname Switching)**: If the database transaction fails during hostname switching, the transaction should be rolled back automatically. Successfully booted VMs should remain running for future requests, the RPC should return a failure response, and no traffic routing changes should occur.

We also need logging at each of these failure points, with enough context (the step that failed and the underlying error) to make debugging easier.
