[investigate] Rollback Flow Failure Points Not Properly Handled #3892

@imeyer

Description

Meg came up with this and I want to ensure the metald side has these facilities in place to assist.

During the rollback operation, the following failure points need to be addressed to ensure robust error handling:

  1. Step 1 (DB Check): If checking the partition DB for VMs fails due to connectivity issues, the RPC should return a failure response immediately, and no changes should be made to the system state.

  2. Step 3 (VM Provisioning): If VM provisioning fails (e.g., insufficient capacity), the RPC should return a failure response immediately, preventing hostname switching and maintaining the current active deployment.

  3. Step 4 (Hostname Switching): If the database transaction fails during hostname switching, the transaction should be rolled back automatically. Successfully booted VMs should remain running for future requests, the RPC should return a failure response, and no traffic routing changes should occur.

We also need to ensure that each of these failure points is logged clearly, for easier debugging.
