Improve our system of paired deployments / operations #1668
Labels
Engineering:SRE
Cloud infrastructure operations and development.
Enhancement
An improvement to something or creating something new.
Context
We recently had an incident that occurred because of a mistake made while decommissioning some cloud infrastructure, reported in:
Our mistake was that we decommissioned the cluster incorrectly and didn't double-check that it was entirely shut down. As a result, it accrued cloud costs in the background. Because these costs weren't too high, they went unnoticed for some time.
We should expect that our team will make mistakes like this; it is normal human nature. To reduce the associated risk, we should have a system of team checks that makes us more likely to catch these kinds of issues in the future.
Proposal
I propose that we implement a system of paired deployments whenever we perform an operation in the cloud infrastructure. The goal of paired deployments is to:
This could be done either synchronously (by having live paired deployment sessions) or asynchronously (by having two team members assigned on an issue, and asking each of them to confirm that it has been completed as expected).
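As a rough illustration of the asynchronous option, here is a minimal Python sketch of a two-person sign-off. Everything in it is hypothetical (the `PairedOperation` class, its fields, and the example names are not part of any existing tooling); it only shows the rule that an operation counts as done once both assigned team members have independently confirmed it.

```python
from dataclasses import dataclass, field


@dataclass
class PairedOperation:
    """Hypothetical record of one cloud operation that needs two sign-offs."""

    description: str
    assignees: tuple[str, str]
    confirmations: set[str] = field(default_factory=set)

    def __post_init__(self) -> None:
        # A pair must be two different people, otherwise there is no second check.
        if len(set(self.assignees)) != 2:
            raise ValueError("a paired operation needs two distinct assignees")

    def confirm(self, member: str) -> None:
        # Only the assigned team members may confirm the operation.
        if member not in self.assignees:
            raise ValueError(f"{member} is not assigned to this operation")
        self.confirmations.add(member)

    def is_complete(self) -> bool:
        # Complete only once every assignee has independently confirmed.
        return self.confirmations == set(self.assignees)


# Example with made-up names: one confirmation is not enough.
op = PairedOperation("Decommission cluster", ("alice", "bob"))
op.confirm("alice")
print(op.is_complete())  # False
op.confirm("bob")
print(op.is_complete())  # True
```

In practice the same rule could live in the issue itself, for example as one checkbox per assignee, rather than in code.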
Updates and actions
No response