Improve our system of paired deployments / operations #1668

choldgraf · 2022-08-31T06:16:19Z

Context

We recently had an incident that occurred because of a mistake made while decommissioning some cloud infrastructure, reported in:

[Incident] CarbonPlan AWS hub had running infrastructure we didn't track #1666

Our error was that we decommissioned the cluster incorrectly and didn't double-check that it was entirely shut down. As a result, it accrued cloud costs in the background. Because these costs weren't too high, the problem went unnoticed for some time.

We should expect that our team will make mistakes like this; it is human nature. To reduce the associated risk, we should have a system of team checks that makes us more likely to catch these kinds of issues in the future.

Proposal

I propose that we implement a system of paired deployments whenever we perform an operation in the cloud infrastructure. The goal of paired deployments is to:

Provide at least two pairs of eyes to double-check work
Provide assistance and support when debugging and changing infrastructure
Provide an opportunity to learn and share knowledge among the team

This could be done either synchronously (by having live paired deployment sessions) or asynchronously (by having two team members assigned to an issue, and asking each of them to confirm that it has been completed as expected).
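
To make the asynchronous option concrete, here is a minimal sketch of the two-person confirmation rule in Python. It is only an illustration, not something we have implemented: the `PairedOperation` class, its fields, and the member names are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class PairedOperation:
    """Hypothetical model of an operation that two people must confirm."""

    description: str
    assignees: tuple[str, str]  # assumption: exactly two assigned team members
    confirmations: set[str] = field(default_factory=set)

    def __post_init__(self) -> None:
        # The pair must be two different people, otherwise nothing is double-checked.
        if len(set(self.assignees)) != 2:
            raise ValueError("a paired operation needs two distinct assignees")

    def confirm(self, member: str) -> None:
        # Only someone assigned to the operation may confirm it.
        if member not in self.assignees:
            raise ValueError(f"{member} is not assigned to this operation")
        self.confirmations.add(member)

    def is_complete(self) -> bool:
        # Complete only once both assignees have confirmed.
        return self.confirmations == set(self.assignees)


# Example with made-up names: one confirmation is not enough.
op = PairedOperation("Decommission a cluster", ("alice", "bob"))
op.confirm("alice")
print(op.is_complete())  # False
op.confirm("bob")
print(op.is_complete())  # True
```

The point of the sketch is that an operation counts as done only when both assigned people have independently confirmed it, which is the check that was missing in the incident above.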

Updates and actions

No response

yuvipanda · 2024-07-01T23:19:12Z

Handled by various other improvements in our processes.

choldgraf added the Enhancement and Engineering:SRE labels Aug 31, 2022

choldgraf mentioned this issue Aug 31, 2022

[Incident] CarbonPlan AWS hub had running infrastructure we didn't track #1666

damianavila added this to DEPRECATED Engineering and Product Backlog Sep 12, 2022

damianavila moved this to Needs Shaping / Refinement in DEPRECATED Engineering and Product Backlog Sep 13, 2022

yuvipanda closed this as completed Jul 1, 2024

github-project-automation bot moved this from Needs Shaping / Refinement to Complete in DEPRECATED Engineering and Product Backlog Jul 1, 2024
