Improve our system of paired deployments / operations #1668

Closed
choldgraf opened this issue Aug 31, 2022 · 1 comment
Labels
  • Engineering:SRE: Cloud infrastructure operations and development.
  • Enhancement: An improvement to something or creating something new.

Comments

@choldgraf
Member

Context

We recently had an incident that occurred because of a mistake made while decommissioning some cloud infrastructure, reported in:

Our mistake was that we decommissioned the cluster incorrectly and didn't double-check that it was entirely shut down. As a result, it accrued cloud costs in the background. Because these costs weren't very high, they went unnoticed for some time.

We should expect that our team will make mistakes like this; that is human nature. To reduce the associated risk, we should have a system of team checks that makes us more likely to catch these kinds of issues in the future.

Proposal

I propose that we implement a system of paired deployments whenever we perform an operation on our cloud infrastructure. The goals of paired deployments are to:

  • Provide at least two pairs of eyes to double-check work
  • Provide assistance and support when debugging and changing infrastructure
  • Provide an opportunity to learn and share knowledge among the team

This could be done either synchronously (by holding live paired deployment sessions) or asynchronously (by assigning two team members to an issue and asking each of them to confirm that it has been completed as expected).

Updates and actions

No response

@choldgraf choldgraf added Enhancement An improvement to something or creating something new. Engineering:SRE Cloud infrastructure operations and development. labels Aug 31, 2022
@damianavila damianavila moved this to Needs Shaping / Refinement in DEPRECATED Engineering and Product Backlog Sep 13, 2022
@yuvipanda
Member

Handled by various other improvements in our processes.

@github-project-automation github-project-automation bot moved this from Needs Shaping / Refinement to Complete in DEPRECATED Engineering and Product Backlog Jul 1, 2024