[FR] Enable operators to "cure" bugged instances using Terraform #417

mogul · 2022-03-10T18:39:04Z

Is your feature request related to a problem? Please describe.

When a brokered instance is in a bad state, the only way offered by the CSB to recover it is to deprovision and reprovision it, often losing all the data/state contained in it. The upcoming change to allow for updating HCL code during an update operation is great, but we expect that it won't cover all the occasions when, for example, an operator might need to tinker with the Terraform state to fix problems (e.g., migrating resources between providers).

Describe the solution you'd like

Given this recent change that makes the rehydration of Terraform workspaces a separate process from what commands will be run there, we would like two new client commands:

$(CSB_EXEC) tf exec <INSTANCE> <TERRAFORM-ARGS> - Do the normal workspace rehydration, then run Terraform in the workspace with the commands specified. Display the output as if terraform was run locally.
$(CSB_EXEC) tf state-[pull|push] <INSTANCE> - This would be the equivalent of terraform state pull and terraform state push. Essentially this would enable an operator to do operations outside of the CSB, and then inform the CSB of the new state that results.

Describe alternatives you've considered

We have tried to accomplish this by looking in /proc for the working directory of a terraform operation the CSB has underway but there's no way to tell the CSB to stop and give us a chance to work with that environment... The best we've been able to do is set TF_LOG and TF_LOG_PATH environment variables for the CSB to at least give us a chance to see what's going on during the operation.
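For readers unfamiliar with that workaround, here is a minimal Python sketch of what it amounts to: adding two variables to the environment of whatever process launches Terraform. `TF_LOG` and `TF_LOG_PATH` are Terraform's standard logging variables; the helper name `debug_env` and the log path are hypothetical and only for illustration.

```python
import os


def debug_env(log_path, level="TRACE", base=None):
    """Return a copy of the environment with Terraform logging switched on.

    `base` defaults to the current process environment; passing a dict
    makes the function easy to test. Nothing is modified in place.
    """
    env = dict(os.environ if base is None else base)
    env["TF_LOG"] = level          # e.g. TRACE, DEBUG, INFO, WARN, ERROR
    env["TF_LOG_PATH"] = log_path  # file that Terraform appends its log to
    return env


if __name__ == "__main__":
    # Hypothetical path; the resulting dict would be passed as `env=` when
    # launching the broker process.
    env = debug_env("/tmp/csb-terraform.log")
    print(env["TF_LOG"], env["TF_LOG_PATH"])
```

This only makes the operation observable after the fact; it does not give the operator a way to pause or intervene, which is the gap this request is about.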

Additional Context

Priority

High

We are constantly nervous that we will only be able to destroy brokered instances, whereas statically-deployed instances give us the opportunity to try to get the instance working again. This is making us reluctant to use the broker as intended.

We have a service that is expensive to provision/deprovision (EKS, 20 minutes) and a service built on top of that one that's expensive to repopulate with data (Solr, 3+ days). Having to destroy/recreate rather than "cure" existing instances can be deadly to our product's availability.

Priority Context

With the new feature to update HCL code during an update, there may be unforeseen complications when brokerpak authors start using the feature. This will enable them to move forward with existing instances even as those complications are addressed in the CSB.

Platform

N/A

Applicable Services

It applies to EKS and Solr, for which we maintain brokerpaks.

cf-gitbot · 2022-03-10T18:39:06Z

We have created an issue in Pivotal Tracker to manage this:

https://www.pivotaltracker.com/story/show/181529116

The labels on this github issue will be updated when the story is started.

tinygrasshopper · 2022-03-11T16:07:20Z

A couple of points:
In my mind, an interface like tf open-the-hood and tf close-the-hood is really risky, as the user might not be aware of all the assumptions the CSB is making (and will make in the future) around how it's storing Terraform in its DB. It also leaves the door open for a class of errors when folks forget to "close the hood" after changing the remote infrastructure.

A contract like csb tf exec <tf-if> <command> might be a better fit for purpose: the csb utility is then responsible for hydrating the workspace, running the command, and cleaning up the workspace.
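To make that contract concrete, here is a rough Python sketch of the hydrate, run, persist, clean up sequence. It is not how the CSB is implemented: the `store` dict stands in for the CSB database, and `tf_exec` and its parameters are hypothetical names. It also assumes every workspace file is text, which is a simplification.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def tf_exec(store, instance_id, args, binary=("terraform",)):
    """Hydrate a scratch workspace, run one command in it, save the files back.

    `store` maps instance IDs to {filename: contents} and is a hypothetical
    stand-in for wherever the broker keeps its Terraform workspaces.
    """
    with tempfile.TemporaryDirectory() as workdir:
        # 1. Hydrate: write the stored files into a scratch directory.
        for name, contents in store[instance_id].items():
            Path(workdir, name).write_text(contents)
        # 2. Run the requested command with the workspace as its cwd.
        result = subprocess.run(
            [*binary, *args], cwd=workdir, capture_output=True, text=True
        )
        # 3. Persist: save top-level files back even when the command fails,
        #    because a partially applied change can still have altered state.
        store[instance_id] = {
            p.name: p.read_text() for p in Path(workdir).iterdir() if p.is_file()
        }
    # 4. Clean up: leaving the `with` block deletes the scratch directory.
    return result


if __name__ == "__main__":
    # Demo with the Python interpreter standing in for the terraform binary,
    # so the sketch runs on a machine without Terraform installed.
    store = {"demo": {"terraform.tfstate": "{}"}}
    result = tf_exec(
        store, "demo", ["-c", "print('ran in workspace')"], binary=(sys.executable,)
    )
    print(result.stdout, end="")
```

Because the caller never sees the workspace directory, there is no "hood" left open to forget about: whatever the command did is either saved back or discarded when the function returns.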

Secondly, CSB currently relies on mutex locking within the workspace to ensure different processes don't modify the same workspace. If we implemented the feature above, we would have to use another method to ensure two processes are not modifying the cloud provider at the same time.
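One possible shape for such a cross-process guard is an exclusive lock file per instance, sketched below in Python. This is an illustration under assumptions, not a proposal for the actual mechanism: `instance_lock` and `InstanceLocked` are hypothetical names, and a lock file only protects processes that share a filesystem. A broker and a CLI running on different hosts would need a lock held somewhere both can see, such as the database.

```python
import os
import tempfile
from contextlib import contextmanager


class InstanceLocked(Exception):
    """Raised when another process already holds the lock for an instance."""


@contextmanager
def instance_lock(instance_id, lock_dir=None):
    """Hold an exclusive lock file for one instance for the `with` block."""
    lock_dir = tempfile.gettempdir() if lock_dir is None else lock_dir
    path = os.path.join(lock_dir, f"{instance_id}.lock")
    try:
        # O_CREAT | O_EXCL makes creation atomic: it fails if the file exists.
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        raise InstanceLocked(f"{instance_id} is locked by {path}") from None
    try:
        # Record the owner's PID so a stale lock can be investigated by hand.
        os.write(fd, str(os.getpid()).encode())
        yield path
    finally:
        os.close(fd)
        os.remove(path)
```

A caller would wrap the whole hydrate-and-run sequence in `with instance_lock(instance_id):`. If the process is killed while holding the lock, the file stays behind and has to be removed manually, which is the usual trade-off of this approach.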

mogul · 2022-03-14T15:48:53Z

That option ("run this tf command") would be fine as well; we just need a way to understand and fix what's going on. Sometimes that requires terraform state list or terraform apply -replace foo, things we have no way to do now.

tinygrasshopper · 2022-03-18T13:54:23Z

I think the feature overall makes a lot of sense. It's not in the project's upcoming roadmap, but I think we should add that feature to CSB if you can create a pull request.

A couple of things to keep in mind for the pull request:

You should add an integration test for this functionality to ensure we don't break it in subsequent changes.
You should add some documentation on the command, explaining that CSB serve must not be running while this debugging/curing is in progress, to sidestep the mutex-related issues.

mogul · 2022-03-23T14:45:11Z

Sadly I'm not enough of a Go programmer to attempt this! 😞

mogul · 2022-03-30T21:54:20Z

Note I've updated the original post to reflect this discussion, adding just csb tf exec.

However, I also added another potential command, csb tf state-[pull|push]. These would enable operators to mutate state outside of the CSB, and then reset its notion of reality. This corresponds to terraform state pull and terraform state push. (You might consider this a simpler alternative operator experience for doing a subsume.)

mogul added the enhancement New feature or request label Mar 10, 2022

cf-gitbot added the unscheduled label Mar 10, 2022

mogul changed the title ~~[FR] Provision a workspace for operators to recover bugged instances~~ [FR] Provision a workspace for operators to "cure" bugged instances Mar 10, 2022

pivotal-marcela-campo assigned tinygrasshopper Mar 18, 2022

pivotal-marcela-campo added help wanted The team has de-prioritized this and could use your help! and removed unscheduled labels Mar 24, 2022

mogul changed the title ~~[FR] Provision a workspace for operators to "cure" bugged instances~~ [FR] Enable operators to "cure" bugged instances using Terraform Mar 30, 2022

pivotal-marcela-campo unassigned tinygrasshopper Sep 13, 2022
