feat: [Zero-Downtime] - safe migration #1890

Open
2 tasks
Tomasz-Smelcerz-SAP opened this issue Sep 25, 2024 · 0 comments
Labels
kind/feature Categorizes issue or PR as related to a new feature.


Tomasz-Smelcerz-SAP commented Sep 25, 2024

Description

In order to introduce the zero-downtime procedure, we need a safe migration path.
By "safe migration path" I mean a setup where a "revert" to the old behavior is as simple as possible, in case there is a bug in the new solution or it does not work as expected for whatever reason.
In particular, the revert should be as simple as switching some Lifecycle-Manager flags (runtime arguments) - the fewer, the better.

An idea for how this could work:

The current solution uses root certificate secret as the Istio-Gateway secret directly.
Once the secret is rotated, the LM code deletes the client secrets (that are based on the "root"), causing cert-manager to renew them.
The new solution uses a dedicated secret for the Istio-Gateway, managed entirely by Lifecycle-Manager.
This new secret, of course, has a different name from the "root" secret, which is still managed and rotated by cert-manager.
The dedicated secret decouples Istio-Gateway from changes to the root certificate - it is Lifecycle-Manager that decides when to propagate the changes.
How can we revert from the new solution to the old one?
If the secret name used for Istio-Gateway is different in the old and new solution, then we would have to change the Helm Charts, or at least the entry in the values.yaml. In addition, we would probably need to deploy a different LM version - the one with the "old" code. But then we would be reverting the whole Lifecycle-Manager version, along with all the other features, security fixes, etc.

To improve the situation, we should:

  • have an LM version that is capable of running either the old or the new code, depending on a feature flag.
  • keep the Istio Gateway secret name the same in both scenarios, to avoid manual tweaks in the Helm charts.

The first requirement is easy.

The second requirement is trickier. To make it work, we should change the Lifecycle-Manager in the following way (it's just an idea, maybe it can be done in a simpler way):

  • The root certificate is no longer directly used to configure Istio Gateway
  • We introduce our own, managed secret that is a copy of the Istio-Gateway secret.
  • There is a goroutine that actively syncs the root secret to the copy.
  • Minimal changes in the code are required - LM no longer "watches" the root secret, it watches the copy instead
  • Kustomize configuration of the Istio Gateway (and Helm charts) are changed so that the new secret is used instead of the "root" one.
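
The syncing goroutine described in the list above could look roughly like the following. This is only a minimal sketch under stated assumptions: the `Secret` type, the `SecretStore` interface, the in-memory store, and the secret names `root-cert` and `gateway-cert` are all hypothetical stand-ins for the real cluster API and are not taken from the Lifecycle-Manager code base. It uses periodic polling for simplicity; a watch-based approach would work as well.

```go
package main

import (
	"bytes"
	"context"
	"fmt"
	"sync"
	"time"
)

// Secret is a simplified, hypothetical stand-in for a Kubernetes TLS secret.
type Secret struct {
	Name string
	Data map[string][]byte
}

// SecretStore is a hypothetical abstraction over the cluster API.
type SecretStore interface {
	Get(name string) (Secret, bool)
	Put(s Secret)
}

// memStore is an in-memory SecretStore used only for this illustration.
type memStore struct {
	mu      sync.Mutex
	secrets map[string]Secret
}

func newMemStore() *memStore { return &memStore{secrets: map[string]Secret{}} }

func (m *memStore) Get(name string) (Secret, bool) {
	m.mu.Lock()
	defer m.mu.Unlock()
	s, ok := m.secrets[name]
	return s, ok
}

func (m *memStore) Put(s Secret) {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.secrets[s.Name] = s
}

// sameData reports whether two secrets carry identical key/value data.
func sameData(a, b map[string][]byte) bool {
	if len(a) != len(b) {
		return false
	}
	for k, v := range a {
		if w, ok := b[k]; !ok || !bytes.Equal(v, w) {
			return false
		}
	}
	return true
}

// syncOnce copies the root secret's data into the copy when they differ.
// It returns true if the copy was written.
func syncOnce(store SecretStore, rootName, copyName string) bool {
	root, ok := store.Get(rootName)
	if !ok {
		return false // nothing to copy yet
	}
	if cur, exists := store.Get(copyName); exists && sameData(root.Data, cur.Data) {
		return false // already in sync
	}
	data := make(map[string][]byte, len(root.Data))
	for k, v := range root.Data {
		data[k] = append([]byte(nil), v...) // deep copy of each value
	}
	store.Put(Secret{Name: copyName, Data: data})
	return true
}

// runSync syncs immediately and then on every tick until ctx is cancelled.
func runSync(ctx context.Context, store SecretStore, rootName, copyName string, interval time.Duration) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		syncOnce(store, rootName, copyName)
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
		}
	}
}

func main() {
	store := newMemStore()
	store.Put(Secret{Name: "root-cert", Data: map[string][]byte{"tls.crt": []byte("v1")}})

	ctx, cancel := context.WithTimeout(context.Background(), 50*time.Millisecond)
	defer cancel()
	runSync(ctx, store, "root-cert", "gateway-cert", 10*time.Millisecond)

	got, _ := store.Get("gateway-cert")
	fmt.Printf("gateway-cert tls.crt = %s\n", got.Data["tls.crt"])
}
```

In a real controller the goroutine would be started with `go runSync(...)` and stopped through the context on shutdown. Keeping the comparison in `syncOnce` means the copy is only rewritten when the root secret actually changed.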

By introducing this solution, we achieve the following:

  • Both the old and new solution will use the same secret for the Istio-Gateway. Hence, when reverting, there's no need to change anything in the Helm charts.
  • The revert can be accomplished by switching a single boolean flag on the Lifecycle-Manager, like: --cert-management-legacy=true, assuming no additional configuration flags are required.
  • Don't worry about the "copy" or the new goroutine - in the "new" solution these also exist. And it's just a temporary solution. Once the new solution works correctly, we'll remove the support for the "old" one entirely.
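
A single boolean switch between the two code paths could be sketched as below. This is an illustration only: the flag is spelled `cert-management-legacy` here, modeled on the example in the list above, and the `CertStrategy` interface and both strategy types are hypothetical names, not existing Lifecycle-Manager code. The sketch uses Go's standard `flag` package; the real binary may wire its flags differently.

```go
package main

import (
	"flag"
	"fmt"
)

// CertStrategy is a hypothetical interface over the two certificate-handling
// behaviors; real implementations would carry the reconciliation logic.
type CertStrategy interface {
	Name() string
}

// legacyStrategy stands for the old behavior (hypothetical type).
type legacyStrategy struct{}

func (legacyStrategy) Name() string { return "legacy" }

// zeroDowntimeStrategy stands for the new behavior (hypothetical type).
type zeroDowntimeStrategy struct{}

func (zeroDowntimeStrategy) Name() string { return "zero-downtime" }

// selectStrategy parses the runtime arguments and picks the code path.
// The new behavior is the default; the flag opts back into the old one.
func selectStrategy(args []string) (CertStrategy, error) {
	fs := flag.NewFlagSet("lifecycle-manager", flag.ContinueOnError)
	legacy := fs.Bool("cert-management-legacy", false,
		"use the old certificate handling instead of the zero-downtime one")
	if err := fs.Parse(args); err != nil {
		return nil, err
	}
	if *legacy {
		return legacyStrategy{}, nil
	}
	return zeroDowntimeStrategy{}, nil
}

func main() {
	for _, args := range [][]string{{}, {"--cert-management-legacy=true"}} {
		s, err := selectStrategy(args)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("args=%v -> %s\n", args, s.Name())
	}
}
```

Because both strategies would read the same Istio-Gateway secret, flipping the flag and restarting the process is the whole revert in this sketch; nothing in the Helm charts depends on which branch is taken.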

Reasons

Safe migration path - the ability to revert quickly, and with minimal risk of doing it wrong, in case of trouble.

Acceptance Criteria

  • Double-check that the proposed solution is not missing anything important
  • Verify with the security team whether the LM-managed secret is OK (we need this anyway for the "new solution")

Feature Testing

No response

Testing approach

No response

Attachments

No response

Related Issues:

#1430

@Tomasz-Smelcerz-SAP Tomasz-Smelcerz-SAP added the kind/feature Categorizes issue or PR as related to a new feature. label Sep 25, 2024