Fix Vault sealing itself when a storage error occurs while unsealing #19459

Open: wants to merge 1 commit into main

@remilapeyre remilapeyre commented Mar 5, 2023

When running in HA mode Vault will perform some operations right after getting leadership:

  1. save the upgrades of the keyring to the storage backend so that they can be loaded by the standby nodes with no disruption
  2. reload the root key
  3. reload the entire keyring
  4. reload the Shamir keys
  5. start a background task to clean up the upgrades written in step 1 to the storage backend after they have been loaded by the standby nodes

If any error happens during those tasks, Vault assumes that the keys it has loaded were not correct. It then relinquishes the leadership lock so that another server can start active operation, and finally seals itself.

This behavior is correct in some cases but not always. For example, if Vault loses leadership while performing those operations, they are cancelled. The node should then go into standby mode but not be sealed, because the error is not related to an issue with the keys. This case has been specifically handled by 8e93f59.
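
The behavior described so far (run the steps in order, stop at the first error, seal unless leadership was lost) can be sketched roughly as follows. This is an illustration only: `step`, `runSteps`, `shouldSeal` and the two sample errors are hypothetical names and do not come from the Vault code base.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// step is a hypothetical stand-in for one post-unseal operation.
type step struct {
	name string
	run  func(ctx context.Context) error
}

// runSteps executes the steps in order and stops at the first failure,
// wrapping the error with the name of the step that produced it.
func runSteps(ctx context.Context, steps []step) error {
	for _, s := range steps {
		// A cancelled context models the loss of leadership.
		if err := ctx.Err(); err != nil {
			return fmt.Errorf("%s: %w", s.name, err)
		}
		if err := s.run(ctx); err != nil {
			return fmt.Errorf("%s: %w", s.name, err)
		}
	}
	return nil
}

// Sample errors used for illustration (assumptions, not Vault errors).
var (
	errUnreachable    = errors.New("connection refused")
	errLeadershipLost = fmt.Errorf("leadership lost: %w", context.Canceled)
)

// shouldSeal models the behavior before this patch: every failure seals
// the node, except a cancellation caused by losing leadership.
func shouldSeal(err error) bool {
	return err != nil && !errors.Is(err, context.Canceled)
}

func main() {
	ok := func(context.Context) error { return nil }
	steps := []step{
		{"persist keyring upgrades", ok},
		{"reload root key", ok},
		{"reload keyring", func(context.Context) error { return errUnreachable }},
		{"reload Shamir keys", ok},
		{"schedule upgrade cleanup", ok},
	}

	// A storage failure in step 3 seals the node.
	err := runSteps(context.Background(), steps)
	fmt.Println(err, "-> seal:", shouldSeal(err))

	// A cancelled context (leadership lost) does not.
	ctx, cancel := context.WithCancel(context.Background())
	cancel()
	err = runSteps(ctx, steps)
	fmt.Println(err, "-> seal:", shouldSeal(err))
}
```

In this sketch a plain network failure during step 3 is indistinguishable from a bad key, which is the gap the rest of this description addresses.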

Other issues can also trigger Vault sealing itself, such as a problem with the storage backend. Most storage backends, like Consul, will fail if there are network issues. This means that Vault can seal itself during a network disruption and not unseal automatically once the disruption is fixed, which defeats the purpose of high-availability backends and auto-unseal.

This patch does two things. First, it improves the sys/health endpoint so that it returns the specific status code 523, with a new "post_unseal_failed" attribute in the body, when an error has happened during the post-unseal operations.
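
As a rough illustration of what a monitoring probe would see, the sketch below serves a reduced health response in-process. Only the 523 status code and the "post_unseal_failed" attribute come from this description; `healthBody`, `healthHandler` and `probe` are hypothetical names, and the other status codes and fields of the real endpoint are left out.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// statusPostUnsealFailed is the status code this description assigns to a
// failed post-unseal phase.
const statusPostUnsealFailed = 523

// healthBody is a reduced, hypothetical response body.
type healthBody struct {
	Sealed           bool `json:"sealed"`
	PostUnsealFailed bool `json:"post_unseal_failed"`
}

// healthHandler reports whatever state() returns, using 523 when the
// post-unseal phase has failed and 200 otherwise.
func healthHandler(state func() healthBody) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		body := state()
		code := http.StatusOK
		if body.PostUnsealFailed {
			code = statusPostUnsealFailed
		}
		w.Header().Set("Content-Type", "application/json")
		w.WriteHeader(code)
		_ = json.NewEncoder(w).Encode(body)
	}
}

// probe calls the handler in-process and returns the status code and the
// decoded body, the way a health check would.
func probe(h http.Handler) (int, healthBody, error) {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, "/v1/sys/health", nil))
	var body healthBody
	err := json.NewDecoder(rec.Body).Decode(&body)
	return rec.Code, body, err
}

func main() {
	failed := healthHandler(func() healthBody {
		return healthBody{PostUnsealFailed: true}
	})
	code, body, err := probe(failed)
	fmt.Println(code, body.PostUnsealFailed, err)
}
```

A dedicated status code lets a load balancer or alerting rule tell this state apart from a sealed or standby node without parsing the body.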

Second, it improves the resiliency of the post-unseal process by not sealing Vault when an error happens while accessing the storage backend. Those errors are only logged: the leadership is relinquished and sys/health returns 523, but the node tries again to take the leadership and resumes operations once the underlying issue with the storage backend has been resolved.

This is a compromise between keeping Vault alive despite small network issues and still clearly reporting an error to the operators in case manual intervention is required.

Other errors, such as failing to decrypt or decode the keyring, will continue to seal Vault as before.
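
The resulting decision can be summarized with a small classifier. This is a sketch under assumptions, not the code in this patch: it supposes that storage failures can be recognized through a sentinel error (`errStorage`, hypothetical), and `action`, `classify` and the other sample errors are likewise illustrative names.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// action is what the node does after the post-unseal phase returns.
type action string

const (
	stayActive       action = "stay active"
	goStandby        action = "go to standby"
	stepDownAndRetry action = "step down, report 523, retry"
	sealNode         action = "step down and seal"
)

// Hypothetical sentinel errors used to tell failure classes apart.
var (
	errStorage        = errors.New("storage backend error")
	errDecrypt        = errors.New("keyring decryption failed")
	errLeadershipLost = fmt.Errorf("leadership lost: %w", context.Canceled)
)

// classify maps a post-unseal error to an action. Anything that is not a
// cancellation or a storage failure is treated as a key problem and seals.
func classify(err error) action {
	switch {
	case err == nil:
		return stayActive
	case errors.Is(err, context.Canceled):
		return goStandby
	case errors.Is(err, errStorage):
		return stepDownAndRetry
	default:
		return sealNode
	}
}

func main() {
	wrapped := fmt.Errorf("reload keyring: %w", errStorage)
	for _, err := range []error{nil, errLeadershipLost, wrapped, errDecrypt} {
		fmt.Printf("%v -> %s\n", err, classify(err))
	}
}
```

Sealing stays the default branch in this sketch, so an error that nobody classified still fails closed.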

Related to #10552
Related to #3896

@remilapeyre remilapeyre marked this pull request as ready for review March 6, 2023 00:41
@remilapeyre remilapeyre requested a review from a team March 6, 2023 00:41
@remilapeyre remilapeyre requested a review from yhyakuna as a code owner March 6, 2023 00:41
@sgmiller sgmiller requested review from a team February 29, 2024 18:56
@heatherezell heatherezell added the bug label Mar 21, 2024
@heatherezell heatherezell requested a review from schavis March 21, 2024 22:41
@heatherezell

Tagging @schavis for docs review <3

@biazmoreira

Hi, @remilapeyre! Thanks for the PR. Would you be able to fix the conflicting files so we can review this? Thanks!

@heatherezell

Quick ping here! @remilapeyre if you can resolve the merge conflicts, I'll push this in front of an engineer. Thanks!

Labels: bug, core/seal, waiting-for-response