Fix Vault sealing itself when a storage error occurs while unsealing #19459

remilapeyre · 2023-03-05T23:51:48Z

When running in HA mode, Vault performs some operations right after getting leadership:

1. save the upgrades of the keyring to the storage backend so that they can be loaded by the standby nodes with no disruption
2. reload the root key
3. reload the entire keyring
4. reload the Shamir keys
5. start a background task to clean up the updates written in step 1 to the storage backend after they have been loaded by the standby nodes

If any error happens during those tasks, Vault assumes that the keys it has loaded are not correct. It then relinquishes the leadership lock so that another server can start active operation, and finally seals itself.

This is the correct behavior in some cases, but not always. For example, if Vault loses leadership while performing those operations, they will be cancelled. The node should then go into standby mode but not be sealed, because the error is not related to an issue with the keys. This case has been specifically handled by 8e93f59.

There are other issues that can trigger Vault sealing itself, such as a problem with the storage backend. Most storage backends, like Consul, will fail if there are network issues. This means that Vault can seal itself when there is a network disruption and not unseal automatically when the disruption is fixed. This defeats the purpose of high-availability backends and auto-unseal.

This patch does two things. First, it improves the sys/health endpoint to return the specific status code 523 when an error has happened during the post-unseal operations, with a new "post_unseal_failed" attribute in the body.
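For operators who poll sys/health, a check for the new state could look like the sketch below. It relies only on the two facts stated above (status code 523 and the "post_unseal_failed" attribute). The attribute being a JSON boolean, the `needsAttention` helper, and the stand-in test server are assumptions made for illustration, not part of this patch.

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
)

// statusPostUnsealFailed is the status code this PR proposes for sys/health.
const statusPostUnsealFailed = 523

// healthBody holds the only field this sketch reads from the response body.
// Assumption: "post_unseal_failed" is encoded as a JSON boolean.
type healthBody struct {
	PostUnsealFailed bool `json:"post_unseal_failed"`
}

// needsAttention (hypothetical helper) reports whether a sys/health response
// signals that the post-unseal operations failed on the node.
func needsAttention(status int, body []byte) (bool, error) {
	var h healthBody
	if err := json.Unmarshal(body, &h); err != nil {
		return false, err
	}
	return status == statusPostUnsealFailed || h.PostUnsealFailed, nil
}

func main() {
	// Stand-in server that answers the way this PR describes; not a real Vault.
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(statusPostUnsealFailed)
		io.WriteString(w, `{"post_unseal_failed": true}`)
	}))
	defer srv.Close()

	resp, err := http.Get(srv.URL + "/v1/sys/health")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		panic(err)
	}
	failed, err := needsAttention(resp.StatusCode, body)
	if err != nil {
		panic(err)
	}
	fmt.Println(resp.StatusCode, failed)
}
```

A monitoring check built this way can alert on the failed post-unseal state separately from a sealed or standby node.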

It also improves the resiliency of the post-unseal process by not sealing Vault when an error happens while accessing the storage backend. Those errors will just be logged, the leadership will be relinquished, and sys/health will return 523, but the node will try again to take the leadership and resume operations if the underlying issue with the storage backend has been resolved.

This is a compromise between keeping Vault alive despite small network issues and still clearly reporting an error to the operators in case it requires manual intervention.

Other errors, like failing to decrypt or decode the keyring, will continue to seal Vault as before.

Related to #10552
Related to #3896


heatherezell · 2024-03-21T22:42:03Z

Tagging @schavis for docs review <3

biazmoreira · 2024-05-21T08:58:53Z

Hi, @remilapeyre! Thanks for the PR. Would you be able to fix the conflicting files so we can review this? Thanks!

heatherezell · 2024-07-02T17:36:14Z

Quick ping here! @remilapeyre if you can resolve the merge conflicts, I'll push this in front of an engineer. Thanks!

remilapeyre force-pushed the unseal-shutdown branch from 554b260 to a417b3d Compare March 6, 2023 00:33

remilapeyre marked this pull request as ready for review March 6, 2023 00:41

remilapeyre requested a review from a team March 6, 2023 00:41

remilapeyre requested a review from yhyakuna as a code owner March 6, 2023 00:41

sgmiller added the core/seal label Feb 29, 2024

sgmiller requested review from a team February 29, 2024 18:56

heatherezell added the bug Used to indicate a potential bug label Mar 21, 2024

heatherezell requested a review from schavis March 21, 2024 22:41

biazmoreira added the waiting-for-response label May 21, 2024
