Backend failure during the re-election process causes Vault cluster to become unusable. #4491
Comments
You may want to try more recent versions of Vault that have updated zk libraries to see if any of those fixes help this issue.
If the error was ephemeral, it seems like the zk library or the zk physical implementation should be handling retries there. Additionally, the logic in core has changed somewhat in more recent versions of Vault, so I don't know if that would make a difference. Vault is ignoring any error coming from the Unlock call, and while I don't know the genesis of that, my guess is that it's unclear what the right path is if an error is returned.
This can definitely happen and
If you have a proposal for addressing this I'm happy to hear it, but I don't think having Vault spin on the Unlock call is the right answer. Maybe for a few retries, but if you're trying to shut down Vault, having it hang for potentially long periods of time isn't great.
@jefferai I agree that repeating the unlock an unbounded number of times wouldn't be acceptable. I'd propose letting each backend handle potential retries. To be more clear, I'd like to replace not looking at
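As a rough illustration of the bounded-retry idea discussed here, below is a minimal sketch. It is not Vault code; the Lock interface only mirrors the Unlock part of Vault's physical lock interface, and the attempt count and backoff are arbitrary.

```go
package lockutil

import (
	"fmt"
	"time"
)

// Lock mirrors only the Unlock part of Vault's physical lock interface.
type Lock interface {
	Unlock() error
}

// unlockWithRetry retries Unlock a small, bounded number of times with a
// short backoff: a transient backend hiccup gets another chance to release
// the lock, but shutdown or step-down can never hang on it indefinitely.
func unlockWithRetry(l Lock, attempts int, backoff time.Duration) error {
	var err error
	for i := 0; i < attempts; i++ {
		if err = l.Unlock(); err == nil {
			return nil
		}
		time.Sleep(backoff)
	}
	return fmt.Errorf("unlock failed after %d attempts: %v", attempts, err)
}
```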
It's more complicated than that -- we can't simply os.Exit because there are important cleanup operations that can happen during normal shutdown. I think instead it would have to trigger shutdown and pass along an appropriate error. However, I also don't quite understand how this would help. If the lock is stuck, why does restarting Vault matter, instead of simply trying to create a new one from the host in a running Vault? It sounds like in your scenario the lock is able to be grabbed again by the same host, but then it shouldn't require a restart of Vault to do so.
I wasn't clear about that in my comment. This would help in the ZK backend case; I am not familiar with other backends' implementations. A ZK lock is implemented by creating a ZK ephemeral node. While a ZK client is connected, ZK keeps the node around until the client disconnects. However, there is a SessionTimeout variable that keeps the ephemeral node around across short client disconnects. This is what is happening in the case that @vespian described. Vault's ZK client loses its connection to ZK and tries to remove the lock. Lock removal fails. Vault ignores the error and steps down as leader. However, if the underlying ZK client in Vault reconnects to ZK within the SessionTimeout, the node will not get removed. Here, restarting Vault would give the ZK client a new session, the old session would expire, and the leader election node would get removed.
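To make the session-timeout behaviour described above concrete, here is a small, self-contained sketch using the samuel/go-zookeeper client (the library Vault's ZK backend builds on). The server address and node path are assumptions for illustration only and do not reflect Vault's actual lock layout.

```go
package main

import (
	"fmt"
	"log"
	"time"

	"github.com/samuel/go-zookeeper/zk"
)

func main() {
	servers := []string{"127.0.0.1:2181"} // assumed local ZooKeeper, for illustration only

	// The session timeout is how long ZooKeeper keeps a session (and its
	// ephemeral nodes) alive after the client stops responding; a reconnect
	// within that window resumes the same session and keeps the nodes.
	owner, _, err := zk.Connect(servers, 10*time.Second)
	if err != nil {
		log.Fatal(err)
	}

	// Model the leader-election lock as an ephemeral node (hypothetical path).
	const lockPath = "/vault-demo-lock"
	if _, err := owner.Create(lockPath, []byte("leader"), zk.FlagEphemeral, zk.WorldACL(zk.PermAll)); err != nil {
		log.Fatal(err)
	}

	observer, _, err := zk.Connect(servers, 10*time.Second)
	if err != nil {
		log.Fatal(err)
	}
	defer observer.Close()

	// While the owner's session is alive, everyone else still sees the lock,
	// even if the owner has logically stepped down.
	held, _, _ := observer.Exists(lockPath)
	fmt.Println("lock present while owner session is alive:", held)

	// Ending the owner's session (here via an explicit Close, in the bug report
	// via a Vault restart) makes ZooKeeper delete its ephemeral nodes.
	owner.Close()
	time.Sleep(2 * time.Second)
	held, _, _ = observer.Exists(lockPath)
	fmt.Println("lock present after owner session ended:", held)
}
```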
It feels like the better option is to explicitly close the connection after a step down (perhaps only in the case that the lock fails). Forcing a restart of Vault also means forcing a re-unseal of Vault... it's a very big hammer.
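A minimal sketch of that direction, assuming a simplified lock type (the type and field names are illustrative, and this is not necessarily how #4569 implements it): when releasing the lock fails, end the ZK session explicitly so the ephemeral lock node is cleaned up without restarting and re-unsealing Vault.

```go
package zkbackend

import "github.com/samuel/go-zookeeper/zk"

// zkHALock is a simplified stand-in for the backend's lock type; the real
// implementation in Vault's ZK backend differs.
type zkHALock struct {
	conn *zk.Conn
	lock *zk.Lock
}

// unlockOrResetSession tries to release the leader lock; if that fails, it
// closes the connection so the ZK session ends and ZooKeeper removes the
// ephemeral lock node, instead of leaving it behind until an operator
// restarts (and has to re-unseal) Vault.
func (l *zkHALock) unlockOrResetSession() error {
	err := l.lock.Unlock()
	if err != nil {
		l.conn.Close()
	}
	return err
}
```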
Proposed solution in #4569 |
Closed by #4569 |
Environment:
Vault Config File:
Startup Log Output:
Expected Behavior:
Zookeeper/backend issues during leader re-election do not cause the Vault cluster to become unreachable.
Actual Behavior:
Please see the logs from the moment of the failure and immediately after:
Steps to Reproduce:
I believe that the problem occurs when a ZK failure happens right after the new leader has been advertised to the Vault cluster but before the leader acquisition process has finished [1].
[1] https://github.com/hashicorp/vault/blob/v0.8.3/vault/core.go#L1583-L1605
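For readers who do not want to follow the link, here is a heavily simplified paraphrase of the window in question (approximate, not the verbatim v0.8.3 source): post-unseal setup fails, the node steps down, and the error returned by Unlock is discarded, so a still-failing backend leaves the leader lock behind.

```go
package corestepdown

import "log"

// Lock mirrors the Unlock part of Vault's physical lock interface.
type Lock interface {
	Unlock() error
}

// stepDownAfterFailedPostUnseal paraphrases the step-down path around the
// linked lines: when post-unseal setup fails, the node logs the error and
// releases the leader lock, ignoring any error from Unlock. If the backend
// (e.g. ZooKeeper) is still unreachable at this point, the ephemeral lock
// node survives and blocks re-election until the session expires.
func stepDownAfterFailedPostUnseal(lock Lock, postUnsealErr error, logger *log.Logger) {
	if postUnsealErr == nil {
		return // post-unseal succeeded; the node stays active
	}
	logger.Printf("core: post-unseal setup failed: %v", postUnsealErr)
	_ = lock.Unlock() // error intentionally ignored, mirroring the linked code
}
```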
Possible root cause:
My theory is that the ZK failure lasted long enough that the subsequent lock.Unlock() in the c.postUnseal() error handling block could not execute successfully, but not long enough to expire the ephemeral nodes created by the ZK lock, so the other Vault instances were not allowed to perform re-election because the leader lock was still there.
Is this something well known? If not, could you please advise whether the theory/idea for the root cause of this bug is correct and/or whether this is something that could/should be fixed in the vault/core.go module itself? Thanks in advance.