Add DEP-06: Immutable ETCD Backups #884
--------- Co-authored-by: Saketh Kalaga <[email protected]>
DEP-06: Immutable ETCD Backups
- Implement immutable backup support for ETCD clusters.
- Secure backup data against unintended or unauthorized modifications after creation.
- Ensure backups are consistently available and intact for restoration purposes.
> To make backups consistently available

That's the job of the storage provider, so I think it's wrong to mention it here.
Instead, you could mention enhancing the garbage collection of backup-restore to work with immutable backups.
I also feel point 3 is not something that we will be doing.
Instead, we can mention how we manage the lifecycle of these backups.
The other points also don't make much sense. Only the second point is a goal; implementing immutable backup support is a way to achieve that goal.
Only keeping point 2.
## Glossary

- **ETCD:** A distributed key-value store used as the backing store for Kubernetes.
- **Compaction Job:** A process that compacts ETCD snapshots to reduce storage size and improve performance.
Maybe you can link the snapshot compaction DEP here: https://github.com/gardener/etcd-druid/blob/master/docs/proposals/02-snapshot-compaction.md
- **Type:** Duration
- **Default:** `24h`
- **Description:** This flag sets the period after which a compaction job should be triggered for a hibernated ETCD cluster, based on the time since the last renewal of the full snapshot lease. If the time since `fullLease.Spec.RenewTime.Time` exceeds the duration specified by this flag, and `etcd.spec.replicas` is `0` (indicating hibernation), the compaction job will automatically trigger to create a new snapshot. This approach ensures that backups remain within the immutability period and are safeguarded against becoming mutable.
Suggested change:
- **Description:** This flag sets the period after which a compaction job should be triggered for a hibernated ETCD cluster, based on the time since the last renewal of the full snapshot lease. If the time since `fullLease.Spec.RenewTime.Time` exceeds the duration specified by this flag, and `etcd.spec.replicas` is `0` (indicating hibernation), the compaction job will automatically trigger to create a new snapshot. This approach ensures that at least one full snapshot remains within the immutability period and is safeguarded against becoming mutable.
Co-authored-by: Ishan Tyagi <[email protected]>
Thanks @seshachalam-yv @ishan16696 @renormalize for the proposal.
It captures things well, but I've raised some open points, especially on the structure and on some details where it addresses design considerations.
### Excluding Snapshots Under Specific Circumstances

Given that immutable backups cannot be deleted until the immutability period expires, there are scenarios, such as corrupted snapshots or other anomalies, where certain snapshots must be skipped during the restoration process. To facilitate this:
This can happen outside of immutable backup scenarios as well, so how is it handled there? I'm guessing currently by manually deleting the affected snapshots.
With this new approach, the same mechanism should apply there as well.
If snapshots are mutable, this is achieved by deleting the snapshots.
The same functionality will be achieved through custom metadata tags. Will enhance the doc for this.
- **Custom Metadata Tags:** Utilize custom metadata to mark specific objects (snapshots) that should be bypassed. To exclude a snapshot from the restoration process, attach custom metadata to it with the key `x-etcd-snapshot-exclude` and value `true`. This method is officially supported, as demonstrated in the [etcd-backup-restore PR](https://github.com/gardener/etcd-backup-restore/pull/776).
It's not clear from the doc who attaches the custom metadata tag and how it is consumed. Can we describe that here to avoid any unintended interpretation of the flow?
Human operators add these tags; will include this.
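A minimal sketch of how a restore path could honor the exclusion tag discussed in this thread. Only the key `x-etcd-snapshot-exclude` and the value `true` come from the proposal; the `Snapshot` type, the `restorable` helper, and the exact-match string comparison are hypothetical and are not taken from etcd-backup-restore.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Metadata key and value named in the proposal.
EXCLUDE_KEY = "x-etcd-snapshot-exclude"
EXCLUDE_VALUE = "true"


@dataclass
class Snapshot:
    """Hypothetical stand-in for a snapshot object in the backup bucket."""
    name: str
    metadata: Dict[str, str] = field(default_factory=dict)


def restorable(snapshots: List[Snapshot]) -> List[Snapshot]:
    """Return the snapshots that are not tagged for exclusion.

    Assumption: a snapshot is skipped only when the tag value is exactly "true".
    """
    return [s for s in snapshots if s.metadata.get(EXCLUDE_KEY) != EXCLUDE_VALUE]
```

In this sketch an operator who finds a corrupted snapshot would tag the object in the bucket, and the restore logic would simply never see it in its candidate list.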
## Implementation Steps

1. **Enhance the Compaction Job:**
I think we should create a new name for the job for hibernated full snapshots, and ensure that its flags and flow can leverage the existing compaction feature and extend it with the additional changes required for immutable backup snapshotting and garbage collection.
Also, in practical terms we cannot have a compaction job for a hibernated cluster, so it would be even more confusing to see a compaction job running for one.
- Configure buckets with appropriate immutability settings before deploying ETCD clusters.
- Ensure that the immutability periods align with organizational policies.

- **Compaction Job Configuration:**
What is the retry threshold for this job?
What happens if it fails to run for a period of 24 hours?
What happens if druid is down?
What happens when druid comes back up, especially for failed jobs that have breached the retry threshold?
What happens if we breach the bucket retention period? Is it possible that no data is available to restore when a hibernated cluster wakes up?
Does garbage collection run independently, or only in sequence after the job takes a full snapshot on its run?
* The operator task framework is used to enhance the proposal in the approach which re-uploads the latest full snapshot to prolong the immutability. --------- Co-authored-by: Seshachalam Yerasala Venkata <[email protected]>
Thanks for the well-written DEP.
I have some suggestions on structure, naming, and usage. Please have a look.
#### ETCD Backup Configuration

Operators must ensure that the ETCD backup configuration aligns with the immutability requirements, including setting appropriate immutability periods.
Can we elaborate here on the ETCD backup configuration, or on the shoot providerConfig changes for the backup, that we propose to bring in as part of this feature?
Also, maybe mention how the same should be passed in for standalone druid usage.
This is important because after this section we suddenly jump into the handling of hibernated clusters.
Yes, we have referred to the links for configuring immutable backup buckets for both standalone and Gardener cases. We have removed the compaction approach, since both the compaction and re-uploading approaches use the operator framework and take full snapshots in the same way. The only difference is that the compaction approach starts the embedded etcd and compacts. Re-uploading is the more appropriate approach, so the compaction approach was redundant and has been removed.
#### Handling of Hibernated Clusters

When an ETCD cluster is hibernated for a duration exceeding the immutability period, backups may become mutable again (this behavior depends on the cloud provider; refer to [Comparison of Storage Provider Properties](#comparison-of-storage-provider-properties-for-bucket-level-and-object-level-immutability)), compromising the intended immutability guarantees.
> this behavior depends on the cloud provider; refer to Comparison of Storage Provider Properties

Does the preceding statement, "backups become mutable again", vary by cloud provider?
Yes. The behavior of objects after expiry depends on the cloud provider.
Such handling of hibernated clusters is the type of scenario which the etcd operator-tasks frameworks lends itself to quite well, and thus for all proposed solutions, the operator tasks framework as defined [here](./05-etcd-operator-tasks.md) will be made use of for the designs of the solutions.
Suggested change:
Such handling of hibernated clusters is the type of scenario which the etcd operator-tasks framework lends itself to quite well, and thus for all proposed solutions, the operator tasks framework as defined [here](./05-etcd-operator-tasks.md) will be made use of for the design of the solution.
**Proposed Solution:**

Utilize the compaction job to periodically take fresh snapshots during hibernation. Introduce a new flag `--hibernation-snapshot-interval` to the compaction controller. This flag sets the interval after which a compaction job should be triggered for a hibernated ETCD cluster, based on the time elapsed since `fullLease.Spec.RenewTime.Time` and if `etcd.spec.replicas` is `0` (indicating hibernation). The compaction job uses the [compact command](https://github.com/gardener/etcd-backup-restore/blob/master/cmd/compact.go) to create a new snapshot.
Shouldn't this be derived information, based on the last full snapshot time and `etcd.spec.replicas` being `0`?
We already have this information, so why not use that?
We have removed the compaction approach, as mentioned here.
In any case, the controller periodically creates the `ExtendEtcdSnapshotImmutabilityTask` if `etcd.spec.backup.store.immutability.retentionType` is set to `"Bucket"`, based on `etcd.spec.backup.fullSnapshotSchedule`.
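A minimal sketch of the kind of derived check discussed in this thread, combining the retention type, the replica count, and the time since the last full snapshot. The function name and parameters are hypothetical, and a fixed interval stands in for the cron-style `fullSnapshotSchedule`; this is not etcd-druid's controller logic.

```python
from datetime import datetime, timedelta
from typing import Optional


def should_extend_immutability(
    retention_type: Optional[str],
    replicas: int,
    last_full_snapshot: datetime,
    now: datetime,
    interval: timedelta,
) -> bool:
    """Decide whether a new immutability-extension task is due.

    Assumptions: "Bucket" is the only retention type that needs extension,
    and zero replicas is treated as the signal for hibernation.
    """
    if retention_type != "Bucket":
        return False
    if replicas != 0:
        return False
    # Due only once the elapsed time strictly exceeds the configured interval.
    return now - last_full_snapshot > interval
```

Because every input is already available on the `Etcd` resource and the full snapshot lease, a check of this shape needs no extra state, which is the point the reviewer raises.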
- Introduce a new flag:
  - **Flag:** `--hibernation-snapshot-interval`
  - **Type:** Duration
  - **Default:** `24h`
If someone sets this value to more than 24h, say 72h, won't that already break the immutability contract? I think we should not expose this internal detail as configuration unless you have a concrete case where it is required.
It's the responsibility of the operator who sets up etcd-druid to configure this flag correctly.
For example, if the bucket is configured to be immutable for 15 days, an operator triggering snapshots every 3 days is fine and a new snapshot every day is unnecessary, so in my opinion this can be left configurable.
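The constraint both comments rely on can be written as a small validation: the snapshot interval only preserves immutability if it is shorter than the bucket's immutability period. This is a sketch under that assumption; `interval_is_safe` is a hypothetical helper, not an existing etcd-druid check.

```python
from datetime import timedelta


def interval_is_safe(snapshot_interval: timedelta, immutability_period: timedelta) -> bool:
    """Return True when a fresh snapshot is always taken before the previous one expires.

    Assumption: the interval must be positive and strictly shorter than the
    bucket's immutability period.
    """
    return timedelta(0) < snapshot_interval < immutability_period
```

With the numbers from the thread, a 3-day interval against a 15-day bucket policy passes, while a 72h interval against a 24h policy would not.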
- The controller scales in the ETCD cluster (i.e., sets `StatefulSet.spec.replicas` to zero).
- The controller creates the `EtcdSnapshotImmutabilityExtension` periodically if `etcd.spec.backup.store.immutableSettings.retentionType` is set to `"Bucket"`.

- **`EtcdSnapshotImmutabilityExtension` specification:**
Suggested change:
- **`ExtendEtcdSnapshotImmutability` specification:**
Addressed
- **Backward Compatibility:**
  - Existing clusters without immutable buckets will continue to function without change.
  - The introduction of the `EtcdSnapshotImmutabilityExtension` does not affect clusters that are not hibernated.
Suggested change:
- The introduction of the `ExtendEtcdSnapshotImmutability` does not affect clusters that are not hibernated.
Addressed
This functionality is needed since it would be necessary to garbage collect the (identical final) snapshots that are (re)uploaded in order to ensure that there is always a snapshot which is immutable.
- **Update `Etcd` CRD:**
  - Add `etcd.spec.hibernation`:
    Since there are situations outside of hibernation where the number of replicas of the statefulset would have to be scaled to zero, there needs to be an explicit way in which it is conveyed to etcd-druid that the etcd cluster is being hibernated. This can be achieved by extending the `Etcd` CRD by including a new field in the `spec` called `hibernated`.
It is not clear how the controller will adopt existing `Etcd` resources that already have a hibernation schedule in place, especially in the context of Gardener usage.
We have removed the hibernation section (support for specifying an intent for hibernation) and listed it as a non-goal, since this will be handled in a different DEP, #922.
###### Disadvantages

- **Additional Complexity:** Requires updates to the etcd controller, introduction of the operator-tasks controller, and introduction of new etcdbrctl commands.
Why do we see this as additional complexity? Weren't we planning to implement an operator-task controller anyway?
It is additional complexity since it is a hard prerequisite, but you are right. This is the only way forward; will remove this.
- **Resource Consumption:** Starting an embedded ETCD instance periodically consumes resources.

##### Approach 2: Re-upload of the latest snapshot
We call this approach "Re-upload of the latest snapshot" here, while the conclusion section calls it "Copy backup task". Can we use one name for the approach?
Yeah, this was a miss while renaming the approach to "re-upload of the latest"
Suggested change:
Since there are situations outside of hibernation where the number of replicas of the statefulset would have to be scaled to zero, there needs to be an explicit way in which it is conveyed to etcd-druid that the etcd cluster is being hibernated. This can be achieved by extending the `Etcd` CRD by including a new field in the `spec` called `hibernation`.

The field is called `hibernation` in some places and `hibernated` in others. Can you check which one you intended and correct the rest to match?
Thanks for your review. Will do this.
- Add `immutableSettings.retentionType` under `etcd.spec.backup.store`.
- **ETCD Controller Logic:**
  - When hibernation is requested, by changing `etcd.spec.hibernated.enabled` to `true`:
Suggested change:
- When hibernation is requested, by changing `etcd.spec.hibernation.enabled` to `true`:
@renormalize @seshachalam-yv thanks a lot for the wonderful proposal, and for detailing out ways to make backups immutable. I have a few comments/questions. PTAL, thanks!
- [etcd-backup-restore PR #776](https://github.com/gardener/etcd-backup-restore/pull/776)
- [EtcdCopyBackupsTask Implementation](https://github.com/gardener/etcd-druid/pull/544)

## Glossary
I would prefer the Glossary at the beginning, after the Summary section, so that a new reader can read through the entire document sequentially without having to jump between sections.
Addressed.
We have moved it after the Summary section and renamed it to Terminology.
### Overview

We propose introducing immutability in backup storage by leveraging cloud provider features that support a write-once-read-many (WORM) model. This approach will prevent data alterations post-creation, enhancing data integrity and security.
As an example of the previous comment: you mention the WORM model here but have not yet defined it anywhere, so the reader has to scroll down to the glossary and then come back to this section. This seems slightly non-intuitive.
Done
@@ -0,0 +1,315 @@
---
title: Immutable ETCD Backups
I recently learned that etcd must always be written in lowercase, i.e. `etcd` and not `ETCD`. There are apparently certain conventions around the way it is pronounced as well. Can we change all occurrences of `ETCD` to `etcd`, if that's possible?
Yes. I've already started doing so. I've been a proponent of this since the start.
For future reference, could you link the resource where you read these guidelines for me and the other maintainers as well? Thanks.
This proposal aims to enhance the reliability and integrity of ETCD backups created by `etcd-backup-restore` in ETCD clusters managed by `etcd-druid`, by introducing immutable backups. By leveraging cloud provider features that support a write-once-read-many (WORM) model, unauthorized modifications to backup data are prevented, ensuring that backups remain intact and accessible for restoration.

The proposed solution relies on `etcd-druid` to manage ETCD backups and handle hibernation processes effectively. It leverages one of the suggested approaches to ensure backups remain immutable over extended periods. It is important to note that using `etcd-backup-restore` standalone may not be sufficient to achieve this functionality end-to-end, as the immutability handling (with respect to hibernation) is specifically managed within `etcd-druid`.
"Hibernation processes" doesn't seem to have been defined yet, and the term will be difficult for a non-Gardener reader to understand. I see that it's in the glossary, so maybe you can link to that section or give some background here.
It's usually recommended to use the term "the authors" rather than "we", referring to the authors in the third person. Can you please make that change everywhere in the proposal?
- **Failed Snapshot Before Hibernation:**

  - **Risk:** Failure to take a full snapshot before hibernation could delay the hibernation process.
  - **Mitigation:** Implement robust error handling and retries. Notify operators of failures to take corrective action.
What is the behavior if the full snapshot fails after repeated retries? Will this block etcd cluster scale-down indefinitely? Or would you propose to allow scale-down, but update `Etcd.status.lastErrors` and have an operator look into it? In the second case, the extend-immutability approach might not work, since it only copies the latest full snapshot and not the latest set of snapshots. Would that need to be changed then?
> What is the behavior if full snapshot fails after repeated retries? Will this infinitely block etcd cluster scale-down?

Yes.
- **Review Retention Policies:**

  - Set `maxBackups` and `maxBackupAge` in the `EtcdCopyBackupsTask` to manage storage utilization effectively.
These might be insufficient. I would instead propose introducing a new flag, such as `latest-snapshot-set-only`, to tell the copy command to copy over only the latest set of snapshots, regardless of snapshot age or number of snapshots.
As mentioned in #884 (comment), we are not going to use the copy command.
#### Considerations for Object-Level Immutability

Using object-level immutability provides flexibility in scenarios where certain backups require different immutability periods. However, current limitations and complexities make it less practical for immediate implementation.
Can you please elaborate on these possible scenarios? Would be good to have an idea of what you have already thought of as a use case for object-level immutability, and why you feel it's less practical to support such use cases with object-level immutability.
With object-level immutability, all you need is to extend the immutability of the latest full snapshot. You do not really need to copy the full snapshot; that is a hacky solution currently put in place because we are not going directly to an object-level immutability policy.
#### Conclusion

Given the complexities and limitations, we recommend using bucket-level immutability in conjunction with the `EtcdCopyBackupsTask` approach (Approach 2) to manage immutability during hibernation effectively. This approach provides a balance between operational simplicity and meeting immutability requirements. The compaction job approach (Approach 1) is also viable but may introduce more resource consumption and operational overhead.
You can make the recommendation in bold, so that it stands out as your final expert opinion ;)
##### Comparison of Storage Provider Properties for Bucket-Level and Object-Level Immutability |
Thanks a lot for this comparison section!
- Improve readability and clarity of the summary and motivation sections.
- Add detailed terminology definitions.
- Refine the proposal to focus on bucket-level immutability.
## Summary

Currently, `etcd-druid` can provision etcd clusters and manage their lifecycle. Additionally, it enables regular backups of the etcd cluster state through the sidecar container `etcd-backup-restore`, which is deployed in each etcd pod running a member of the etcd cluster. This functionality is activated when `spec.backup` is enabled with appropriate values for an etcd cluster.
Suggested change:
`etcd-druid` provisions etcd clusters and manages their lifecycle. For every etcd cluster, consumers can enable periodic backups of the cluster state by configuring the `spec.backup` section in an Etcd custom resource. Periodic backups are taken via the `etcd-backup-restore` sidecar container that runs in each etcd member pod.
All actors (with sufficient privileges) in the cluster where `etcd-druid` is deployed, and in the etcd clusters it provisions, have access to the `Secret` that holds the credentials used to upload snapshots of the etcd cluster state. These credentials are used by system actors and human operators—typically to perform various maintenance and recovery operations. |
This paragraph is not required. The intent is to protect the integrity of backups for a defined period of time once they are uploaded, as they represent the etcd cluster state at a given time and enable restoration. This can be done in many different ways, so talking about one particular Secret only will not be complete. But then again, your intent here is not to list all possible attack vectors that can compromise the backups, and therefore I would avoid mentioning this here altogether.
To prevent erroneous operations by human operators during maintenance and recovery, or by misbehaving actors in the cluster - which could potentially lead to an unrecoverable restoration failure, the authors propose using write-once-read-many ([WORM](#terminology)) features offered by various cloud providers where available. |
Suggested change:
Periodic backups of an etcd cluster state ensure the ability to recover from a complete quorum loss, enhancing reliability and fault tolerance. It is crucial that these backups, which are vital for restoring the etcd cluster, remain protected from any form of tampering, whether intentional or accidental. To safeguard the integrity of these backups, the authors recommend utilizing WORM (write-once-read-many) protection, a feature offered by various cloud providers, to ensure the backups remain immutable and secure.
This [WORM](#terminology) model will enhance the reliability and integrity of etcd cluster state backups created by `etcd-backup-restore` in etcd clusters managed and operated by `etcd-druid` by ensuring that the backups are [*immutable*](#terminology) for a specific period from the time they are uploaded, thereby preventing any unintended modifications. |
This para is not required if you are ok with the change suggested in the above para.
This [WORM](#terminology) model will enhance the reliability and integrity of etcd cluster state backups created by `etcd-backup-restore` in etcd clusters managed and operated by `etcd-druid` by ensuring that the backups are [*immutable*](#terminology) for a specific period from the time they are uploaded, thereby preventing any unintended modifications. | ||
`etcd-druid` and `etcd-backup-restore` will be enhanced to achieve the same functionality currently achieved by modifying or deleting backups, but without actually modifying or deleting these backups, since they will now be immutable for a set duration. This approach eliminates the possibility of potential data loss. `etcd-druid` will provide an end-to-end solution for achieving this functionality, as relying solely on `etcd-backup-restore` is insufficient given the scope and possible approaches to achieving this. |
This para is not at all clear to me.
#### Considerations for Object-Level Immutability | ||
Using object-level immutability provides flexibility in scenarios where certain backups require different immutability periods. However, current limitations and complexities make it less practical for immediate implementation. |
With object-level immutability, all you need is to extend the immutability of the latest full snapshot. You do not really need to copy the full snapshot, which is currently a hacky solution put in place because we are not going directly to an object-level immutability policy.
Using object-level immutability provides flexibility in scenarios where certain backups require different immutability periods. However, current limitations and complexities make it less practical for immediate implementation. | ||
- Enabling object-level immutability requires bucket-level immutability to be set first (applicable in S3 and ABS). In GCS, the capability to enable object-level immutability on an existing bucket is not available. |
Can you please also mention the plan/date by which GCS is going to enable object-level immutability? If there is an open issue, please provide a link here.
**Disadvantages:** | ||
- **Provider Limitations:** Enabling object-level immutability on existing buckets is not universally supported. |
Can you elaborate on this point a bit more? What is not universal?
- **Increased Complexity:** Requires additional logic in backup processes and tooling. |
What is the increase in complexity as compared to what we are already introducing for bucket level immutability policies?
- **Prerequisites:** Some providers require bucket-level immutability to be set first. |
Why is this a con?
Thanks for the PR @seshachalam-yv and @renormalize. I have added comments. |
How to categorize this PR?
/area backup
/area disaster-recovery
/area security
/area compliance
/area storage
/kind enhancement
What this PR does / why we need it:
This PR adds DEP-06: Immutable ETCD Backups. The proposal aims to enhance the reliability and integrity of ETCD backups in ETCD Druid by introducing immutable backups. By leveraging cloud provider features that support a write-once-read-many (WORM) model, this approach prevents unauthorized modifications to backup data, ensuring that backups remain available and intact for restoration.
Which issue(s) this PR fixes:
Fixes #
Special notes for your reviewer:
Release note: