Add DEP-06: Immutable ETCD Backups #884

Open

seshachalam-yv wants to merge 7 commits into gardener:master from seshachalam-yv:feature/dep-06-immutable-etcd-backups

Contributor

seshachalam-yv commented Oct 1, 2024

How to categorize this PR?

/area backup
/area disaster-recovery
/area security
/area compliance
/area storage
/kind enhancement

What this PR does / why we need it:
This PR adds DEP-06: Immutable ETCD Backups. The proposal aims to enhance the reliability and integrity of ETCD backups in ETCD Druid by introducing immutable backups. By leveraging cloud provider features that support a write-once-read-many (WORM) model, this approach prevents unauthorized modifications to backup data, ensuring that backups remain available and intact for restoration.

Which issue(s) this PR fixes:
Fixes #

Special notes for your reviewer:

Release note:

Add DEP-06: Immutable ETCD Backups


          Add DEP-06: Immutable ETCD Backups

34718cc

---------

Co-authored-by: Saketh Kalaga <[email protected]>

seshachalam-yv requested a review from a team as a code owner

October 1, 2024 11:46

gardener-robot added needs/review area/backup area/compliance area/disaster-recovery area/security area/storage kind/enhancement size/m labels

gardener-robot-ci-1 added reviewed/ok-to-test needs/ok-to-test and removed reviewed/ok-to-test labels

seshachalam-yv changed the title ~~Add DEP-06: Immutable ETCD Backups~~ Add DEP-06: Immutable ETCD Backups

anveshreddy18 self-assigned this

ishan16696 requested changes

docs/proposals/06-immutable-etcd-backups.md Outdated

+              - Implement immutable backup support for ETCD clusters.
+              - Secure backup data against unintended or unauthorized modifications after creation.
+              - Ensure backups are consistently available and intact for restoration purposes.

Member

ishan16696 Oct 9, 2024

Keeping backups consistently available is the job of the storage provider, so I think it's wrong to mention this here.
Instead, you could mention enhancing the garbage collection of backup-restore to work with immutable backups.

Collaborator

ashwani2k Oct 10, 2024

I also feel that point 3 is not something we will be doing.
Instead, we can mention how we manage the lifecycle of these backups.

Member

renormalize Nov 27, 2024

The other points don't make much sense as goals either. Only the second point is a goal; implementing immutable backup support is a way to achieve it.
Keeping only point 2.

docs/proposals/06-immutable-etcd-backups.md Outdated

+              ## Glossary
+              - **ETCD:** A distributed key-value store used as the backing store for Kubernetes.
+              - **Compaction Job:** A process that compacts ETCD snapshots to reduce storage size and improve performance.

Member

ishan16696 Oct 9, 2024

Maybe you can link the snapshot compaction DEP here: https://github.com/gardener/etcd-druid/blob/master/docs/proposals/02-snapshot-compaction.md

docs/proposals/06-immutable-etcd-backups.md Outdated

+              - **Type:** Duration
+              - **Default:** `24h`
+              - **Description:** This flag sets the period after which a compaction job should be triggered for a hibernated ETCD cluster, based on the time since the last renewal of the full snapshot lease. If the time since `fullLease.Spec.RenewTime.Time` exceeds the duration specified by this flag, and `etcd.spec.replicas` is `0` (indicating hibernation), the compaction job will automatically trigger to create a new snapshot. This approach ensures that backups remain within the immutability period and are safeguarded against becoming mutable.

Member

ishan16696 Oct 9, 2024

Suggested change

      
-            - **Description:** This flag sets the period after which a compaction job should be triggered for a hibernated ETCD cluster, based on the time since the last renewal of the full snapshot lease. If the time since `fullLease.Spec.RenewTime.Time` exceeds the duration specified by this flag, and `etcd.spec.replicas` is `0` (indicating hibernation), the compaction job will automatically trigger to create a new snapshot. This approach ensures that backups remain within the immutability period and are safeguarded against becoming mutable.
+            - **Description:** This flag sets the period after which a compaction job should be triggered for a hibernated ETCD cluster, based on the time since the last renewal of the full snapshot lease. If the time since `fullLease.Spec.RenewTime.Time` exceeds the duration specified by this flag, and `etcd.spec.replicas` is `0` (indicating hibernation), the compaction job will automatically trigger to create a new snapshot. This approach ensures that at least one full snapshot remains within the immutability period and is safeguarded against becoming mutable.
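
The trigger condition in the description above can be sketched as a small predicate. This is a minimal illustration and not etcd-druid code: the function name and its parameters are hypothetical, with `last_renew_time` standing in for `fullLease.Spec.RenewTime.Time`, `replicas` for `etcd.spec.replicas`, and `interval` for the value of `--hibernation-snapshot-interval`.

```python
from datetime import datetime, timedelta, timezone


def should_trigger_hibernation_snapshot(
    last_renew_time: datetime,
    replicas: int,
    interval: timedelta,
    now: datetime,
) -> bool:
    """Return True when a hibernated cluster is due for a fresh full snapshot.

    Hypothetical helper: the real controller logic may differ.
    """
    hibernated = replicas == 0                      # replicas == 0 indicates hibernation
    overdue = (now - last_renew_time) > interval    # lease not renewed within the interval
    return hibernated and overdue


now = datetime(2024, 10, 9, 12, 0, tzinfo=timezone.utc)
interval = timedelta(hours=24)  # the proposed default of the flag

# Hibernated, last renewal 30h ago: a new snapshot is due.
print(should_trigger_hibernation_snapshot(now - timedelta(hours=30), 0, interval, now))
# Running cluster (3 replicas): never triggered by this path.
print(should_trigger_hibernation_snapshot(now - timedelta(hours=30), 3, interval, now))
```

Passing `now` explicitly rather than reading the clock inside the function keeps the condition easy to test.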

gardener-robot added the needs/changes label


          Apply suggestions from @ishan16696 code review

ba411e5

Co-authored-by: Ishan Tyagi <[email protected]>

gardener-robot-ci-3 added reviewed/ok-to-test and removed reviewed/ok-to-test labels

ashwani2k requested changes

Collaborator

ashwani2k left a comment

Thanks @seshachalam-yv @ishan16696 @renormalize for the proposal.
It captures things well, but I've raised some open points, especially on the structure and on some of the details where it addresses design considerations.

docs/proposals/06-immutable-etcd-backups.md Outdated


		### Excluding Snapshots Under Specific Circumstances

		Given that immutable backups cannot be deleted until the immutability period expires, there are scenarios, such as corrupted snapshots or other anomalies, where certain snapshots must be skipped during the restoration process. To facilitate this:

Collaborator

ashwani2k Oct 10, 2024

This can also happen outside of immutable backup scenarios, so how is it handled there? I'm guessing it is currently handled by manually deleting the affected snapshots.
With this new approach, the same mechanism should apply there as well.

Member

renormalize Nov 27, 2024

If snapshots are mutable, this is achieved through deletion of snapshots.
The same functionality will be achieved through custom metadata tags. Will enhance the doc for this.

docs/proposals/06-immutable-etcd-backups.md Outdated


		Given that immutable backups cannot be deleted until the immutability period expires, there are scenarios, such as corrupted snapshots or other anomalies, where certain snapshots must be skipped during the restoration process. To facilitate this:

		- Custom Metadata Tags: Utilize custom metadata to mark specific objects (snapshots) that should be bypassed. To exclude a snapshot from the restoration process, attach custom metadata to it with the key `x-etcd-snapshot-exclude` and value `true`. This method is officially supported, as demonstrated in the [etcd-backup-restore PR](https://github.com/gardener/etcd-backup-restore/pull/776).

Collaborator

ashwani2k Oct 10, 2024

It's not clear from the doc who is responsible for attaching the custom metadata tag and how it is consumed. Can we describe that here to avoid any unintended interpretation of the flow?

Member

renormalize Nov 27, 2024

Human operators add these tags; will include this.
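
To make the flow in this thread concrete: an operator attaches `x-etcd-snapshot-exclude` with the value `true` to an object, and restoration skips that object. The following is a minimal sketch under stated assumptions: snapshots are modelled as in-memory records with a metadata dictionary, `Snapshot` and `restorable_snapshots` are hypothetical names rather than the etcd-backup-restore API, and the exact-match comparison on `true` is an assumption.

```python
from dataclasses import dataclass, field

# Metadata key and value taken from the proposal text.
EXCLUDE_KEY = "x-etcd-snapshot-exclude"
EXCLUDE_VALUE = "true"


@dataclass
class Snapshot:
    """Hypothetical in-memory view of a snapshot object in the bucket."""
    name: str
    metadata: dict = field(default_factory=dict)


def restorable_snapshots(snapshots):
    """Drop every snapshot that an operator has tagged for exclusion."""
    return [s for s in snapshots if s.metadata.get(EXCLUDE_KEY) != EXCLUDE_VALUE]


snapshots = [
    Snapshot("full-0001"),
    Snapshot("incr-0002", {EXCLUDE_KEY: "true"}),  # tagged by an operator as corrupted
    Snapshot("incr-0003"),
]
print([s.name for s in restorable_snapshots(snapshots)])
```

The object itself stays in the bucket until its immutability period expires; only the restoration logic ignores it.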

docs/proposals/06-immutable-etcd-backups.md Outdated


		## Implementation Steps

		1. Enhance the Compaction Job:

Collaborator

ashwani2k Oct 10, 2024

I think we should give the job for hibernated full snapshots a new name, and ensure that its flags and flow leverage the existing compaction feature, enhanced with the additional changes required for immutable backup snapshotting and garbage collection.

Also, in practical terms we cannot have a compaction job for a hibernated cluster, so seeing a compaction job run for a hibernated cluster would be even more confusing.

docs/proposals/06-immutable-etcd-backups.md Outdated

+                - Configure buckets with appropriate immutability settings before deploying ETCD clusters.
+                - Ensure that the immutability periods align with organizational policies.
+              - **Compaction Job Configuration:**

Collaborator

ashwani2k Oct 10, 2024

What is the retry threshold for this job?
What happens if it fails to run for a period of 24 hours?
What happens if druid is down?
What happens when druid comes back up, especially for failed jobs which have breached the retry threshold?
What happens if we breach the bucket retention period? Is it possible that there is no data to restore on wake-up of hibernated clusters?
Does garbage collection run independently, or only in sequence after the job takes a full snapshot on its run?

anveshreddy18 reviewed

ishan16696 assigned ishan16696 and seshachalam-yv

renormalize self-assigned this

renormalize added this to the v0.25.0 milestone


          Enhance the proposal to use the operator task framework

f28db1f

* The operator task framework is used to enhance the proposal in the approach which re-uploads the latest full snapshot to prolong the immutability.

---------

Co-authored-by: Seshachalam Yerasala Venkata <[email protected]>

gardener-robot added the size/l label

gardener-robot added needs/second-opinion and removed size/m labels

gardener-robot-ci-1 added the reviewed/ok-to-test label

gardener-robot-ci-2 removed the reviewed/ok-to-test label

ishan16696 assigned ashwani2k and unmarshall

ashwani2k requested changes

Collaborator

ashwani2k left a comment

Thanks for the well-written DEP.
I have some suggestions on structure, naming, and usage. Please have a look.

docs/proposals/06-immutable-etcd-backups.md Outdated


		#### ETCD Backup Configuration

		Operators must ensure that the ETCD backup configuration aligns with the immutability requirements, including setting appropriate immutability periods.

Collaborator

ashwani2k Nov 25, 2024

Can we elaborate here on the ETCD backup configuration, or on the shoot providerConfig changes for the backup that we propose to bring as part of this feature?
Also, maybe we can mention how the same should be passed for standalone druid usage.

It's important because after this section we suddenly jump into the handling of hibernated clusters.

Contributor Author

seshachalam-yv Dec 5, 2024

Yes, we have referred to the links for configuring immutable backup buckets for both the standalone and Gardener cases. We have removed the compaction approach, since both the compaction and re-uploading approaches use the operator framework and take full snapshots in the same way; the only difference is that compaction starts an embedded etcd and compacts. Re-uploading is the more appropriate approach, which makes the compaction approach redundant.

docs/proposals/06-immutable-etcd-backups.md Outdated


		#### Handling of Hibernated Clusters

		When an ETCD cluster is hibernated for a duration exceeding the immutability period, backups may become mutable again (this behavior depends on the cloud provider; refer to [Comparison of Storage Provider Properties](#comparison-of-storage-provider-properties-for-bucket-level-and-object-level-immutability)), compromising the intended immutability guarantees.

Collaborator

ashwani2k Nov 26, 2024

this behavior depends on the cloud provider; refer to Comparison of Storage Provider Properties

Does the preceding statement, "backups become mutable again", vary by cloud provider?

Member

renormalize Dec 2, 2024

Yes. The behavior of objects after expiry depends on the cloud provider.

docs/proposals/06-immutable-etcd-backups.md Outdated


		When an ETCD cluster is hibernated for a duration exceeding the immutability period, backups may become mutable again (this behavior depends on the cloud provider; refer to [Comparison of Storage Provider Properties](#comparison-of-storage-provider-properties-for-bucket-level-and-object-level-immutability)), compromising the intended immutability guarantees.

		Such handling of hibernated clusters is the type of scenario which the etcd operator-tasks frameworks lends itself to quite well, and thus for all proposed solutions, the operator tasks framework as defined [here](./05-etcd-operator-tasks.md) will be made use of for the designs of the solutions.

Collaborator

ashwani2k Nov 26, 2024

Suggested change

      
-            Such handling of hibernated clusters is the type of scenario which the etcd operator-tasks frameworks lends itself to quite well, and thus for all proposed solutions, the operator tasks framework as defined [here](./05-etcd-operator-tasks.md) will be made use of for the designs of the solutions.
+            Such handling of hibernated clusters is the type of scenario which the etcd operator-tasks frameworks lends itself to quite well, and thus for all proposed solutions, the operator tasks framework as defined [here](./05-etcd-operator-tasks.md) will be made use of for the design of the solution.

docs/proposals/06-immutable-etcd-backups.md Outdated


		Proposed Solution:

		Utilize the compaction job to periodically take fresh snapshots during hibernation. Introduce a new flag `--hibernation-snapshot-interval` to the compaction controller. This flag sets the interval after which a compaction job should be triggered for a hibernated ETCD cluster, based on the time elapsed since `fullLease.Spec.RenewTime.Time` and if `etcd.spec.replicas` is `0` (indicating hibernation). The compaction job uses the [compact command](https://github.com/gardener/etcd-backup-restore/blob/master/cmd/compact.go) to create a new snapshot.

Collaborator

ashwani2k Nov 26, 2024

Shouldn't this be derived information, based on the last full snapshot time and etcd.spec.replicas being 0?
We already have this information, so why not use it?

Contributor Author

seshachalam-yv Dec 5, 2024

We have removed the compaction approach as mentioned here.

In any case, the controller periodically creates the ExtendEtcdSnapshotImmutabilityTask if etcd.spec.backup.store.immutability.retentionType is set to "Bucket", based on etcd.spec.backup.fullSnapshotSchedule.
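
A rough sketch of the condition described in this reply, under stated assumptions: `should_create_extend_task` is a hypothetical helper, `retention_type` stands in for `etcd.spec.backup.store.immutability.retentionType`, and `next_scheduled_run` is assumed to have already been derived from `etcd.spec.backup.fullSnapshotSchedule` (schedule parsing is deliberately left out).

```python
from datetime import datetime, timezone


def should_create_extend_task(
    retention_type: str,
    next_scheduled_run: datetime,
    now: datetime,
) -> bool:
    """Decide whether an ExtendEtcdSnapshotImmutabilityTask is due.

    Hypothetical helper; the actual controller may use different inputs.
    """
    if retention_type != "Bucket":
        # Only bucket-level immutability needs the periodic task.
        return False
    return now >= next_scheduled_run


now = datetime(2024, 12, 5, 0, 0, tzinfo=timezone.utc)
due = datetime(2024, 12, 4, 23, 0, tzinfo=timezone.utc)
print(should_create_extend_task("Bucket", due, now))
print(should_create_extend_task("Object", due, now))
```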

docs/proposals/06-immutable-etcd-backups.md Outdated

+                  - Introduce a new flag:
+                    - **Flag:** `--hibernation-snapshot-interval`
+                      - **Type:** Duration
+                      - **Default:** `24h`

Collaborator

ashwani2k Nov 26, 2024

If someone sets this value to more than 24h, say 72h, won't this already break the contract of immutability? I think we should not expose this internal detail as a config option unless you have a case at hand where this is required.

Member

renormalize Dec 2, 2024

It's the responsibility of the operator who sets up etcd-druid to configure this flag correctly.
For example, if the bucket is configured to be immutable for 15 days, an operator who triggers snapshots every 3 days is fine, and a new snapshot every day is unnecessary. So this can be left configurable, in my opinion.
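
The constraint behind this exchange is that the snapshot interval must stay below the bucket's immutability period. A minimal sketch of such a check follows; `safety_margin` is a hypothetical helper and not an existing etcd-druid validation.

```python
from datetime import timedelta


def safety_margin(immutability_period: timedelta, snapshot_interval: timedelta) -> timedelta:
    """Return how long a snapshot stays immutable after its successor is taken.

    Raises ValueError when the interval would let the newest snapshot
    become mutable before the next one exists.
    """
    if snapshot_interval >= immutability_period:
        raise ValueError("snapshot interval must be shorter than the immutability period")
    return immutability_period - snapshot_interval


# The example from this thread: 15-day immutability, a snapshot every 3 days.
margin = safety_margin(timedelta(days=15), timedelta(days=3))
print(margin.days)  # 12
```

With a check of this kind, a 72h interval is only a problem when the bucket's immutability period is 72h or shorter.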

docs/proposals/06-immutable-etcd-backups.md Outdated

+                  - The controller scales in the ETCD cluster (i.e., sets `StatefulSet.spec.replicas` to zero).
+                  - The controller creates the `EtcdSnapshotImmutabilityExtension` periodically if `etcd.spec.backup.store.immutableSettings.retentionType` is set to `"Bucket"`.
+              - **`EtcdSnapshotImmutabilityExtension` specification:**

Collaborator

ashwani2k Nov 26, 2024

Suggested change

      
-            - **`EtcdSnapshotImmutabilityExtension` specification:**
+            - **`ExtendEtcdSnapshotImmutability` specification:**

Contributor Author

seshachalam-yv Dec 5, 2024

Addressed

docs/proposals/06-immutable-etcd-backups.md Outdated

+              - **Backward Compatibility:**
+                - Existing clusters without immutable buckets will continue to function without change.
+                - The introduction of the `EtcdSnapshotImmutabilityExtension` does not affect clusters that are not hibernated.

Collaborator

ashwani2k Nov 26, 2024

Suggested change

      
-              - The introduction of the `EtcdSnapshotImmutabilityExtension` does not affect clusters that are not hibernated.
+              - The introduction of the `ExtendEtcdSnapshotImmutability` does not affect clusters that are not hibernated.

Contributor Author

seshachalam-yv Dec 5, 2024

Addressed

docs/proposals/06-immutable-etcd-backups.md Outdated

+                  This functionality is needed since it would be necessary to garbage collect the (identical final) snapshots that are (re)uploaded in order to ensure that there is always a snapshot which is immutable.
+              - **Update `Etcd` CRD:**
+                - Add `etcd.spec.hibernation`:
+                  Since there are situations outside of hibernation where the number of replicas of the statefulset would have to be scaled to zero, there needs to be an explicit way in which it is conveyed to etcd-druid that the etcd cluster is being hibernated. This can be achieved by extending the `Etcd` CRD by including a new field in the `spec` called `hibernated`.

Collaborator

ashwani2k Nov 26, 2024

It is not clear how the controller will adopt existing ETCD resources which already have a hibernation schedule in place, especially in the context of Gardener usage.

Contributor Author

seshachalam-yv Dec 5, 2024

We have removed the hibernation section (support for specifying an intent for hibernation) and listed it as a non-goal, since this will be handled by a different DEP, #922.

docs/proposals/06-immutable-etcd-backups.md Outdated


		###### Disadvantages

		- Additional Complexity: Requires updates to the etcd controller, introduction of the operator-tasks controller, and introduction of new etcdbrctl commands.

Collaborator

ashwani2k Nov 26, 2024

Why do we see this as additional complexity? Weren't we planning to implement an operator-task controller anyway?

Member

renormalize Nov 27, 2024

Additional complexity since it is a hard prerequisite; but you are right. This is the only way forward; will remove this.

docs/proposals/06-immutable-etcd-backups.md Outdated


		- Resource Consumption: Starting an embedded ETCD instance periodically consumes resources.

		##### Approach 2: Re-upload of the latest snapshot

Collaborator

ashwani2k Nov 26, 2024

We call this approach "Re-upload of the latest snapshot", while in the conclusion section we call it "Copy backup task". Can we have one naming convention for the approach?

Member

renormalize Nov 27, 2024

Yeah, this was a miss while renaming the approach to "re-upload of the latest"

shreyas-s-rao modified the milestones: v0.25.0, v0.26.0

anveshreddy18 reviewed

docs/proposals/06-immutable-etcd-backups.md Outdated

+                  This functionality is needed since it would be necessary to garbage collect the (identical final) snapshots that are (re)uploaded in order to ensure that there is always a snapshot which is immutable.
+              - **Update `Etcd` CRD:**
+                - Add `etcd.spec.hibernation`:
+                  Since there are situations outside of hibernation where the number of replicas of the statefulset would have to be scaled to zero, there needs to be an explicit way in which it is conveyed to etcd-druid that the etcd cluster is being hibernated. This can be achieved by extending the `Etcd` CRD by including a new field in the `spec` called `hibernated`.

Contributor

anveshreddy18 Nov 28, 2024

Suggested change

      
-                Since there are situations outside of hibernation where the number of replicas of the statefulset would have to be scaled to zero, there needs to be an explicit way in which it is conveyed to etcd-druid that the etcd cluster is being hibernated. This can be achieved by extending the `Etcd` CRD by including a new field in the `spec` called `hibernated`.
+                Since there are situations outside of hibernation where the number of replicas of the statefulset would have to be scaled to zero, there needs to be an explicit way in which it is conveyed to etcd-druid that the etcd cluster is being hibernated. This can be achieved by extending the `Etcd` CRD by including a new field in the `spec` called `hibernation`.

The field is called `hibernation` in some places and `hibernated` in others. Can you check which one you intended and correct the rest accordingly?

Member

renormalize Nov 29, 2024

Thanks for your review. Will do this.

docs/proposals/06-immutable-etcd-backups.md Outdated

+                - Add `immutableSettings.retentionType` under `etcd.spec.backup.store`.
+              - **ETCD Controller Logic:**
+                - When hibernation is requested, by changing `etcd.spec.hibernated.enabled` to `true`:

Contributor

anveshreddy18 Nov 28, 2024

Suggested change

      
-              - When hibernation is requested, by changing `etcd.spec.hibernated.enabled` to `true`:
+              - When hibernation is requested, by changing `etcd.spec.hibernation.enabled` to `true`:

renormalize mentioned this pull request

Improve docs/usage/immutable_snapshots.md, adding example commands to make buckets immutable. gardener/etcd-backup-restore#806

Merged

gardener deleted a comment from gardener-prow bot

shreyas-s-rao requested changes

Contributor

shreyas-s-rao left a comment

@renormalize @seshachalam-yv thanks a lot for the wonderful proposal, and for detailing out ways to make backups immutable. I have a few comments/questions. PTAL, thanks!

docs/proposals/06-immutable-etcd-backups.md Outdated

+              - [etcd-backup-restore PR #776](https://github.com/gardener/etcd-backup-restore/pull/776)
+              - [EtcdCopyBackupsTask Implementation](https://github.com/gardener/etcd-druid/pull/544)
+              ## Glossary

Contributor

shreyas-s-rao Nov 29, 2024

I would prefer if the Glossary was at the beginning, after Summary section. Reason being, it should be easy for a new reader to simply read through the entire document sequentially without having to jump between sections in the document.

Contributor Author

seshachalam-yv Dec 5, 2024

Addressed.
We have moved it after the Summary section and renamed it to Terminology.

docs/proposals/06-immutable-etcd-backups.md Outdated


		### Overview

		We propose introducing immutability in backup storage by leveraging cloud provider features that support a write-once-read-many (WORM) model. This approach will prevent data alterations post-creation, enhancing data integrity and security.

Contributor

shreyas-s-rao Nov 29, 2024

As an example of the previous comment, you mention the WORM model here but have not yet defined it anywhere, so the reader has to scroll down to the glossary and then come back to this section. This seems slightly unintuitive.

Contributor Author

seshachalam-yv Dec 5, 2024

Done

docs/proposals/06-immutable-etcd-backups.md Outdated

		@@ -0,0 +1,315 @@
		---
		title: Immutable ETCD Backups

Contributor

shreyas-s-rao Nov 29, 2024

I recently learned that etcd must always be written in lowercase, i.e. etcd, and not ETCD. There are apparently certain conventions around the way it is pronounced as well. Can we change all occurrences of ETCD to etcd, if possible?

Member

renormalize Dec 2, 2024

Yes. I've already started doing so. I've been a proponent of this since the start.

For future reference, could you link the resource where you read these guidelines for me and the other maintainers as well? Thanks.

docs/proposals/06-immutable-etcd-backups.md Outdated


		This proposal aims to enhance the reliability and integrity of ETCD backups created by `etcd-backup-restore` in ETCD clusters managed by `etcd-druid`, by introducing immutable backups. By leveraging cloud provider features that support a write-once-read-many (WORM) model, unauthorized modifications to backup data are prevented, ensuring that backups remain intact and accessible for restoration.

		The proposed solution relies on `etcd-druid` to manage ETCD backups and handle hibernation processes effectively. It leverages one of the suggested approaches to ensure backups remain immutable over extended periods. It is important to note that using `etcd-backup-restore` standalone may not be sufficient to achieve this functionality end-to-end, as the immutability handling (with respect to hibernation) is specifically managed within `etcd-druid`.

Contributor

shreyas-s-rao Nov 29, 2024

The term "hibernation processes" doesn't seem to have been defined yet, and it will be difficult for a non-Gardener reader to understand. I see that it's in the glossary, so maybe you can link to that section, or give some background about it here.

docs/proposals/06-immutable-etcd-backups.md Outdated


		### Overview

		We propose introducing immutability in backup storage by leveraging cloud provider features that support a write-once-read-many (WORM) model. This approach will prevent data alterations post-creation, enhancing data integrity and security.

Contributor

shreyas-s-rao Nov 29, 2024

It's usually recommended to use the term the authors rather than we, to refer to the authors in third person. Can you please make that change everywhere in the proposal?

docs/proposals/06-immutable-etcd-backups.md Outdated

+              - **Failed Snapshot Before Hibernation:**
+                - **Risk:** Failure to take a full snapshot before hibernation could delay the hibernation process.
+                - **Mitigation:** Implement robust error handling and retries. Notify operators of failures to take corrective action.

Contributor

shreyas-s-rao Dec 2, 2024

What is the behavior if full snapshot fails after repeated retries? Will this infinitely block etcd cluster scale-down? Or would you propose to allow it to happen, but update Etcd.status.lastErrors and have an operator look into it? In the second case, the extend-immutability approach might not work, since it only copies the latest full snapshot and not the latest set of snapshots. Would that need to be changed then?

Contributor Author

seshachalam-yv Dec 5, 2024

What is the behavior if full snapshot fails after repeated retries? Will this infinitely block etcd cluster scale-down?

Yes.

docs/proposals/06-immutable-etcd-backups.md Outdated


		- Review Retention Policies:

		- Set `maxBackups` and `maxBackupAge` in the `EtcdCopyBackupsTask` to manage storage utilization effectively.

Contributor

shreyas-s-rao Dec 2, 2024

These might be insufficient. I would instead propose to introduce a new flag like latest-snapshot-set-only to tell the copy command to only copy over the latest set of snapshots, and not be bothered about the snapshot age or number of snapshots.

Contributor Author

seshachalam-yv Dec 5, 2024

As mentioned in #884 (comment), we are not going to use the copy command.

docs/proposals/06-immutable-etcd-backups.md


		#### Considerations for Object-Level Immutability

		Using object-level immutability provides flexibility in scenarios where certain backups require different immutability periods. However, current limitations and complexities make it less practical for immediate implementation.

Contributor

shreyas-s-rao Dec 2, 2024

Can you please elaborate on these possible scenarios? Would be good to have an idea of what you have already thought of as a use case for object-level immutability, and why you feel it's less practical to support such use cases with object-level immutability.

Contributor

unmarshall Dec 6, 2024

With object-level immutability, all you need to do is extend the immutability of the latest full snapshot. You do not really need to copy the full snapshot, which is currently a hacky solution put in place because we are not going directly to an object-level immutability policy.

docs/proposals/06-immutable-etcd-backups.md Outdated


		#### Conclusion

		Given the complexities and limitations, we recommend using bucket-level immutability in conjunction with the `EtcdCopyBackupsTask` approach (Approach 2) to manage immutability during hibernation effectively. This approach provides a balance between operational simplicity and meeting immutability requirements. The compaction job approach (Approach 1) is also viable but may introduce more resource consumption and operational overhead.

Contributor

shreyas-s-rao Dec 2, 2024

You can make the recommendation in bold, so that it stands out as your final expert opinion ;)

docs/proposals/06-immutable-etcd-backups.md


		Given the complexities and limitations, we recommend using bucket-level immutability in conjunction with the `EtcdCopyBackupsTask` approach (Approach 2) to manage immutability during hibernation effectively. This approach provides a balance between operational simplicity and meeting immutability requirements. The compaction job approach (Approach 1) is also viable but may introduce more resource consumption and operational overhead.

		##### Comparison of Storage Provider Properties for Bucket-Level and Object-Level Immutability

Contributor

shreyas-s-rao Dec 2, 2024

Thanks a lot for this comparison section!


          Expand Summary and Motivation, add Hibernation, reword other sections.

8a90dca

gardener-robot-ci-3 added the reviewed/ok-to-test label

gardener-robot-ci-1 removed the reviewed/ok-to-test label


          Rename extend-immutability command to renew-snapshot

57e91a0

gardener-robot-ci-1 added reviewed/ok-to-test and removed reviewed/ok-to-test labels


          docs: Update proposal for immutable etcd cluster backups

b68574d

- Improve readability and clarity of the summary and motivation sections.
- Add detailed terminology definitions.
- Refine the proposal to focus on bucket-level immutability.

gardener-robot added size/m and removed size/l labels

gardener-robot-ci-1 added the reviewed/ok-to-test label

gardener-robot-ci-2 removed the reviewed/ok-to-test label


          addressed feedback

39255f7

gardener-robot-ci-1 added reviewed/ok-to-test and removed reviewed/ok-to-test labels

unmarshall requested changes


docs/proposals/06-immutable-etcd-backups.md


		## Summary

		Currently, `etcd-druid` can provision etcd clusters and manage their lifecycle. Additionally, it enables regular backups of the etcd cluster state through the sidecar container `etcd-backup-restore`, which is deployed in each etcd pod running a member of the etcd cluster. This functionality is activated when `spec.backup` is enabled with appropriate values for an etcd cluster.

Contributor

unmarshall Dec 6, 2024

Suggested change

      
            Currently, `etcd-druid` can provision etcd clusters and manage their lifecycle. Additionally, it enables regular backups of the etcd cluster state through the sidecar container `etcd-backup-restore`, which is deployed in each etcd pod running a member of the etcd cluster. This functionality is activated when `spec.backup` is enabled with appropriate values for an etcd cluster.
          
            `etcd-druid` provisions etcd clusters and manages their lifecycle. For every etcd cluster, consumers can enable periodic backups of the cluster state by configuring the `spec.backup` section in an Etcd custom resource. Periodic backups are taken via the `etcd-backup-restore` sidecar container that runs in each etcd member pod.

docs/proposals/06-immutable-etcd-backups.md


		Currently, `etcd-druid` can provision etcd clusters and manage their lifecycle. Additionally, it enables regular backups of the etcd cluster state through the sidecar container `etcd-backup-restore`, which is deployed in each etcd pod running a member of the etcd cluster. This functionality is activated when `spec.backup` is enabled with appropriate values for an etcd cluster.

		All actors (with sufficient privileges) in the cluster where `etcd-druid` is deployed, and in the etcd clusters it provisions, have access to the `Secret` that holds the credentials used to upload snapshots of the etcd cluster state. These credentials are used by system actors and human operators—typically to perform various maintenance and recovery operations.

Contributor

unmarshall Dec 6, 2024

This paragraph is not required. The intent is to protect the integrity of backups for a defined period of time once they are uploaded, as they represent the etcd cluster state at a given time and enable restoration. Backups can be compromised in many different ways, so talking about one particular Secret will not be complete. Then again, your intent here is not to list all possible attack vectors that can compromise the backups, so I would avoid mentioning this here altogether.

docs/proposals/06-immutable-etcd-backups.md


		All actors (with sufficient privileges) in the cluster where `etcd-druid` is deployed, and in the etcd clusters it provisions, have access to the `Secret` that holds the credentials used to upload snapshots of the etcd cluster state. These credentials are used by system actors and human operators—typically to perform various maintenance and recovery operations.

		To prevent erroneous operations by human operators during maintenance and recovery, or by misbehaving actors in the cluster - which could potentially lead to an unrecoverable restoration failure, the authors propose using write-once-read-many ([WORM](#terminology)) features offered by various cloud providers where available.

Contributor

unmarshall Dec 6, 2024

Suggested change

      
            To prevent erroneous operations by human operators during maintenance and recovery, or by misbehaving actors in the cluster - which could potentially lead to an unrecoverable restoration failure, the authors propose using write-once-read-many ([WORM](#terminology)) features offered by various cloud providers where available.
          
            Periodic backups of an etcd cluster state ensure the ability to recover from a complete quorum loss, enhancing reliability and fault tolerance. It is crucial that these backups, which are vital for restoring the etcd cluster, remain protected from any form of tampering, whether intentional or accidental. To safeguard the integrity of these backups, the authors recommend utilizing WORM (write-once-read-many) protection, a feature offered by various cloud providers, to ensure the backups remain immutable and secure.

docs/proposals/06-immutable-etcd-backups.md


		To prevent erroneous operations by human operators during maintenance and recovery, or by misbehaving actors in the cluster - which could potentially lead to an unrecoverable restoration failure, the authors propose using write-once-read-many ([WORM](#terminology)) features offered by various cloud providers where available.

		This [WORM](#terminology) model will enhance the reliability and integrity of etcd cluster state backups created by `etcd-backup-restore` in etcd clusters managed and operated by `etcd-druid` by ensuring that the backups are [immutable](#terminology) for a specific period from the time they are uploaded, thereby preventing any unintended modifications.

Contributor

unmarshall Dec 6, 2024

This para is not required if you are ok with the change suggested in the above para.

docs/proposals/06-immutable-etcd-backups.md


		This [WORM](#terminology) model will enhance the reliability and integrity of etcd cluster state backups created by `etcd-backup-restore` in etcd clusters managed and operated by `etcd-druid` by ensuring that the backups are [immutable](#terminology) for a specific period from the time they are uploaded, thereby preventing any unintended modifications.

		`etcd-druid` and `etcd-backup-restore` will be enhanced to achieve the same functionality currently achieved by modifying or deleting backups, but without actually modifying or deleting these backups, since they will now be immutable for a set duration. This approach eliminates the possibility of potential data loss. `etcd-druid` will provide an end-to-end solution for achieving this functionality, as relying solely on `etcd-backup-restore` is insufficient given the scope and possible approaches to achieving this.

Contributor

unmarshall Dec 6, 2024

This para is not at all clear to me.

docs/proposals/06-immutable-etcd-backups.md


		#### Considerations for Object-Level Immutability

		Using object-level immutability provides flexibility in scenarios where certain backups require different immutability periods. However, current limitations and complexities make it less practical for immediate implementation.

Contributor

unmarshall Dec 6, 2024

With object-level immutability, all you need to do is extend the immutability period of the latest full snapshot. You do not need to copy the full snapshot, which is a hacky workaround that exists only because we are not going directly to an object-level immutability policy.
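
To illustrate the point about extending immutability for the latest full snapshot, here is a minimal sketch. It is not code from `etcd-druid` or `etcd-backup-restore`: the `Snapshot` record, its fields, and the `renew_latest_full` helper are hypothetical names used only for illustration, and a real implementation would call the storage provider's object-retention API rather than update an in-memory record.

```python
from dataclasses import dataclass, replace
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass(frozen=True)
class Snapshot:
    """Hypothetical record of one backup object (not an etcd-backup-restore type)."""
    name: str
    kind: str               # "full" or "delta"
    created_at: datetime
    retain_until: datetime  # end of this object's immutability period


def latest_full(snapshots: List[Snapshot]) -> Optional[Snapshot]:
    """Return the most recently created full snapshot, or None if there is none."""
    fulls = [s for s in snapshots if s.kind == "full"]
    return max(fulls, key=lambda s: s.created_at, default=None)


def renew_latest_full(
    snapshots: List[Snapshot], now: datetime, period: timedelta
) -> List[Snapshot]:
    """Extend the retention of the latest full snapshot to at least now + period.

    Only that one object is touched; nothing is copied, and an existing
    retention date is never shortened.
    """
    target = latest_full(snapshots)
    if target is None:
        return list(snapshots)
    new_until = max(target.retain_until, now + period)
    return [
        replace(s, retain_until=new_until) if s is target else s
        for s in snapshots
    ]
```

Under these assumptions, a hibernated cluster would only need a periodic call to something like `renew_latest_full` before the current retention date lapses, instead of re-uploading a copy of the snapshot.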

docs/proposals/06-immutable-etcd-backups.md


		Using object-level immutability provides flexibility in scenarios where certain backups require different immutability periods. However, current limitations and complexities make it less practical for immediate implementation.

		- Enabling object-level immutability requires bucket-level immutability to be set first (applicable in S3 and ABS). In GCS, the capability to enable object-level immutability on an existing bucket is not available.

Contributor

unmarshall Dec 6, 2024

Can you please also mention the plan or date by which GCS is going to enable object-level immutability on existing buckets? If there is an open issue, please provide a link here.

docs/proposals/06-immutable-etcd-backups.md


		Disadvantages:

		- Provider Limitations: Enabling object-level immutability on existing buckets is not universally supported.

Contributor

unmarshall Dec 6, 2024

Can you elaborate on this point a bit more? What is not universal?

docs/proposals/06-immutable-etcd-backups.md

+              **Disadvantages:**
+              - **Provider Limitations:** Enabling object-level immutability on existing buckets is not universally supported.
+              - **Increased Complexity:** Requires additional logic in backup processes and tooling.

Contributor

unmarshall Dec 6, 2024

What is the increase in complexity as compared to what we are already introducing for bucket level immutability policies?

docs/proposals/06-immutable-etcd-backups.md

+              - **Provider Limitations:** Enabling object-level immutability on existing buckets is not universally supported.
+              - **Increased Complexity:** Requires additional logic in backup processes and tooling.
+              - **Prerequisites:** Some providers require bucket-level immutability to be set first.

Contributor

unmarshall Dec 6, 2024

Why is this a con?

Contributor

unmarshall commented Dec 6, 2024

Thanks for the PR @seshachalam-yv and @renormalize. I have added comments.


Reviewers

anveshreddy18 anveshreddy18 left review comments

renormalize renormalize left review comments

ashwani2k ashwani2k requested changes

ishan16696 ishan16696 requested changes

shreyas-s-rao shreyas-s-rao requested changes

unmarshall unmarshall requested changes

Requested changes must be addressed to merge this pull request.

Labels

area/backup area/compliance area/disaster-recovery area/security area/storage kind/enhancement needs/changes needs/ok-to-test needs/review needs/second-opinion size/m