Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Decouple backup ready condition for snapshot leases and remove support for deprecated fields in etcd status #820

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

anveshreddy18
Copy link
Contributor

@anveshreddy18 anveshreddy18 commented Jun 26, 2024

How to categorize this PR?

/area usability monitoring
/kind enhancement cleanup

What this PR does / why we need it:

This PR splits the existing BackupReady condition which talks about both full and delta snapshot health as a single condition into two new conditions FullSnapshotBackupReady and DeltaSnapshotBackupReady, one for each type of snapshot i.e Full and Delta respectively. This helps in better understanding of status of each backup type thus helping debug issues and analyze more effectively. The existing BackupReady condition will still be present, which aggregates both the individual snapshot conditions into one.

This PR also removes the support for few fields in etcd status which are in deprecated state for enough time. The fields that are removed are ServiceName, ClusterSize, UpdatedReplicas and LabelSelector. These fields were essentially taken from statefulset and not necessarily needed to be present under etcd.status hence they were deprecated and this PR stops the support.

This PR partially fixes this issue for streamlining etcd status

Which issue(s) this PR fixes:

Special notes for your reviewer:

Release note:

The existing `BackupReady` condition in `etcd.status` is split into 2 new specific conditions `FullSnapshotBackupReady` & `DeltaSnapshotBackupReady` to better understand the health of each type of snapshot backup
Support for fields `ServiceName`, `ClusterSize`,  `UpdatedReplicas` and `LabelSelector`  in `etcd.status`  are removed after they've been deprecated for 4 minor releases 

@anveshreddy18 anveshreddy18 requested a review from a team as a code owner June 26, 2024 16:32
@gardener-robot gardener-robot added needs/review Needs review size/xl Size of pull request is huge (see gardener-robot robot/bots/size.py) needs/second-opinion Needs second review by someone else labels Jun 26, 2024
@gardener-robot-ci-3 gardener-robot-ci-3 added the reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) label Jun 26, 2024
@gardener-robot-ci-2 gardener-robot-ci-2 added needs/ok-to-test Needs approval for testing (check PR in detail before setting this label because PR is run on CI/CD) and removed reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) labels Jun 26, 2024
@gardener-robot gardener-robot added area/monitoring Monitoring (including availability monitoring and alerting) related area/usability Usability related kind/cleanup Something that is not needed anymore and can be cleaned up kind/enhancement Enhancement, improvement, extension labels Jun 26, 2024
@gardener gardener deleted a comment from gardener-robot Jun 26, 2024
@anveshreddy18 anveshreddy18 self-assigned this Jun 26, 2024
@anveshreddy18 anveshreddy18 force-pushed the decouple_snapshot_leases branch from 715260b to d0f3163 Compare June 26, 2024 16:43
@gardener-robot-ci-3 gardener-robot-ci-3 added the reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) label Jun 26, 2024
@gardener-robot-ci-1 gardener-robot-ci-1 removed the reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) label Jun 26, 2024
@anveshreddy18
Copy link
Contributor Author

/retest

1 similar comment
@anveshreddy18
Copy link
Contributor Author

/retest

@seshachalam-yv
Copy link
Contributor

/assign

Copy link
Contributor

@seshachalam-yv seshachalam-yv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please remove Cluster Size and Backup Ready.

// +kubebuilder:printcolumn:name="Cluster Size",type=integer,JSONPath=`.spec.replicas`,priority=1

// +kubebuilder:printcolumn:name="Backup Ready",type=string,JSONPath=`.status.conditions[?(@.type=="BackupReady")].status`

Add FullSnapshotBackupReady and DeltaSnapshotBackupReady as part of print columns

@gardener-robot gardener-robot added the needs/changes Needs (more) changes label Jul 9, 2024
@gardener-robot-ci-2 gardener-robot-ci-2 added reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) and removed reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) labels Jul 10, 2024
@anveshreddy18
Copy link
Contributor Author

@seshachalam-yv Nice find, thanks. Done

@gardener-robot-ci-1 gardener-robot-ci-1 added reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) and removed reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) labels Jul 18, 2024
go.sum Outdated Show resolved Hide resolved
test/e2e/etcd_backup_test.go Outdated Show resolved Hide resolved
@seshachalam-yv seshachalam-yv removed their assignment Aug 19, 2024
@gardener-robot gardener-robot added the needs/rebase Needs git rebase label Aug 21, 2024
@gardener-robot-ci-2 gardener-robot-ci-2 added reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) and removed reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) labels Aug 21, 2024
@gardener gardener deleted a comment from gardener-prow bot Aug 21, 2024
@anveshreddy18 anveshreddy18 force-pushed the decouple_snapshot_leases branch from c696b9c to 4e2d378 Compare August 21, 2024 06:37
@gardener-robot-ci-2 gardener-robot-ci-2 added the reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) label Aug 21, 2024
@gardener-robot-ci-3 gardener-robot-ci-3 removed the reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) label Aug 21, 2024
@anveshreddy18 anveshreddy18 force-pushed the decouple_snapshot_leases branch from 4e2d378 to 98b46de Compare August 21, 2024 06:40
@gardener-robot-ci-3 gardener-robot-ci-3 added the reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) label Aug 21, 2024
@gardener gardener deleted a comment from gardener-robot Aug 21, 2024
@gardener-robot-ci-3 gardener-robot-ci-3 removed the reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) label Aug 21, 2024
@gardener-robot-ci-1 gardener-robot-ci-1 added the reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) label Aug 21, 2024
@gardener-robot-ci-2 gardener-robot-ci-2 removed the reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) label Aug 21, 2024
@anveshreddy18 anveshreddy18 force-pushed the decouple_snapshot_leases branch from 9036d63 to b0b9725 Compare August 27, 2024 07:55
@gardener-robot-ci-1 gardener-robot-ci-1 added reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) and removed reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) labels Aug 27, 2024
@anveshreddy18
Copy link
Contributor Author

/test pull-etcd-druid-e2e-kind

Copy link
Contributor

@unmarshall unmarshall left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need to pair program this. Current code needs a LOT of changes.

// +kubebuilder:printcolumn:name="Current Replicas",type=integer,JSONPath=`.status.currentReplicas`,priority=1
// +kubebuilder:printcolumn:name="Ready Replicas",type=integer,JSONPath=`.status.readyReplicas`,priority=1
// +kubebuilder:printcolumn:name="Backup Ready",type=string,JSONPath=`.status.conditions[?(@.type=="BackupReady")].status`
// +kubebuilder:printcolumn:name="Full Backup Ready",type=string,JSONPath=`.status.conditions[?(@.type=="FullSnapshotBackupReady")].status`
// +kubebuilder:printcolumn:name="Delta Backup Ready",type=string,JSONPath=`.status.conditions[?(@.type=="DeltaSnapshotBackupReady")].status`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we call them Delta Backup Ready or Delta Snapshot Backup Ready? Similarly for full snapshot.

@@ -79,6 +79,8 @@ Status fields related to the etcd cluster itself, such as `Members`, `PeerUrlTLS
- `AllMembersReady`: indicates readiness of all members of the etcd cluster.
- `Ready`: indicates overall readiness of the etcd cluster in serving traffic.
- `BackupReady`: indicates health of the etcd backups, i.e., whether etcd backups are being taken regularly as per schedule. This condition is applicable only when backups are enabled for the etcd cluster.
- `FullSnapshotBackupReady`: indicates health of the full snapshot backups, i.e whether full snapshot backups are being taken regularly as per schedule. This condition is applicable only when backups are enabled for the etcd cluster.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in which situation will this be set to false? If one full snapshot is missed or more than one missed?

@@ -79,6 +79,8 @@ Status fields related to the etcd cluster itself, such as `Members`, `PeerUrlTLS
- `AllMembersReady`: indicates readiness of all members of the etcd cluster.
- `Ready`: indicates overall readiness of the etcd cluster in serving traffic.
- `BackupReady`: indicates health of the etcd backups, i.e., whether etcd backups are being taken regularly as per schedule. This condition is applicable only when backups are enabled for the etcd cluster.
- `FullSnapshotBackupReady`: indicates health of the full snapshot backups, i.e whether full snapshot backups are being taken regularly as per schedule. This condition is applicable only when backups are enabled for the etcd cluster.
- `DeltaSnapshotBackupReady`: indicates health of the delta snapshot backups, i.e whether delta snapshot backups are being taken regularly as per schedule. This condition is applicable only when backups are enabled for the etcd cluster.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the criteria for this value to be false?

@@ -122,7 +122,7 @@ kubectl apply -f config/samples/druid_v1alpha1_etcd.yaml

### Verify the Etcd cluster

To obtain information regarding the newly instantiated etcd cluster, perform the following step, which gives details such as the cluster size, readiness status of its members, and various other attributes.
To obtain information regarding the newly instantiated etcd cluster, perform the following step, which gives details such as the current & ready replicas, readiness status of its members, Full and Delta backup status and various other attributes.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
To obtain information regarding the newly instantiated etcd cluster, perform the following step, which gives details such as the current & ready replicas, readiness status of its members, Full and Delta backup status and various other attributes.
To obtain information regarding the newly instantiated etcd cluster, perform the following step, which gives details such as the current & ready replicas, readiness status of its members, Full Snapshot and Delta Snapshot backup status and various other attributes.

}

var FullSnapshotBackupReadyCheckResult, DeltaSnapshotBackupReadyCheckResult Result = nil, nil
for _, result := range b.results {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

by using the loop variable result you are shadowing the variable result defined at the start of Check method. Please avoid shadowing.

)
fullSnapErr = a.cl.Get(ctx, types.NamespacedName{Name: getFullSnapLeaseName(&etcd), Namespace: etcd.ObjectMeta.Namespace}, fullSnapLease)
incrSnapErr = a.cl.Get(ctx, types.NamespacedName{Name: getDeltaSnapLeaseName(&etcd), Namespace: etcd.ObjectMeta.Namespace}, deltaSnapLease)
fullSnapErr = f.cl.Get(ctx, types.NamespacedName{Name: getFullSnapLeaseName(&etcd), Namespace: etcd.ObjectMeta.Namespace}, fullSnapLease)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if there is an error getting the full snapshot lease then is the error being ignored immediately but handled much later? If there is an error you should exit soon with error instead of continuing with the computations.

}
}

func (d *deltaSnapshotBackupReadyCheck) Check(ctx context.Context, etcd druidv1alpha1.Etcd) Result {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there is a LOT of code repetition and this code needs to be rewritten.

@@ -36,7 +36,8 @@ var (
ConditionChecks = []ConditionCheckFn{
condition.AllMembersReadyCheck,
condition.ReadyCheck,
condition.BackupReadyCheck,
condition.FullSnapshotBackupReadyCheck,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you have removed BackupReadyCheck here but you have also changed and kept the implementation of BackupReady in check_backup_ready.go. Any reason for that?

@@ -90,10 +91,12 @@ func (c *Checker) executeConditionChecks(ctx context.Context, etcd *druidv1alpha
wg.Wait()
})()

results := make([]condition.Result, 0, len(ConditionChecks))
results := make([]condition.Result, 0, len(ConditionChecks)+1)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is this conditions checks + 1?

for r := range resultCh {
results = append(results, r)
}
backupReadyChecker := condition.BackupReadyCheck(c.cl, results)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i really do not understand this. You removed it from ConditionChecks but introduce this here? why?

@anveshreddy18 anveshreddy18 removed their assignment Nov 11, 2024
Copy link

gardener-prow bot commented Nov 14, 2024

@anveshreddy18: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
pull-etcd-druid-check-renovate-config b0b9725 link true /test pull-etcd-druid-check-renovate-config
pull-etcd-druid-integration b0b9725 link true /test pull-etcd-druid-integration
pull-etcd-druid-unit b0b9725 link true /test pull-etcd-druid-unit
pull-etcd-druid-verify-image-build b0b9725 link true /test pull-etcd-druid-verify-image-build

Full PR test history. Your PR dashboard. Command help for this repository.
Please help us cut down on flakes by linking this test failure to an open flake report or filing a new flake report if you can't find an existing one. Also see our testing guideline for how to avoid and hunt flakes.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/monitoring Monitoring (including availability monitoring and alerting) related area/usability Usability related kind/cleanup Something that is not needed anymore and can be cleaned up kind/enhancement Enhancement, improvement, extension needs/changes Needs (more) changes needs/ok-to-test Needs approval for testing (check PR in detail before setting this label because PR is run on CI/CD) needs/rebase Needs git rebase needs/review Needs review needs/second-opinion Needs second review by someone else size/xl Size of pull request is huge (see gardener-robot robot/bots/size.py)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants