Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CRD Validation for Etcd resource using CEL expressions #995

Open
wants to merge 21 commits into
base: master
Choose a base branch
from

Conversation

Shreyas-s14
Copy link
Contributor

@Shreyas-s14 Shreyas-s14 commented Feb 3, 2025

How to categorize this PR?

/area quality
/area control-plane
/area open-source
/kind enhancement

What this PR does / why we need it:

  • Performs validations for the etcd resource via kubebuilder markers, using a combination of CEL Expressions via the x-validation and the validation markers.
  • Verifies the effectiveness of the validation rules using integration tests.
  • To support k8s clusters < 1.29 and >=1.29 it generates 2 versions of CRDs one with CEL expressions and one without.
  • api/core/crds is changed to now have API methods which takes in a k8s version to provide the appropriate CRD.
  • Enhances make deploy-* and make prepare-helm-chart targets to work with CRDs with and without CEL expressions.
  • Updates the documentation to include instructions on how to properly use the CRDs when setting up etcd-druid.
  • Adapts IT tests setup to take the correct CRDs.

Which issue(s) this PR fixes:
Fixes #989

Special notes for your reviewer:

Release note:

 Introduces 'Etcd` custom resource validations using CEL expressions. This will be available for kubernetes clusters with version >= 1.29 only.

@Shreyas-s14 Shreyas-s14 requested a review from a team as a code owner February 3, 2025 20:49
@gardener-robot gardener-robot added needs/review Needs review area/quality Output qualification (tests, checks, scans, automation in general, etc.) related kind/enhancement Enhancement, improvement, extension labels Feb 3, 2025
@gardener-robot
Copy link

@Shreyas-s14 Thank you for your contribution.

@gardener-robot gardener-robot added size/xl Size of pull request is huge (see gardener-robot robot/bots/size.py) needs/second-opinion Needs second review by someone else labels Feb 3, 2025
@gardener-robot-ci-1
Copy link
Contributor

Thank you @Shreyas-s14 for your contribution. Before I can start building your PR, a member of the organization must set the required label(s) {'reviewed/ok-to-test'}. Once started, you can check the build status in the PR checks section below.

@shreyas-s-rao shreyas-s-rao added this to the v0.28.0 milestone Feb 4, 2025
Copy link
Contributor

@seshachalam-yv seshachalam-yv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you, @Shreyas-s14, for adding the validations for the etcd CR. I've provided a few initial review comments based on my first glance. I will share more detailed feedback after a thorough review.

ReelectionPeriod *metav1.Duration `json:"reelectionPeriod,omitempty"`
// EtcdConnectionTimeout defines the timeout duration for etcd client connection during leader election.
// +optional
// +kubebuilder:validation:Type=string
// +kubebuilder:validationXValidation:message="Invalid duration given for etcd.spec.backup.leaderElection.etcdCOnnectionTimeout",rule="self.matches('^([0-9]+([.][0-9]+)?h)?([0-9]+([.][0-9]+)?m)?([0-9]+([.][0-9]+)?s)?([0-9]+([.][0-9]+)?d)?$')"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// +kubebuilder:validationXValidation:message="Invalid duration given for etcd.spec.backup.leaderElection.etcdCOnnectionTimeout",rule="self.matches('^([0-9]+([.][0-9]+)?h)?([0-9]+([.][0-9]+)?m)?([0-9]+([.][0-9]+)?s)?([0-9]+([.][0-9]+)?d)?$')"
// +kubebuilder:validationXValidation:message="Invalid duration given for etcd.spec.backup.leaderElection.etcdConnectionTimeout",rule="self.matches('^([0-9]+([.][0-9]+)?h)?([0-9]+([.][0-9]+)?m)?([0-9]+([.][0-9]+)?s)?([0-9]+([.][0-9]+)?d)?$')"


### Field validations
- The fields which expect only a particular set of values are checked by using the kubebuilder marker: `+kubebuilder:validation:Enum=<value1>;<value2>`
* The `etcd.spec.etcd.metrics` can only be set as either `basic` or `extesnive`.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* The `etcd.spec.etcd.metrics` can only be set as either `basic` or `extesnive`.
* The `etcd.spec.etcd.metrics` can only be set as either `basic` or `extensive`.

g.Expect(cl.Create(ctx, etcd)).To(Succeed())

etcd.Spec.Replicas = int32(test.updatedReplicas)
time.Sleep(30)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do not use time.sleep, For more detailed guidance, please refer to the

* Do not use `time.Sleep` and friends as it renders the tests flaky.

@gardener-robot gardener-robot added the needs/changes Needs (more) changes label Feb 5, 2025
@Shreyas-s14
Copy link
Contributor Author

/test pull-etcd-druid-e2e-kind

@shreyas-s-rao shreyas-s-rao self-assigned this Feb 24, 2025
Shreyas-s14 and others added 11 commits February 27, 2025 10:43
* Added `itTestEnv.StartManager()` to start the manager in `setupTestEnvironment()`.

* Upgraded teh env-test k8s version to `1.30`.

* Changed the timeout to `20s` in `assertReconcileSuspensionEventRecorded`
  to prevent the test timing out.

* Removed an unnecessary `time.Sleep` in `TestValidateSpecReplicas`.

* Uncommented previously commented options which caused regressions.
2) added version compatibility docs
3) reduced code reuse in specUpdate_test.go
1) increased value of timeout passed in test/it/helper.go createAndAssertEtcdAndAllManagedResources
2) increased the timeout in test/e2e/suite_test.go to 20 minutes each
Reverted the timeout value in test/it/helper.go which was increased to handle failure of prow jobs
@@ -0,0 +1,37 @@
# CRD Validations for etcd

* The validations for the fields within the `etcd` resource are done via [kubebuilder markers for CRD validation](https://book.kubebuilder.io/reference/markers/crd-validation).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some where in this doc you must mention that this is meant for k8s clusters with version >= 1.30.
What should consumers who are using k8s versions older than 1.30 do? This needs to be clearly mentioned. I think we will in general need to enhance the CRD apis to have a k8s version otherwise with this change etcd-druid can no longer work with k8s clusters with version < 1.30 which is going to be an issue.

So if we wish to have etcd-druid be consumed outside gardener then it is essential that we have some sort of validation for all the CRs. Unfortunately we do not have that today as well for non-gardener use cases. What should be our approach for non-gardener users using k8s version < 1.30? - do we say your CRs will never get validated at the time of submission to KAPI and you will only come to know of any problem during runtime or during provisioning an etcd cluster?

@@ -162,12 +162,13 @@ spec:
deltaSnapshotPeriod:
description: DeltaSnapshotPeriod defines the period after which
delta snapshots will be taken
pattern: ^([0-9]+([.][0-9]+)?h)?([0-9]+([.][0-9]+)?m)?([0-9]+([.][0-9]+)?s)?([0-9]+([.][0-9]+)?d)?$
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have you not introduced CEL expressions for EtcdCopyBackupsTask CRD? There should be uniformity in the way all CRs are validated, right?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@unmarshall I would prefer not increasing the scope of this PR now by introducing validations for EtcdCopyBackupsTask resources. We never ran validations for those, so we can continue live without validations for another release or two.

To clarify, gardener runs GRM webhook that uses the validation functions provided in api/validation/etcdcopybackupstask.go, but for open-source users of etcd-druid, currently there's no validation run for either Etcd or EtcdCopyBackupsTask resources, sicne druid doesn't run a validating webhook of its own against these resources. So this PR is simply a bonus for open-source users of etcd-druid. And as for gardener, if we choose to retain the the GRM webhook validation for a couple more druid releases, then we are completely safe. And even if we remove GRM webhook validation for k8s versions v1.30 and above, since we know the EtcdCopyBackupsTask resource is well-formed in gardener, it should be ok to live without validating these resources for another release or two, and only for k8s clusters running v1.30 or above.

I would rather want focus on #995 (comment) since that is more pressing right now.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Shreyas-s14 can you confirm that you have tested with g/g locally setup that already has a GRM validating webhook enabled? Test it with the following combination:

  • CRD with CEL expressions & GRM webhook
  • CRD without CEL expressions & GRM webhook

I would prefer not increasing the scope of this PR now by introducing validations for EtcdCopyBackupsTask resources.

Sure. Please create an issue to add validations for EtcdCopyBackupsTask so that this is not forgotten. This should be done before we remove the validating webhooks from GRM.

@gardener-robot gardener-robot added the area/open-source Open Source (community, enablement, contributions, conferences, CNCF, etc.) related label Feb 28, 2025
@unmarshall unmarshall changed the title CRD Validation for etcd CRD Validation for Etcd resource using CEL expressions Feb 28, 2025
Shreyas-s14 and others added 3 commits February 28, 2025 15:31
1) Renamed "test/it/crd-validation" to "test/crdvalidation"
2) Removed Camel casing from the file names in test/crdvalidation/etcd
3) Changed version-compatibility matrix to 1.29 since CEL expressions are valid from this version onwards
4) Removed config/crd/bases/crd-druid.gardener.cloud_etcds.yaml
1) make prepare-helm-charts now takes an additional parameter to deploy the crd either with or without the CEL expressions based on the k8s version.
2) modified api/hack/generate.sh to generate 2 versions of the etcd crd. One with and one without CEL expressions
3) added deploy.sh which is called when make deploy is called. Additional functionality: get k8s version and deploy accordingly.
@gardener-robot-ci-1 gardener-robot-ci-1 added the reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) label Mar 3, 2025
@gardener-robot-ci-3 gardener-robot-ci-3 added needs/ok-to-test Needs approval for testing (check PR in detail before setting this label because PR is run on CI/CD) and removed reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) labels Mar 3, 2025
@gardener-robot-ci-1 gardener-robot-ci-1 added the reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) label Mar 3, 2025
@gardener-robot-ci-2 gardener-robot-ci-2 removed the reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) label Mar 3, 2025
@gardener-robot-ci-3 gardener-robot-ci-3 added reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) and removed reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) labels Mar 3, 2025
@gardener-robot-ci-2 gardener-robot-ci-2 added the reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) label Mar 3, 2025
@gardener-robot-ci-1 gardener-robot-ci-1 removed the reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) label Mar 3, 2025
@unmarshall
Copy link
Contributor

/retest

@gardener-robot-ci-2 gardener-robot-ci-2 added reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) and removed reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) labels Mar 4, 2025
…lidation for requests field in etcd

1) Changed the Makefile target for dpeloy to install the dependecies if not present
2) Removed existing validations for fields of type resources where the request field was mandatory.
3) reverted the cron expressions in examples/etcd/druid_v1alpha1_etcd_localstack.yaml
4) exported the kubernetes version variable.
5) removed duplicate variable in hack/tools.mk
6) Added path to etcd crd without CEL as a return type in assets.go if the k8s version is < 1.29. Added the function to get k8s version as well
7) Changes to test/it/controller/etcd/reconciler_test.go and test/it/crdvalidation/etcd/validation_test.go based on changes to test/it/assets/assets.go
8) Skip test cases based on k8s versions
os.Exit(1)
}

k8sVersionAbove129, err = crds.IsK8sVersionAbove129(k8sVersion)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why can't you re-use crds.GetAll function?

…Versions + added comment(description) to exported function
…ed comment addressing duplication of code in test/it/assets/assets.go
Copy link

gardener-prow bot commented Mar 6, 2025

@Shreyas-s14: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
pull-etcd-druid-e2e-kind ff23d1d link true /test pull-etcd-druid-e2e-kind

Full PR test history. Your PR dashboard. Command help for this repository.
Please help us cut down on flakes by linking this test failure to an open flake report or filing a new flake report if you can't find an existing one. Also see our testing guideline for how to avoid and hunt flakes.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/control-plane Control plane related area/open-source Open Source (community, enablement, contributions, conferences, CNCF, etc.) related area/quality Output qualification (tests, checks, scans, automation in general, etc.) related kind/enhancement Enhancement, improvement, extension needs/changes Needs (more) changes needs/ok-to-test Needs approval for testing (check PR in detail before setting this label because PR is run on CI/CD) needs/review Needs review needs/second-opinion Needs second review by someone else size/xl Size of pull request is huge (see gardener-robot robot/bots/size.py)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

CRD Validation for ETCD
9 participants