
Add Support for Configuring CephCluster HealthCheck Settings in OCS Operator #2940

Merged: 1 commit merged into red-hat-storage:main on Jan 27, 2025

Conversation

@OdedViner (Contributor) commented Jan 5, 2025:

This PR introduces support for advanced users to configure the healthCheck settings of the CephCluster directly through the StorageCluster spec. The healthCheck is based on the Rook CephClusterHealthCheckSpec type, providing flexibility to adjust various health check parameters for Ceph daemons.
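For orientation, here is a minimal sketch of the shape the API addition converges on after review. The enclosing struct name and the surrounding fields are assumptions for this sketch; the field name, pointer type, and json tag follow the review discussion below.

package v1

import (
	rookCephv1 "github.com/rook/rook/pkg/apis/ceph.rook.io/v1"
)

// ManageCephCluster (name assumed for this sketch) is the CephCluster section
// under spec.managedResources in the StorageCluster CRD.
type ManageCephCluster struct {
	// ... existing fields such as OsdMaintenanceTimeout ...

	// HealthCheck overrides the CephCluster health check settings.
	// A pointer is used so the operator can distinguish "unset" (nil,
	// meaning Rook defaults apply) from an explicit override.
	HealthCheck *rookCephv1.CephClusterHealthCheckSpec `json:"healthCheck,omitempty"`
}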

@OdedViner OdedViner force-pushed the sc_mon_time_out branch 9 times, most recently from 071ffe5 to 58f23d6 Compare January 5, 2025 17:00
@travisn (Contributor) left a comment:

See the discussion in https://issues.redhat.com/browse/DFBUGS-565 for a different approach to exposing this setting.

@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jan 7, 2025
@openshift-merge-robot openshift-merge-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jan 15, 2025
// cephCluster.spec.healthCheck.daemonHealth.mon.timeout in Rook Ceph.
// The default is 10 minutes. This field allows customization or disabling
// failover by setting the value to 0.
MonTimeout string `json:"monTimeout,omitempty"`
@travisn (Contributor) commented Jan 15, 2025:

From the discussion in https://issues.redhat.com/browse/DFBUGS-565, I thought we agreed to go with the full spec instead of defining a new value, something like this:

Suggested change
MonTimeout string `json:"monTimeout,omitempty"`
HealthCheck rookCephv1.CephClusterHealthCheckSpec `json:"healthCheck,omitempty"`

@OdedViner (Author):

done

@OdedViner OdedViner force-pushed the sc_mon_time_out branch 8 times, most recently from 0e1d180 to 2f1cdc2 Compare January 16, 2025 16:41
@travisn (Contributor) left a comment:

LGTM, just a nit

@@ -503,6 +503,10 @@ func newCephCluster(sc *ocsv1.StorageCluster, cephImage string, kmsConfigMap *co
cephCluster.Spec.DisruptionManagement.OSDMaintenanceTimeout = sc.Spec.ManagedResources.CephCluster.OsdMaintenanceTimeout
}

if sc.Spec.ManagedResources.CephCluster.HealthCheck != nil {
cephCluster.Spec.HealthCheck = *sc.Spec.ManagedResources.CephCluster.HealthCheck
Contributor:

nit: extra space, though I'm curious why the CI didn't catch it with go vet

Suggested change (whitespace only; the suggestion removes the extra space):
cephCluster.Spec.HealthCheck = *sc.Spec.ManagedResources.CephCluster.HealthCheck

@OdedViner (Author):

done

$ go fmt /home/oviner/DEV_REPOS/ocs-operator/controllers/storagecluster/cephcluster.go
controllers/storagecluster/cephcluster.go

@OdedViner OdedViner force-pushed the sc_mon_time_out branch 2 times, most recently from f0f4ec1 to 95a4367 Compare January 19, 2025 11:46
@OdedViner OdedViner changed the title Add monTimeout param in storagecluster CR Add Support for Configuring CephCluster HealthCheck Settings in OCS Operator Jan 19, 2025
@OdedViner OdedViner force-pushed the sc_mon_time_out branch 3 times, most recently from c8c1f79 to d23613b Compare January 19, 2025 12:03
@openshift-ci openshift-ci bot added lgtm Indicates that a PR is ready to be merged. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Jan 19, 2025
@openshift-ci openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label Jan 19, 2025
@OdedViner OdedViner requested a review from travisn January 20, 2025 13:21
@travisn (Contributor) left a comment:

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Jan 21, 2025
@iamniting (Member):

/hold

Adding a hold for testing; please remove the hold once it has been tested.

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jan 21, 2025
@@ -503,6 +503,10 @@ func newCephCluster(sc *ocsv1.StorageCluster, cephImage string, kmsConfigMap *co
cephCluster.Spec.DisruptionManagement.OSDMaintenanceTimeout = sc.Spec.ManagedResources.CephCluster.OsdMaintenanceTimeout
}

if sc.Spec.ManagedResources.CephCluster.HealthCheck != nil {
cephCluster.Spec.HealthCheck = *sc.Spec.ManagedResources.CephCluster.HealthCheck
Member:

Can a user set nil to disable the health check? Is that allowed?

Contributor:

nil would cause Rook's defaults to be used. To disable the health checks, there are subsettings for enabled: false. See the example here.
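For illustration, a sketch of how such an override could be constructed in Go. The YAML key is disabled (as the test output later in this thread shows); the Go field names below are assumptions based on the Rook v1 API and should be checked against the vendored types.

package example

import (
	rookCephv1 "github.com/rook/rook/pkg/apis/ceph.rook.io/v1"
)

// buildHealthCheckOverride returns an override that disables the mon
// health check entirely; returning nil instead keeps Rook's defaults.
// (Field names assumed; verify against the vendored Rook types.)
func buildHealthCheckOverride() *rookCephv1.CephClusterHealthCheckSpec {
	return &rookCephv1.CephClusterHealthCheckSpec{
		DaemonHealth: rookCephv1.DaemonHealthSpec{
			// "mon" in YAML; Disabled turns the mon failover check off.
			Monitor: rookCephv1.HealthCheckSpec{Disabled: true},
		},
	}
}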

Member:

@travisn in that case, how can someone go back to the defaults? Since there is a nil check here, it won't allow for that, will it?

Contributor:

I had the following comment yesterday, but missed submitting it. Looks like the conversation is already resolved though...

If sc.Spec.ManagedResources.CephCluster.HealthCheck == nil, the CephCluster CR would be generated without those settings, which would cause Rook to use its defaults, the same as before this change. cephCluster.Spec.HealthCheck is not nullable, so we can't set it to nil; leaving the struct empty effectively enables the defaults.

@OdedViner (Author):

I will try to test this scenario on a live cluster.

Member:

I was just wondering about the case where a user creates the StorageCluster without sc.Spec.ManagedResources.CephCluster.HealthCheck, later sets it in the StorageCluster and gets the expected changes in the CephCluster, and then removes the HealthCheck setting from the StorageCluster again. At that point sc.Spec.ManagedResources.CephCluster.HealthCheck will presumably be nil, but ocs-operator does not revert the changes on the CephCluster CR, so the CephCluster CR will still show the older values that ocs-operator added earlier.

@OdedViner (Author):

@Madhu-1 I did not test this scenario. I can deploy a new cluster and test it. In addition, I will update the code based on your comment.

@OdedViner (Author):

@Madhu-1 @travisn @iamniting
Some tests failed after removing the if condition sc.Spec.ManagedResources.CephCluster.HealthCheck == nil.
Would you like to update the struct definition from:
HealthCheck *rookCephv1.CephClusterHealthCheckSpec `json:"healthCheck,omitempty"`
to:
HealthCheck rookCephv1.CephClusterHealthCheckSpec `json:"healthCheck,omitempty"`
to avoid working with a pointer?

Contributor:

Per my comment above, we need that nil check as you had it originally. Let's keep the pointer so we can tell whether the user has customized any settings.

@OdedViner (Author):

@travisn @Madhu-1 @iamniting I have incorporated Madhu and Travis's feedback into the code. Once everyone has reviewed and agreed on the changes, I will proceed to test it on the OCP cluster with a private image.

@openshift-ci openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label Jan 22, 2025
if sc.Spec.ManagedResources.CephCluster.HealthCheck != nil {
cephCluster.Spec.HealthCheck = *sc.Spec.ManagedResources.CephCluster.HealthCheck
} else {
cephCluster.Spec.HealthCheck = rookCephv1.CephClusterHealthCheckSpec{}
Contributor:

Since the cephCluster struct is created directly above and the HealthCheck is not a pointer, we actually don't need this else block. The cephCluster.Spec.HealthCheck will already be initialized to the empty struct.
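A tiny standalone illustration of the zero-value behavior this comment relies on (purely illustrative; it only shows that a non-pointer struct field starts out as the empty struct, which Rook then treats as "use the defaults"):

package main

import (
	"fmt"

	rookCephv1 "github.com/rook/rook/pkg/apis/ceph.rook.io/v1"
)

func main() {
	// HealthCheck is a value field on ClusterSpec, so a freshly created
	// spec already carries an empty CephClusterHealthCheckSpec; no else
	// branch is needed to reset it.
	var spec rookCephv1.ClusterSpec
	fmt.Printf("%+v\n", spec.HealthCheck)
}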

@OdedViner (Author):

@travisn @iamniting @Madhu-1 I have completed the code modifications. To validate these changes, I will perform the following steps on the OpenShift Container Platform (OCP) cluster using a private image:

  1. Create a StorageCluster Custom Resource (CR) without the HealthCheck block.

  2. Verify the corresponding CephCluster CR to ensure it reflects the absence of the HealthCheck configuration.

  3. Add the HealthCheck block to the existing StorageCluster CR.

  4. Check the CephCluster CR to confirm that the HealthCheck settings have been applied correctly.

  5. Remove the HealthCheck block from the StorageCluster CR.

  6. Inspect the CephCluster CR once more to verify that the HealthCheck configuration has been removed as expected.

These steps will help ensure that the HealthCheck functionality behaves as expected when added or removed from the StorageCluster configuration.

Please let me know if I can proceed with testing.

Contributor:

LGTM, thanks

@OdedViner (Author) commented Jan 23, 2025:

Create a private image based on the 4.18 branch and install the OCS operator using the private image.

git checkout -b healthceck_test upstream/release-4.18 
git cherry-pick bf7f3b10539fc0cd63bf49c2d640c37f9634cfa7
export REGISTRY_NAMESPACE=oviner
export IMAGE_TAG=test-health
make ocs-operator
podman push quay.io/$REGISTRY_NAMESPACE/ocs-operator:$IMAGE_TAG
make ocs-metrics-exporter
podman push quay.io/$REGISTRY_NAMESPACE/ocs-metrics-exporter:$IMAGE_TAG
make operator-bundle
podman push quay.io/$REGISTRY_NAMESPACE/ocs-operator-bundle:$IMAGE_TAG
make operator-catalog
podman push quay.io/$REGISTRY_NAMESPACE/ocs-operator-catalog:$IMAGE_TAG
oc label nodes compute-0 cluster.ocs.openshift.io/openshift-storage=''
oc label nodes compute-1 cluster.ocs.openshift.io/openshift-storage=''
oc label nodes compute-2 cluster.ocs.openshift.io/openshift-storage=''
make install
$ oc get csv 
NAME                         DISPLAY                       VERSION   REPLACES   PHASE
cephcsi-operator.v4.17.0     CephCSI operator              4.17.0               Succeeded
noobaa-operator.v5.18.0      NooBaa Operator               5.18.0               Succeeded
ocs-operator.v4.18.0         OpenShift Container Storage   4.18.0               Succeeded
rook-ceph-operator.v4.18.0   Rook-Ceph                     4.18.0               Succeeded
$ oc get csv ocs-operator.v4.18.0 -o yaml | grep -i image:
    containerImage: quay.io/oviner/ocs-operator:test-health
                image: quay.io/oviner/ocs-operator:test-health
                image: quay.io/oviner/ocs-operator:test-health
                image: quay.io/openshift/origin-oauth-proxy:4.16.0
  - image: quay.io/ocs-dev/rook-ceph:vmaster-793bbb006
  - image: quay.io/ceph/ceph:v18.2.0
  - image: quay.io/ocs-dev/ocs-must-gather:latest
  - image: quay.io/oviner/ocs-metrics-exporter:test-health
  - image: quay.io/openshift/origin-oauth-proxy:4.16.0
  - image: gcr.io/kubebuilder/kube-rbac-proxy:v0.13.0

Test Procedure:
1. Create a StorageCluster Custom Resource (CR) without the HealthCheck block.

---
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  annotations:
    storageclass.kubernetes.io/is-default-class: "false"
  name: thin-csi-odf
parameters:
  StoragePolicyName: "vSAN Default Storage Policy"
provisioner: csi.vsphere.vmware.com
allowVolumeExpansion: true
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer

---
apiVersion: ocs.openshift.io/v1
kind: StorageCluster
metadata:
  name: ocs-storagecluster
  namespace: openshift-storage
spec:
  resources:
    mds:
      Limits: null
      Requests: null
    mgr:
      Limits: null
      Requests: null
    mon:
      Limits: null
      Requests: null
    noobaa-core:
      Limits: null
      Requests: null
    noobaa-db:
      Limits: null
      Requests: null
    noobaa-endpoint:
      limits:
        cpu: 1
        memory: 500Mi
      requests:
        cpu: 1
        memory: 500Mi
    rgw:
      Limits: null
      Requests: null
  storageDeviceSets:
  - count: 1
    dataPVCTemplate:
      spec:
        accessModes:
        - ReadWriteOnce
        resources:
          requests:
            storage: 256Gi
        storageClassName: thin-csi-odf
        volumeMode: Block
    name: ocs-deviceset
    placement: {}
    portable: true
    replica: 3
    resources:
      Limits: null
      Requests: null

2. Verify the corresponding CephCluster CR to ensure it reflects the absence of the HealthCheck configuration.

$ oc get cephclusters.ceph.rook.io -o yaml | grep healthCheck -A 5
    healthCheck:
      daemonHealth:
        mon: {}
        osd: {}
        status: {}

3. Add the HealthCheck block to the existing StorageCluster CR.

$ oc edit storagecluster ocs-storagecluster 
spec:
  managedResources:
    cephBlockPools: {}
    cephCluster:
      healthCheck:
        daemonHealth:
          mon:
            timeout: 120s
            disabled: false

4. Check the CephCluster CR to confirm that the HealthCheck settings have been applied correctly.

$ oc get cephclusters.ceph.rook.io -o yaml | grep healthCheck -A 5
    healthCheck:
      daemonHealth:
        mon:
          timeout: 120s
        osd: {}
        status: {}

5. Remove the HealthCheck block from the StorageCluster CR.

spec:
  managedResources:
    cephCluster: {}

6. Inspect the CephCluster CR once more to verify that the HealthCheck configuration has been removed as expected.

oviner~/test$ oc get cephclusters.ceph.rook.io -o yaml | grep healthCheck -A 5
    healthCheck:
      daemonHealth:
        mon: {}
        osd: {}
        status: {}

7. Test the livenessProbe and startupProbe parameters:

$ oc get storagecluster -o yaml | grep -i healthcheck -A 20
        healthCheck:
          livenessProbe:
            mgr:
              disabled: false
              probe:
                initialDelaySeconds: 3
                periodSeconds: 3
          startupProbe:
            mgr:
              disabled: false
              probe:
                failureThreshold: 30
                initialDelaySeconds: 3
                periodSeconds: 3
$ oc get cephcluster -o yaml | grep -i healthcheck -A 20
    healthCheck:
      daemonHealth:
        mon: {}
        osd: {}
        status: {}
      livenessProbe:
        mgr:
          probe:
            initialDelaySeconds: 3
            periodSeconds: 3
      startupProbe:
        mgr:
          probe:
            failureThreshold: 30
            initialDelaySeconds: 3
            periodSeconds: 3

@travisn @Madhu-1 @iamniting please review

@OdedViner (Author):

@iamniting can we merge it?

Contributor:

The verification looks good to me, thanks!

enable advanced configuration of CephCluster healthCheck settings
via the StorageCluster spec, based on the Rook CephClusterHealthCheckSpec type.

Signed-off-by: Oded Viner <[email protected]>
@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Jan 27, 2025
openshift-ci bot commented Jan 27, 2025:

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: iamniting, OdedViner, travisn

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@iamniting (Member):

/unhold

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jan 27, 2025
@openshift-merge-bot openshift-merge-bot bot merged commit e27b002 into red-hat-storage:main Jan 27, 2025
11 checks passed