
Add Support for Configuring CephCluster HealthCheck Settings in OCS Operator #2940

Merged: 1 commit merged into red-hat-storage:main on Jan 27, 2025

Conversation

@OdedViner (Contributor) commented Jan 5, 2025:

This PR introduces support for advanced users to configure the healthCheck settings of the CephCluster directly through the StorageCluster spec. The healthCheck is based on the Rook CephClusterHealthCheckSpec type, providing flexibility to adjust various health check parameters for Ceph daemons.
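For orientation, here is a minimal sketch of the shape the API addition converges on after review. The enclosing struct name and the surrounding fields are assumptions for this sketch; the field name, pointer type, and json tag follow the review discussion below.

package v1

import (
	rookCephv1 "github.com/rook/rook/pkg/apis/ceph.rook.io/v1"
)

// ManageCephCluster (name assumed for this sketch) is the CephCluster section
// under spec.managedResources in the StorageCluster CRD.
type ManageCephCluster struct {
	// ... existing fields such as OsdMaintenanceTimeout ...

	// HealthCheck overrides the CephCluster health check settings.
	// A pointer is used so the operator can distinguish "unset" (nil,
	// meaning Rook defaults apply) from an explicit override.
	HealthCheck *rookCephv1.CephClusterHealthCheckSpec `json:"healthCheck,omitempty"`
}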

@OdedViner OdedViner force-pushed the sc_mon_time_out branch 9 times, most recently from 071ffe5 to 58f23d6 Compare January 5, 2025 17:00
@travisn (Contributor) left a comment:

See the discussion in https://issues.redhat.com/browse/DFBUGS-565 for a different approach to exposing this setting.

@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jan 7, 2025
@openshift-merge-robot openshift-merge-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jan 15, 2025
// cephCluster.spec.healthCheck.daemonHealth.mon.timeout in Rook Ceph.
// The default is 10 minutes. This field allows customization or disabling
// failover by setting the value to 0.
MonTimeout string `json:"monTimeout,omitempty"`
@travisn (Contributor) commented Jan 15, 2025:

From the discussion in https://issues.redhat.com/browse/DFBUGS-565, I thought we agreed to go with the full spec instead of defining a new value, something like this:

Suggested change
MonTimeout string `json:"monTimeout,omitempty"`
HealthCheck rookCephv1.CephClusterHealthCheckSpec `json:"healthCheck,omitempty"`

@OdedViner (Author):

done

@OdedViner OdedViner force-pushed the sc_mon_time_out branch 8 times, most recently from 0e1d180 to 2f1cdc2 Compare January 16, 2025 16:41
@travisn (Contributor) left a comment:

LGTM, just a nit

@@ -503,6 +503,10 @@ func newCephCluster(sc *ocsv1.StorageCluster, cephImage string, kmsConfigMap *co
cephCluster.Spec.DisruptionManagement.OSDMaintenanceTimeout = sc.Spec.ManagedResources.CephCluster.OsdMaintenanceTimeout
}

if sc.Spec.ManagedResources.CephCluster.HealthCheck != nil {
cephCluster.Spec.HealthCheck = *sc.Spec.ManagedResources.CephCluster.HealthCheck
Contributor:

nit: extra space, though I'm curious why the CI didn't catch it with go vet

Suggested change (whitespace only; the suggestion removes the extra space):
cephCluster.Spec.HealthCheck = *sc.Spec.ManagedResources.CephCluster.HealthCheck

@OdedViner (Author):

done

$ go fmt /home/oviner/DEV_REPOS/ocs-operator/controllers/storagecluster/cephcluster.go
controllers/storagecluster/cephcluster.go

@OdedViner OdedViner force-pushed the sc_mon_time_out branch 2 times, most recently from f0f4ec1 to 95a4367 Compare January 19, 2025 11:46
@OdedViner OdedViner changed the title Add monTimeout param in storagecluster CR Add Support for Configuring CephCluster HealthCheck Settings in OCS Operator Jan 19, 2025
@OdedViner OdedViner force-pushed the sc_mon_time_out branch 3 times, most recently from c8c1f79 to d23613b Compare January 19, 2025 12:03
@openshift-ci openshift-ci bot added lgtm Indicates that a PR is ready to be merged. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Jan 19, 2025
@openshift-ci openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label Jan 19, 2025
@OdedViner OdedViner requested a review from travisn January 20, 2025 13:21
@travisn (Contributor) left a comment:

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Jan 21, 2025
@iamniting (Member):

/hold

Adding a hold for testing; please remove the hold once it has been tested.

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jan 21, 2025
@@ -503,6 +503,10 @@ func newCephCluster(sc *ocsv1.StorageCluster, cephImage string, kmsConfigMap *co
cephCluster.Spec.DisruptionManagement.OSDMaintenanceTimeout = sc.Spec.ManagedResources.CephCluster.OsdMaintenanceTimeout
}

if sc.Spec.ManagedResources.CephCluster.HealthCheck != nil {
cephCluster.Spec.HealthCheck = *sc.Spec.ManagedResources.CephCluster.HealthCheck
Member:

Can a user set nil to disable the health check? Is that allowed?

Contributor:

nil would cause Rook's defaults to be used. To disable the health checks, there are subsettings for enabled: false. See the example here.
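For illustration, a sketch of how such an override could be constructed in Go. The YAML key is disabled (as the test output later in this thread shows); the Go field names below are assumptions based on the Rook v1 API and should be checked against the vendored types.

package example

import (
	rookCephv1 "github.com/rook/rook/pkg/apis/ceph.rook.io/v1"
)

// buildHealthCheckOverride returns an override that disables the mon
// health check entirely; returning nil instead keeps Rook's defaults.
// (Field names assumed; verify against the vendored Rook types.)
func buildHealthCheckOverride() *rookCephv1.CephClusterHealthCheckSpec {
	return &rookCephv1.CephClusterHealthCheckSpec{
		DaemonHealth: rookCephv1.DaemonHealthSpec{
			// "mon" in YAML; Disabled turns the mon failover check off.
			Monitor: rookCephv1.HealthCheckSpec{Disabled: true},
		},
	}
}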

Member:

@travisn in that case, how can someone go back to the defaults? Since there is a nil check here, it won't allow for that, will it?

Contributor:

I had the following comment yesterday, but missed submitting it. Looks like the conversation is already resolved though...

If sc.Spec.ManagedResources.CephCluster.HealthCheck == nil, the CephCluster CR would be generated without those settings, which would cause Rook to use its defaults, the same as before this change. cephCluster.Spec.HealthCheck is not nullable, so we can't set it to nil; leaving the struct empty effectively enables the defaults.

@OdedViner (Author):

I will try to test this scenario on a live cluster.

Member:

I was just wondering about the case where a user creates the StorageCluster without sc.Spec.ManagedResources.CephCluster.HealthCheck, later sets it in the StorageCluster and gets the expected changes in the CephCluster, and then removes the HealthCheck setting from the StorageCluster again. At that point sc.Spec.ManagedResources.CephCluster.HealthCheck will presumably be nil, but ocs-operator does not revert the changes on the CephCluster CR, so the CephCluster CR will still show the older values that ocs-operator added earlier.

@OdedViner (Author):

@Madhu-1 I did not test this scenario. I can deploy a new cluster and test it. In addition, I will update the code based on your comment.

@OdedViner (Author):

@Madhu-1 @travisn @iamniting
Some tests failed after removing the if condition sc.Spec.ManagedResources.CephCluster.HealthCheck == nil.
Would you like to update the struct definition from:
HealthCheck *rookCephv1.CephClusterHealthCheckSpec `json:"healthCheck,omitempty"`
to:
HealthCheck rookCephv1.CephClusterHealthCheckSpec `json:"healthCheck,omitempty"`
to avoid working with a pointer?

Contributor:

Per my comment above, we need that nil check as you had it originally. Let's keep the pointer so we can tell whether the user has customized any settings.

@OdedViner (Author):

@travisn @Madhu-1 @iamniting I have incorporated Madhu and Travis's feedback into the code. Once everyone has reviewed and agreed on the changes, I will proceed to test it on the OCP cluster with a private image.

@openshift-ci openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label Jan 22, 2025
if sc.Spec.ManagedResources.CephCluster.HealthCheck != nil {
cephCluster.Spec.HealthCheck = *sc.Spec.ManagedResources.CephCluster.HealthCheck
} else {
cephCluster.Spec.HealthCheck = rookCephv1.CephClusterHealthCheckSpec{}
Contributor:

Since the cephCluster struct is created directly above and the HealthCheck is not a pointer, we actually don't need this else block. The cephCluster.Spec.HealthCheck will already be initialized to the empty struct.
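A tiny standalone illustration of the zero-value behavior this comment relies on (purely illustrative; it only shows that a non-pointer struct field starts out as the empty struct, which Rook then treats as "use the defaults"):

package main

import (
	"fmt"

	rookCephv1 "github.com/rook/rook/pkg/apis/ceph.rook.io/v1"
)

func main() {
	// HealthCheck is a value field on ClusterSpec, so a freshly created
	// spec already carries an empty CephClusterHealthCheckSpec; no else
	// branch is needed to reset it.
	var spec rookCephv1.ClusterSpec
	fmt.Printf("%+v\n", spec.HealthCheck)
}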

@OdedViner (Author):

@travisn @iamniting @Madhu-1 I have completed the code modifications. To validate these changes, I will perform the following steps on the OpenShift Container Platform (OCP) cluster using a private image:

  1. Create a StorageCluster Custom Resource (CR) without the HealthCheck block.

  2. Verify the corresponding CephCluster CR to ensure it reflects the absence of the HealthCheck configuration.

  3. Add the HealthCheck block to the existing StorageCluster CR.

  4. Check the CephCluster CR to confirm that the HealthCheck settings have been applied correctly.

  5. Remove the HealthCheck block from the StorageCluster CR.

  6. Inspect the CephCluster CR once more to verify that the HealthCheck configuration has been removed as expected.

These steps will help ensure that the HealthCheck functionality behaves as expected when added or removed from the StorageCluster configuration.

Please let me know if I can proceed with testing.

Contributor:

LGTM, thanks

@OdedViner (Author) commented Jan 23, 2025:

Create a private image based on the 4.18 branch and install the OCS operator using the private image.

git checkout -b healthceck_test upstream/release-4.18 
git cherry-pick bf7f3b10539fc0cd63bf49c2d640c37f9634cfa7
export REGISTRY_NAMESPACE=oviner
export IMAGE_TAG=test-health
make ocs-operator
podman push quay.io/$REGISTRY_NAMESPACE/ocs-operator:$IMAGE_TAG
make ocs-metrics-exporter
podman push quay.io/$REGISTRY_NAMESPACE/ocs-metrics-exporter:$IMAGE_TAG
make operator-bundle
podman push quay.io/$REGISTRY_NAMESPACE/ocs-operator-bundle:$IMAGE_TAG
make operator-catalog
podman push quay.io/$REGISTRY_NAMESPACE/ocs-operator-catalog:$IMAGE_TAG
oc label nodes compute-0 cluster.ocs.openshift.io/openshift-storage=''
oc label nodes compute-1 cluster.ocs.openshift.io/openshift-storage=''
oc label nodes compute-2 cluster.ocs.openshift.io/openshift-storage=''
make install
$ oc get csv 
NAME                         DISPLAY                       VERSION   REPLACES   PHASE
cephcsi-operator.v4.17.0     CephCSI operator              4.17.0               Succeeded
noobaa-operator.v5.18.0      NooBaa Operator               5.18.0               Succeeded
ocs-operator.v4.18.0         OpenShift Container Storage   4.18.0               Succeeded
rook-ceph-operator.v4.18.0   Rook-Ceph                     4.18.0               Succeeded
$ oc get csv ocs-operator.v4.18.0 -o yaml | grep -i image:
    containerImage: quay.io/oviner/ocs-operator:test-health
                image: quay.io/oviner/ocs-operator:test-health
                image: quay.io/oviner/ocs-operator:test-health
                image: quay.io/openshift/origin-oauth-proxy:4.16.0
  - image: quay.io/ocs-dev/rook-ceph:vmaster-793bbb006
  - image: quay.io/ceph/ceph:v18.2.0
  - image: quay.io/ocs-dev/ocs-must-gather:latest
  - image: quay.io/oviner/ocs-metrics-exporter:test-health
  - image: quay.io/openshift/origin-oauth-proxy:4.16.0
  - image: gcr.io/kubebuilder/kube-rbac-proxy:v0.13.0

Test Procedure:
1. Create a StorageCluster Custom Resource (CR) without the HealthCheck block.

---
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  annotations:
    storageclass.kubernetes.io/is-default-class: "false"
  name: thin-csi-odf
parameters:
  StoragePolicyName: "vSAN Default Storage Policy"
provisioner: csi.vsphere.vmware.com
allowVolumeExpansion: true
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer

---
apiVersion: ocs.openshift.io/v1
kind: StorageCluster
metadata:
  name: ocs-storagecluster
  namespace: openshift-storage
spec:
  resources:
    mds:
      Limits: null
      Requests: null
    mgr:
      Limits: null
      Requests: null
    mon:
      Limits: null
      Requests: null
    noobaa-core:
      Limits: null
      Requests: null
    noobaa-db:
      Limits: null
      Requests: null
    noobaa-endpoint:
      limits:
        cpu: 1
        memory: 500Mi
      requests:
        cpu: 1
        memory: 500Mi
    rgw:
      Limits: null
      Requests: null
  storageDeviceSets:
  - count: 1
    dataPVCTemplate:
      spec:
        accessModes:
        - ReadWriteOnce
        resources:
          requests:
            storage: 256Gi
        storageClassName: thin-csi-odf
        volumeMode: Block
    name: ocs-deviceset
    placement: {}
    portable: true
    replica: 3
    resources:
      Limits: null
      Requests: null

2. Verify the corresponding CephCluster CR to ensure it reflects the absence of the HealthCheck configuration.

$ oc get cephclusters.ceph.rook.io -o yaml | grep healthCheck -A 5
    healthCheck:
      daemonHealth:
        mon: {}
        osd: {}
        status: {}

3. Add the HealthCheck block to the existing StorageCluster CR.

$ oc edit storagecluster ocs-storagecluster 
spec:
  managedResources:
    cephBlockPools: {}
    cephCluster:
      healthCheck:
        daemonHealth:
          mon:
            timeout: 120s
            disabled: false

4. Check the CephCluster CR to confirm that the HealthCheck settings have been applied correctly.

$ oc get cephclusters.ceph.rook.io -o yaml | grep healthCheck -A 5
    healthCheck:
      daemonHealth:
        mon:
          timeout: 120s
        osd: {}
        status: {}

5. Remove the HealthCheck block from the StorageCluster CR.

spec:
  managedResources:
    cephCluster: {}

6. Inspect the CephCluster CR once more to verify that the HealthCheck configuration has been removed as expected.

oviner~/test$ oc get cephclusters.ceph.rook.io -o yaml | grep healthCheck -A 5
    healthCheck:
      daemonHealth:
        mon: {}
        osd: {}
        status: {}

7. Test the livenessProbe and startupProbe parameters:

$ oc get storagecluster -o yaml | grep -i healthcheck -A 20
        healthCheck:
          livenessProbe:
            mgr:
              disabled: false
              probe:
                initialDelaySeconds: 3
                periodSeconds: 3
          startupProbe:
            mgr:
              disabled: false
              probe:
                failureThreshold: 30
                initialDelaySeconds: 3
                periodSeconds: 3
$ oc get cephcluster -o yaml | grep -i healthcheck -A 20
    healthCheck:
      daemonHealth:
        mon: {}
        osd: {}
        status: {}
      livenessProbe:
        mgr:
          probe:
            initialDelaySeconds: 3
            periodSeconds: 3
      startupProbe:
        mgr:
          probe:
            failureThreshold: 30
            initialDelaySeconds: 3
            periodSeconds: 3

@travisn @Madhu-1 @iamniting please review

@OdedViner (Author):

@iamniting can we merge it?

Contributor:

The verification looks good to me, thanks!

enable advanced configuration of CephCluster healthCheck settings
via the StorageCluster spec, based on the Rook CephClusterHealthCheckSpec type.

Signed-off-by: Oded Viner <[email protected]>
@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Jan 27, 2025
openshift-ci bot commented Jan 27, 2025:

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: iamniting, OdedViner, travisn

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@iamniting (Member):

/unhold

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jan 27, 2025
@openshift-merge-bot openshift-merge-bot bot merged commit e27b002 into red-hat-storage:main Jan 27, 2025
11 checks passed