Add Support for Configuring CephCluster HealthCheck Settings in OCS Operator #2940
Conversation
Force-pushed 071ffe5 to 58f23d6
See discussion in https://issues.redhat.com/browse/DFBUGS-565 for a different approach to expose this setting
Force-pushed 58f23d6 to 8c04024
api/v1/storagecluster_types.go (outdated)
	// cephCluster.spec.healthCheck.daemonHealth.mon.timeout in Rook Ceph.
	// The default is 10 minutes. This field allows customization or disabling
	// failover by setting the value to 0.
	MonTimeout string `json:"monTimeout,omitempty"`
From the discussion in https://issues.redhat.com/browse/DFBUGS-565 I thought we agreed to go with the full spec instead of defining a new value, something like this:
- MonTimeout string `json:"monTimeout,omitempty"`
+ HealthCheck rookCephv1.CephClusterHealthCheckSpec `json:"monTimeout,omitempty"`
done
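For context, a minimal sketch of what the full-spec field suggested above might look like on the StorageCluster API. The surrounding struct name and package are illustrative assumptions, and the json tag is shown as healthCheck rather than the monTimeout tag carried over in the suggestion:

package v1

import (
	rookCephv1 "github.com/rook/rook/pkg/apis/ceph.rook.io/v1"
)

// ManageCephCluster is illustrative of the spec.managedResources.cephCluster section.
type ManageCephCluster struct {
	// HealthCheck passes Rook's full health check configuration through,
	// instead of exposing a single monTimeout string.
	HealthCheck rookCephv1.CephClusterHealthCheckSpec `json:"healthCheck,omitempty"`
}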
Force-pushed 0e1d180 to 2f1cdc2
LGTM, just a nit
@@ -503,6 +503,10 @@ func newCephCluster(sc *ocsv1.StorageCluster, cephImage string, kmsConfigMap *co
		cephCluster.Spec.DisruptionManagement.OSDMaintenanceTimeout = sc.Spec.ManagedResources.CephCluster.OsdMaintenanceTimeout
	}

	if sc.Spec.ManagedResources.CephCluster.HealthCheck != nil {
		cephCluster.Spec.HealthCheck = *sc.Spec.ManagedResources.CephCluster.HealthCheck
nit: extra space, though I'm curious why the CI didn't catch it with go vet
- cephCluster.Spec.HealthCheck =  *sc.Spec.ManagedResources.CephCluster.HealthCheck
+ cephCluster.Spec.HealthCheck = *sc.Spec.ManagedResources.CephCluster.HealthCheck
done
$ go fmt /home/oviner/DEV_REPOS/ocs-operator/controllers/storagecluster/cephcluster.go
controllers/storagecluster/cephcluster.go
Force-pushed f0f4ec1 to 95a4367
Force-pushed c8c1f79 to d23613b
Force-pushed d23613b to ba1cd01
/lgtm
/hold adding hold for testing, please remove hold once it is tested
@@ -503,6 +503,10 @@ func newCephCluster(sc *ocsv1.StorageCluster, cephImage string, kmsConfigMap *co
		cephCluster.Spec.DisruptionManagement.OSDMaintenanceTimeout = sc.Spec.ManagedResources.CephCluster.OsdMaintenanceTimeout
	}

	if sc.Spec.ManagedResources.CephCluster.HealthCheck != nil {
		cephCluster.Spec.HealthCheck = *sc.Spec.ManagedResources.CephCluster.HealthCheck
Can a user set nil to disable healthcheck? Is that allowed?
nil would cause Rook's defaults to be used. To disable the health checks, there are subsettings for enabled: false. See the example here.
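To make those sub-settings concrete, here is a minimal sketch in Go of customizing or disabling the mon health check. The field names (DaemonHealth, Monitor, Timeout, Disabled) are my reading of Rook's v1 API, not something confirmed in this thread:

package main

import (
	"fmt"

	rookCephv1 "github.com/rook/rook/pkg/apis/ceph.rook.io/v1"
)

func main() {
	// Raise the mon failover timeout; setting Disabled to true would turn the
	// mon health check off entirely instead of relying on a nil spec.
	hc := rookCephv1.CephClusterHealthCheckSpec{
		DaemonHealth: rookCephv1.DaemonHealthSpec{
			Monitor: rookCephv1.HealthCheckSpec{
				Timeout:  "600s",
				Disabled: false,
			},
		},
	}
	fmt.Printf("%+v\n", hc)
}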
@travisn in that case, how can someone go back to the default? Since there is a nil check here, it won't allow for that?
I had the following comment yesterday, but missed submitting it. Looks like the conversation is already resolved though...
If sc.Spec.ManagedResources.CephCluster.HealthCheck == nil, the CephCluster CR would be generated without the settings, which would cause Rook to use the defaults, the same as before this change. cephCluster.Spec.HealthCheck is not nullable, so we can't set it to nil. Leaving the struct empty will effectively enable the defaults.
I will try to test this scenario on a live cluster.
I was just wondering about the case where the user creates the StorageCluster without sc.Spec.ManagedResources.CephCluster.HealthCheck, later sets it in the StorageCluster and gets the expected changes in the CephCluster, and then removes the HealthCheck setting from the StorageCluster again. At that point sc.Spec.ManagedResources.CephCluster.HealthCheck will presumably be nil, but ocs-operator doesn't revert the changes on the CephCluster CR, which will still show the older values that ocs-operator added earlier.
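A sketch of one way to address this: if the operator always assigns the field during reconcile, then writing the zero value when the user's setting is nil makes the CephCluster CR fall back to Rook's defaults instead of keeping stale values. The function name, package, and shape below are illustrative, not the PR's actual code:

package storagecluster

import (
	rookCephv1 "github.com/rook/rook/pkg/apis/ceph.rook.io/v1"
)

// applyHealthCheck always writes the health check section of the CephCluster
// spec, so removing the block from the StorageCluster reverts the CephCluster
// to Rook's defaults on the next reconcile.
func applyHealthCheck(cephCluster *rookCephv1.CephCluster, userSpec *rookCephv1.CephClusterHealthCheckSpec) {
	if userSpec != nil {
		cephCluster.Spec.HealthCheck = *userSpec
		return
	}
	// nil => reset to the zero value; Rook then applies its built-in defaults.
	cephCluster.Spec.HealthCheck = rookCephv1.CephClusterHealthCheckSpec{}
}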
@Madhu-1 I did not test this scenario. I can deploy a new cluster and test it. In addition, I will update the code based on your comment.
@Madhu-1 @travisn @iamniting
Some tests failed after removing the if condition sc.Spec.ManagedResources.CephCluster.HealthCheck == nil.
Would you like to update the struct definition from:
HealthCheck *rookCephv1.CephClusterHealthCheckSpec `json:"healthCheck,omitempty"`
to:
HealthCheck rookCephv1.CephClusterHealthCheckSpec `json:"healthCheck,omitempty"`
to avoid working with a pointer?
Per my comment above, we need that check for nil as you had it originally. Let's keep the pointer, so we can better know if the user has any customized settings or not.
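A minimal sketch of the pointer variant that is kept, using the same illustrative struct name as above: nil signals "no customization", while a non-nil value, even an empty one, means the user set something explicitly.

package v1

import (
	rookCephv1 "github.com/rook/rook/pkg/apis/ceph.rook.io/v1"
)

// ManageCephCluster is illustrative of the spec.managedResources.cephCluster section.
type ManageCephCluster struct {
	// nil => no user customization, Rook defaults apply.
	// set => copied verbatim into the CephCluster CR by the operator.
	HealthCheck *rookCephv1.CephClusterHealthCheckSpec `json:"healthCheck,omitempty"`
}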
@travisn @Madhu-1 @iamniting I have incorporated Madhu and Travis's feedback into the code. Once everyone has reviewed and agreed on the changes, I will proceed to test it on the OCP cluster with a private image.
Force-pushed ba1cd01 to df4d39e
Force-pushed df4d39e to 239a822
	if sc.Spec.ManagedResources.CephCluster.HealthCheck != nil {
		cephCluster.Spec.HealthCheck = *sc.Spec.ManagedResources.CephCluster.HealthCheck
	} else {
		cephCluster.Spec.HealthCheck = rookCephv1.CephClusterHealthCheckSpec{}
Since the cephCluster struct is created directly above and the HealthCheck is not a pointer, we actually don't need this else block. The cephCluster.Spec.HealthCheck will already be initialized to the empty struct.
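The simplified form this suggests, again only a sketch using the illustrative helper from above: because a freshly constructed CephCluster already has a zero-value HealthCheck (which means Rook defaults), only the non-nil case needs handling.

package storagecluster

import (
	rookCephv1 "github.com/rook/rook/pkg/apis/ceph.rook.io/v1"
)

// applyHealthCheck without the else branch: the zero value is already in
// place on a freshly built CephCluster, so it only copies user settings.
func applyHealthCheck(cephCluster *rookCephv1.CephCluster, userSpec *rookCephv1.CephClusterHealthCheckSpec) {
	if userSpec != nil {
		cephCluster.Spec.HealthCheck = *userSpec
	}
}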
@travisn @iamniting @Madhu-1 I have completed the code modifications. To validate these changes, I will perform the following steps on the OpenShift Container Platform (OCP) cluster using a private image:
1. Create a StorageCluster Custom Resource (CR) without the HealthCheck block.
2. Verify the corresponding CephCluster CR to ensure it reflects the absence of the HealthCheck configuration.
3. Add the HealthCheck block to the existing StorageCluster CR.
4. Check the CephCluster CR to confirm that the HealthCheck settings have been applied correctly.
5. Remove the HealthCheck block from the StorageCluster CR.
6. Inspect the CephCluster CR once more to verify that the HealthCheck configuration has been removed as expected.

These steps will help ensure that the HealthCheck functionality behaves as expected when added to or removed from the StorageCluster configuration.
Please let me know if I can proceed with testing.
LGTM, thanks
Create a private image based on the 4.18 branch and install the OCS operator using the private image.
git checkout -b healthceck_test upstream/release-4.18
git cherry-pick bf7f3b10539fc0cd63bf49c2d640c37f9634cfa7
export REGISTRY_NAMESPACE=oviner
export IMAGE_TAG=test-health
make ocs-operator
podman push quay.io/$REGISTRY_NAMESPACE/ocs-operator:$IMAGE_TAG
make ocs-metrics-exporter
podman push quay.io/$REGISTRY_NAMESPACE/ocs-metrics-exporter:$IMAGE_TAG
make operator-bundle
podman push quay.io/$REGISTRY_NAMESPACE/ocs-operator-bundle:$IMAGE_TAG
make operator-catalog
podman push quay.io/$REGISTRY_NAMESPACE/ocs-operator-catalog:$IMAGE_TAG
oc label nodes compute-0 cluster.ocs.openshift.io/openshift-storage=''
oc label nodes compute-1 cluster.ocs.openshift.io/openshift-storage=''
oc label nodes compute-2 cluster.ocs.openshift.io/openshift-storage=''
make install
$ oc get csv
NAME DISPLAY VERSION REPLACES PHASE
cephcsi-operator.v4.17.0 CephCSI operator 4.17.0 Succeeded
noobaa-operator.v5.18.0 NooBaa Operator 5.18.0 Succeeded
ocs-operator.v4.18.0 OpenShift Container Storage 4.18.0 Succeeded
rook-ceph-operator.v4.18.0 Rook-Ceph 4.18.0 Succeeded
$ oc get csv ocs-operator.v4.18.0 -o yaml | grep -i image:
containerImage: quay.io/oviner/ocs-operator:test-health
image: quay.io/oviner/ocs-operator:test-health
image: quay.io/oviner/ocs-operator:test-health
image: quay.io/openshift/origin-oauth-proxy:4.16.0
- image: quay.io/ocs-dev/rook-ceph:vmaster-793bbb006
- image: quay.io/ceph/ceph:v18.2.0
- image: quay.io/ocs-dev/ocs-must-gather:latest
- image: quay.io/oviner/ocs-metrics-exporter:test-health
- image: quay.io/openshift/origin-oauth-proxy:4.16.0
- image: gcr.io/kubebuilder/kube-rbac-proxy:v0.13.0
Test Procedure:
1. Create a StorageCluster Custom Resource (CR) without the HealthCheck block.
---
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
annotations:
storageclass.kubernetes.io/is-default-class: "false"
name: thin-csi-odf
parameters:
StoragePolicyName: "vSAN Default Storage Policy"
provisioner: csi.vsphere.vmware.com
allowVolumeExpansion: true
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer
---
apiVersion: ocs.openshift.io/v1
kind: StorageCluster
metadata:
name: ocs-storagecluster
namespace: openshift-storage
spec:
resources:
mds:
Limits: null
Requests: null
mgr:
Limits: null
Requests: null
mon:
Limits: null
Requests: null
noobaa-core:
Limits: null
Requests: null
noobaa-db:
Limits: null
Requests: null
noobaa-endpoint:
limits:
cpu: 1
memory: 500Mi
requests:
cpu: 1
memory: 500Mi
rgw:
Limits: null
Requests: null
storageDeviceSets:
- count: 1
dataPVCTemplate:
spec:
accessModes:
- ReadWriteOnce
resources:
requests:
storage: 256Gi
storageClassName: thin-csi-odf
volumeMode: Block
name: ocs-deviceset
placement: {}
portable: true
replica: 3
resources:
Limits: null
Requests: null
2. Verify the corresponding CephCluster CR to ensure it reflects the absence of the HealthCheck configuration.
$ oc get cephclusters.ceph.rook.io -o yaml | grep healthCheck -A 5
healthCheck:
daemonHealth:
mon: {}
osd: {}
status: {}
3. Add the HealthCheck block to the existing StorageCluster CR.
$ oc edit storagecluster ocs-storagecluster
spec:
managedResources:
cephBlockPools: {}
cephCluster:
healthCheck:
daemonHealth:
mon:
timeout: 120s
disabled: false
4. Check the CephCluster CR to confirm that the HealthCheck settings have been applied correctly.
$ oc get cephclusters.ceph.rook.io -o yaml | grep healthCheck -A 5
healthCheck:
daemonHealth:
mon:
timeout: 120s
osd: {}
status: {}
5. Remove the HealthCheck block from the StorageCluster CR.
spec:
managedResources:
cephCluster: {}
6. Inspect the CephCluster CR once more to verify that the HealthCheck configuration has been removed as expected.
oviner~/test$ oc get cephclusters.ceph.rook.io -o yaml | grep healthCheck -A 5
healthCheck:
daemonHealth:
mon: {}
osd: {}
status: {}
7. Test the livenessProbe and startupProbe params:
$ oc get storagecluster -o yaml | grep -i healthcheck -A 20
healthCheck:
livenessProbe:
mgr:
disabled: false
probe:
initialDelaySeconds: 3
periodSeconds: 3
startupProbe:
mgr:
disabled: false
probe:
failureThreshold: 30
initialDelaySeconds: 3
periodSeconds: 3
$ oc get cephcluster -o yaml | grep -i healthcheck -A 20
healthCheck:
daemonHealth:
mon: {}
osd: {}
status: {}
livenessProbe:
mgr:
probe:
initialDelaySeconds: 3
periodSeconds: 3
startupProbe:
mgr:
probe:
failureThreshold: 30
initialDelaySeconds: 3
periodSeconds: 3
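For reference, a sketch of how the probe overrides verified above would look when built with Rook's Go types. The map key ("mgr") and the ProbeSpec shape reflect my assumptions about the Rook API, not something stated in this PR:

package main

import (
	"fmt"

	rookCephv1 "github.com/rook/rook/pkg/apis/ceph.rook.io/v1"
	corev1 "k8s.io/api/core/v1"
)

func main() {
	// Mirror the mgr liveness/startup probe settings used in the test above.
	hc := rookCephv1.CephClusterHealthCheckSpec{
		LivenessProbe: map[rookCephv1.KeyType]*rookCephv1.ProbeSpec{
			rookCephv1.KeyType("mgr"): {
				Disabled: false,
				Probe:    &corev1.Probe{InitialDelaySeconds: 3, PeriodSeconds: 3},
			},
		},
		StartupProbe: map[rookCephv1.KeyType]*rookCephv1.ProbeSpec{
			rookCephv1.KeyType("mgr"): {
				Disabled: false,
				Probe:    &corev1.Probe{FailureThreshold: 30, InitialDelaySeconds: 3, PeriodSeconds: 3},
			},
		},
	}
	fmt.Printf("%+v\n", hc)
}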
@travisn @Madhu-1 @iamniting please review
@iamniting can we merge it?
The verification looks good to me, thanks!
Force-pushed 239a822 to f7894b0
enable advanced configuration of CephCluster healthCheck settings via StorageCluster spec. based on Rook CephClusterHealthCheckSpec type. Signed-off-by: Oded Viner <[email protected]>
Force-pushed f7894b0 to bf7f3b1
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: iamniting, OdedViner, travisn The full list of commands accepted by this bot can be found here. The pull request process is described here
/unhold
Merged e27b002 into red-hat-storage:main
This PR introduces support for advanced users to configure the healthCheck settings of the CephCluster directly through the StorageCluster spec. The healthCheck is based on the Rook CephClusterHealthCheckSpec type, providing flexibility to adjust various health check parameters for Ceph daemons.