Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Run manager with 2 replicas #180

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

k-keiichi-rh
Copy link
Contributor

@k-keiichi-rh k-keiichi-rh commented Jan 27, 2024

This pr is to run the manager with 2 replicas for high availability.
We are testing this if our issue is resolved in https://issues.redhat.com/browse/ECOPROJECT-1855.

Copy link
Contributor

openshift-ci bot commented Jan 27, 2024

Hi @k-keiichi-rh. Thanks for your PR.

I'm waiting for a medik8s member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@mshitrit mshitrit marked this pull request as draft January 28, 2024 07:58
@mshitrit
Copy link
Member

/test 4.14-openshift-e2e

@k-keiichi-rh k-keiichi-rh changed the title [WIP] Run manager with 2 replicas Run manager with 2 replicas Feb 22, 2024
Signed-off-by: Keiichi Kii <[email protected]>
Signed-off-by: Keiichi Kii <[email protected]>
@k-keiichi-rh
Copy link
Contributor Author

Just increasing replicas to 2 is not enough to resolve ECOPROJECT-1855. AntiAffinity is not configured to the deployment. So 2 of the controllers may be executed on the same node.
Now it's changed to configure podAntiAffinity according to https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/

@mshitrit
Copy link
Member

/test 4.14-openshift-e2e

@mshitrit
Copy link
Member

Just increasing replicas to 2 is not enough to resolve ECOPROJECT-1855. AntiAffinity is not configured to the deployment. So 2 of the controllers may be executed on the same node.

That's an interesting point we should probably consider that for our other operators which are using the same approach.

However from what I understand the change (i.e anti-affinity for control-plane) only makes sure the manager pod isn't running on a control plane node, so I'm not sure how it would solve the problem.
Am I missing something ?

@k-keiichi-rh
Copy link
Contributor Author

k-keiichi-rh commented Feb 22, 2024

The following configuration ensures that no two pods with the label "control-plane: controller-manager" can't be scheduled on the same node(host).

              affinity:
                podAntiAffinity:
                  requiredDuringSchedulingIgnoredDuringExecution:
                  - labelSelector:
                      matchExpressions:
                      - key: control-plane
                        operator: In
                        values:
                        - controller-manager
                    topologyKey: kubernetes.io/hostname

For example, it's trying to deploy two manager pods, but the only one worker is available now.
In my env, the manager pod doesn't have a matching toleration with the "node-role.kubernetes.io/master", so the manager pods can be deployed on worker nodes only.

[root@lbdns self-node-remediation]# oc get node
NAME       STATUS                     ROLES                  AGE   VERSION
master-0   Ready                      control-plane,master   43d   v1.27.6+f67aeb3
master-1   Ready                      control-plane,master   43d   v1.27.6+f67aeb3
master-2   Ready                      control-plane,master   43d   v1.27.6+f67aeb3
worker-0   Ready                      worker                 43d   v1.27.6+f67aeb3
worker-1   Ready,SchedulingDisabled   worker                 43d   v1.27.6+f67aeb3
worker-2   Ready,SchedulingDisabled   worker                 43d   v1.27.6+f67aeb3

Before applying this patch, both of manager pods can be deployed on the worker-0, but now the above AntiAffinity is configured to the deployment. So the 2nd manager pod can't be deployed on the worker-0 and it becomes the "Pending" status.

[root@lbdns self-node-remediation]# oc get pods -o wide
NAME                                                              READY   STATUS      RESTARTS   AGE     IP            NODE       NOMINATED NODE   READINESS GATES
self-node-remediation-controller-manager-88646cd74-cs5m8          1/1     Running     0          20m     10.128.2.69   worker-0   <none>           <none>
self-node-remediation-controller-manager-88646cd74-jbglq          0/1     Pending     0          2m29s   <none>        <none>     <none>           <none>

The 1st manager pod, self-node-remediation-controller-manager-88646cd74-cs5m8, is already deployed on the worker-0.
It has the "control-plane=controller-manager" label. So any pod with the "control-plane=controller-manager" label won't be deployed on the worker-0. This is the reason why the 2nd manager pod is in the "Pending" status.

[root@lbdns self-node-remediation]# oc describe pod self-node-remediation-controller-manager-88646cd74-cs5m8
Name:                 self-node-remediation-controller-manager-88646cd74-cs5m8
Namespace:            openshift-operators
Priority:             2000000000
Priority Class Name:  system-cluster-critical
Service Account:      self-node-remediation-controller-manager
Node:                 worker-0/192.168.28.50
Start Time:           Thu, 22 Feb 2024 16:10:31 +0000
Labels:               control-plane=controller-manager
                      pod-template-hash=88646cd74
                      self-node-remediation-operator=

We can confirm the events if the AntiAffinity is working well as expected.

[root@lbdns self-node-remediation]# oc describe pod self-node-remediation-controller-manager-88646cd74-jbglq
...
Tolerations:                 node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                             node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason            Age                  From               Message
  ----     ------            ----                 ----               -------
  Warning  FailedScheduling  77s (x4 over 4m31s)  default-scheduler  0/6 nodes are available: 1 node(s) didn't match pod anti-affinity rules, 2 node(s) were unschedulable, 3 node(s) had untolerated taint {node-role.kubernetes.io/master: }. preemption: 0/6 nodes are available: 1 node(s) didn't match pod anti-affinity rules, 5 Preemption is not helpful for scheduling..

After one more worker node(worker-1) is available, the 2nd manager pod will be scheduled on the worker-1 soon.

[root@lbdns self-node-remediation]# oc adm uncordon worker-1
node/worker-1 uncordoned
[root@lbdns self-node-remediation]# oc get pods -o wide
NAME                                                              READY   STATUS      RESTARTS   AGE   IP            NODE       NOMINATED NODE   READINESS GATES
ed37de5c79174808125e3dbffd6aef0c79224569c22a7bd5a9056f7af1gh962   0/1     Completed   0          15h   10.128.2.10   worker-0   <none>           <none>
quay-io-kkii-self-node-remediation-operator-bundle-latest         1/1     Running     0          15h   10.128.2.9    worker-0   <none>           <none>
self-node-remediation-controller-manager-88646cd74-cs5m8          1/1     Running     0          42m   10.128.2.69   worker-0   <none>           <none>
self-node-remediation-controller-manager-88646cd74-jbglq          0/1     Running     0          24m   10.131.0.21   worker-1   <none>           <none>
self-node-remediation-ds-c74fx                                    1/1     Running     0          15h   10.129.2.17   worker-2   <none>           <none>
self-node-remediation-ds-jgjlj                                    1/1     Running     0          15h   10.131.0.16   worker-1   <none>           <none>
self-node-remediation-ds-nbslh                                    1/1     Running     0          15h   10.129.0.44   master-1   <none>           <none>
self-node-remediation-ds-p5fsv                                    1/1     Running     0          15h   10.128.0.20   master-0   <none>           <none>
self-node-remediation-ds-x4g7p                                    1/1     Running     0          15h   10.128.2.12   worker-0   <none>           <none>
self-node-remediation-ds-x8f67                                    1/1     Running     0          15h   10.130.0.40   master-2   <none>           <none>

@mshitrit
Copy link
Member

Thanks for the explanation, and really good job of catching this use-case !

/lgtm

Just a small Nit: I think you meant "can" instead of "can't"

no two pods with the label "control-plane: controller-manager" can't be

@openshift-ci openshift-ci bot added the lgtm label Feb 26, 2024
@mshitrit mshitrit marked this pull request as ready for review February 26, 2024 08:19
@openshift-ci openshift-ci bot requested review from mshitrit and slintes February 26, 2024 08:19
Copy link
Contributor

openshift-ci bot commented Feb 26, 2024

[APPROVALNOTIFIER] This PR is APPROVED

Approval requirements bypassed by manually added approval.

This pull-request has been approved by: k-keiichi-rh

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@slintes
Copy link
Member

slintes commented Feb 26, 2024

/hold

Copy link
Member

@slintes slintes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in general good catch with the affinity, I have some questions / concerns though, see comments inside

@@ -337,6 +337,16 @@ spec:
control-plane: controller-manager
self-node-remediation-operator: ""
spec:
affinity:
podAntiAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would this fail operator installation in case we only have one node available during installation?

Also, what happens on updates, when there aren't enough nodes to start the pods of the new version?

(the latter can be solved with matchLabelKeys and pod-template-hash, but that's alpha in k8s 1.29 only. PopologySpreadConstraints, also with matchLabelKeys and pod-template-hash, would also work AFAIU, but there matchLabelKeys also is too new, beta since k8s 1.27 / OCP 4.14).

Maybe preferredDuringSchedulingIgnoredDuringExecution is a better choice? It would potentially deploy both pods on the same node, but that's better than nothing, is it? However, I'm not sure if that would still make a difference to the default scheduling behaviour, which already respects replicasets... 🤔

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks like the update concern can also be solved with this:

strategy:
    rollingUpdate:
      maxSurge: 0
      maxUnavailable: 1
    type: RollingUpdate

it will force k8s to first delete one of the pods before starting a new one

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would this fail operator installation in case we only have one node available during installation?

I will test it this case. In my opinion, I think it should fail to install the operator installation.
There are the following cases:

  1. We have only one node available like SNO(Single Node OpenShift)
    => No reason to deploy SNR/FAR/NHC
  2. We have a plan to deploy a cluster with multiple nodes, but there is only one node available during installation due to the node failures
    => We are unhappy to succeed to deploy the cluster without noticing the node failures. It should stop the operator installation.

looks like the update concern can also be solved with this:

strategy:
    rollingUpdate:
      maxSurge: 0
      maxUnavailable: 1
    type: RollingUpdate

it will force k8s to first delete one of the pods before starting a new one

I will fix it with the above. And the Workload Availability project has the Node Maintenance Operator for maintenance. So we also need to define PodDisruptionBudget like https://github.com/openshift/cluster-monitoring-operator/blob/master/assets/thanos-querier/pod-disruption-budget.yaml.
It will prevent from stopping all of the managers accidentally.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have only one node available like SNO(Single Node OpenShift)
=> No reason to deploy SNR/FAR/NHC

agree

We have a plan to deploy a cluster with multiple nodes, but there is only one node available during installation due to the node failures
=> We are unhappy to succeed to deploy the cluster without noticing the node failures. It should stop the operator installation.

hm.... yes, actually it does not make sense to succeed, it wouldn't help fixing the cluster, because the agents would not be deployed on the failed nodes.

we also need to define PodDisruptionBudget

Let's discuss this in an issue please. As with this affinity, we need to be aware of side effects. Let's not make our operators more important than they are, compared to other deployments.

requiredDuringSchedulingIgnoredDuringExecution:
- labelSelector:
matchExpressions:
- key: control-plane
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

control-plane: controller-manager is a default label which is added by operator-sdk when scaffolding a new operator. So there is a very big risk that other operator's pods also have this label, which can lead to undesired behavior. Please use the self-node-remediation-operator: "" label, which is unique for SNR and also doesn't conflict with the daemonset pods.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will use the self-node-remediation-operator: "" label. Thank you for the suggestion.

@mshitrit
Copy link
Member

mshitrit commented Mar 7, 2024

/test ?

Copy link
Contributor

openshift-ci bot commented Mar 7, 2024

@mshitrit: The following commands are available to trigger required jobs:

  • /test 4.12-ci-index-self-node-remediation-bundle
  • /test 4.12-images
  • /test 4.12-openshift-e2e
  • /test 4.12-test
  • /test 4.13-ci-index-self-node-remediation-bundle
  • /test 4.13-images
  • /test 4.13-openshift-e2e
  • /test 4.13-test
  • /test 4.14-ci-index-self-node-remediation-bundle
  • /test 4.14-images
  • /test 4.14-openshift-e2e
  • /test 4.14-test
  • /test 4.15-ci-index-self-node-remediation-bundle
  • /test 4.15-images
  • /test 4.15-openshift-e2e
  • /test 4.15-test

Use /test all to run all jobs.

In response to this:

/test ?

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@mshitrit
Copy link
Member

mshitrit commented Mar 7, 2024

/test 4.14-openshift-e2e

Copy link
Contributor

openshift-ci bot commented May 27, 2024

@k-keiichi-rh: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/4.16-openshift-e2e 136e5f9 link true /test 4.16-openshift-e2e

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants