From c73dca1234bf02d897426cf791f35289402d5a7d Mon Sep 17 00:00:00 2001 From: Tim Collins Date: Mon, 10 Feb 2025 15:56:48 +0000 Subject: [PATCH 1/6] docs: Update HA documentation Signed-off-by: Tim Collins --- docs/high-availability.md | 36 +++++++++++++++++++++++++----------- 1 file changed, 25 insertions(+), 11 deletions(-) diff --git a/docs/high-availability.md b/docs/high-availability.md index c1a78de76ee7..d7aeb68effff 100644 --- a/docs/high-availability.md +++ b/docs/high-availability.md @@ -2,23 +2,37 @@ ## Workflow Controller -Before v3.0, only one controller could run at once. (If it crashed, Kubernetes would start another pod.) +In the event of a Workflow Controller pod failure, the replacement Controller pod will continue running Workflows when it is created. +In most cases, this short loss of Workflow Controller service may be acceptable. -> v3.0 and after +If you run a single replica of the Workflow Controller, ensure that the [environment variable](environment-variables.md#controller) `LEADER_ELECTION_DISABLE` is set to `true` and that the pod uses the `workflow-controller` [Priority Class](https://kubernetes.io/docs/concepts/scheduling-eviction/pod-priority-preemption/) included in the installation manifests. -For many users, a short loss of workflow service may be acceptable - the new controller will just continue running -workflows if it restarts. However, with high service guarantees, new pods may take too long to start running workflows. -You should run two replicas, and one of which will be kept on hot-standby. +By disabling the leader election process, you can avoid unnecessary communication with the Kubernetes API, which may become unresponsive when running Workflows at scale. -A voluntary pod disruption can cause both replicas to be replaced at the same time. You should use a Pod Disruption -Budget to prevent this and Pod Priority to recover faster from an involuntary pod disruption: +By using the `PriorityClass`, you can ensure that the Workflow Controller pod is scheduled before other pods in the cluster. -* [Pod Disruption Budget](https://kubernetes.io/docs/concepts/workloads/pods/disruptions/#pod-disruption-budgets) -* [Pod Priority](https://kubernetes.io/docs/concepts/scheduling-eviction/pod-priority-preemption/) +### Multiple Workflow Controller Replicas -## Argo Server +It is possible to run multiple replicas of the Workflow Controller to provide high-availability. Ensure that leader election is enabled (either by omitting the `LEADER_ELECTION_DISABLE` or setting it to `false`). + +Only one replica of the Workflow Controller will actively manage workflows at any given time. +The other replicas will be on standby, ready to take over if the active replica fails. +This means that you are guaranteeing resource allocations for replicas that are not actively contributing to the running of workflows. + +The leader election process requires frequent communication with the Kubernetes API. +When running workflows at scale, the Kubernetes API may become unresponsive, causing the leader election to take longer than 10 seconds (`LEADER_ELECTION_RENEW_DEADLINE`) to respond, which will disrupt the controller. + +Even with multiple replicas, a voluntary pod disruption can cause both replicas to be replaced simultaneously. +Use a [Pod Disruption Budget](https://kubernetes.io/docs/concepts/workloads/pods/disruptions/#pod-disruption-budgets) to prevent this. 
-> v2.6 and after +### Considerations + +A single replica of the Workflow Controller is recommended for most use cases due to: + +- The time to re-provision the controller pod is often faster than the time for an existing pod to win a leader election, especially when the cluster is under load. +- You save on the cost of extra Kubernetes resource allocations that aren't being used. + +## Argo Server Run a minimum of two replicas, typically three, should be run, otherwise it may be possible that API and webhook requests are dropped. From d9616675a0b4817cce96649fbac7e2f822252ef1 Mon Sep 17 00:00:00 2001 From: Tim Collins Date: Mon, 10 Feb 2025 16:00:08 +0000 Subject: [PATCH 2/6] new line Signed-off-by: Tim Collins --- docs/high-availability.md | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/docs/high-availability.md b/docs/high-availability.md index d7aeb68effff..e4f224134b0b 100644 --- a/docs/high-availability.md +++ b/docs/high-availability.md @@ -13,7 +13,8 @@ By using the `PriorityClass`, you can ensure that the Workflow Controller pod is ### Multiple Workflow Controller Replicas -It is possible to run multiple replicas of the Workflow Controller to provide high-availability. Ensure that leader election is enabled (either by omitting the `LEADER_ELECTION_DISABLE` or setting it to `false`). +It is possible to run multiple replicas of the Workflow Controller to provide high-availability +Ensure that leader election is enabled (either by omitting the `LEADER_ELECTION_DISABLE` or setting it to `false`). Only one replica of the Workflow Controller will actively manage workflows at any given time. The other replicas will be on standby, ready to take over if the active replica fails. From 85dea7a945de4bbcd2f507ce3d58992564ca37d4 Mon Sep 17 00:00:00 2001 From: Tim Collins Date: Mon, 10 Feb 2025 16:00:40 +0000 Subject: [PATCH 3/6] fullstop Signed-off-by: Tim Collins --- docs/high-availability.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/high-availability.md b/docs/high-availability.md index e4f224134b0b..f0d1bd47d263 100644 --- a/docs/high-availability.md +++ b/docs/high-availability.md @@ -13,7 +13,7 @@ By using the `PriorityClass`, you can ensure that the Workflow Controller pod is ### Multiple Workflow Controller Replicas -It is possible to run multiple replicas of the Workflow Controller to provide high-availability +It is possible to run multiple replicas of the Workflow Controller to provide high-availability. Ensure that leader election is enabled (either by omitting the `LEADER_ELECTION_DISABLE` or setting it to `false`). Only one replica of the Workflow Controller will actively manage workflows at any given time. From 3f49aaac943487193d5b738b2827f8e840d4ee27 Mon Sep 17 00:00:00 2001 From: Tim Collins Date: Mon, 10 Feb 2025 16:01:46 +0000 Subject: [PATCH 4/6] english Signed-off-by: Tim Collins --- docs/high-availability.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/high-availability.md b/docs/high-availability.md index f0d1bd47d263..390785c933f7 100644 --- a/docs/high-availability.md +++ b/docs/high-availability.md @@ -30,8 +30,8 @@ Use a [Pod Disruption Budget](https://kubernetes.io/docs/concepts/workloads/pods A single replica of the Workflow Controller is recommended for most use cases due to: -- The time to re-provision the controller pod is often faster than the time for an existing pod to win a leader election, especially when the cluster is under load. 
-- You save on the cost of extra Kubernetes resource allocations that aren't being used. +- The time taken to re-provision the controller pod often being faster than the time for an existing pod to win a leader election, especially when the cluster is under load. +- Saving on the cost of extra Kubernetes resource allocations that aren't being used. ## Argo Server From f8bb420f129b16ce401d68e356f0e65feeed8182 Mon Sep 17 00:00:00 2001 From: Tim Collins Date: Mon, 10 Feb 2025 16:15:14 +0000 Subject: [PATCH 5/6] more k8s docs conformity. Improve server docs. Deduplicate. Signed-off-by: Tim Collins --- docs/high-availability.md | 32 +++++++++++++++++--------------- 1 file changed, 17 insertions(+), 15 deletions(-) diff --git a/docs/high-availability.md b/docs/high-availability.md index 390785c933f7..39ab12f834a5 100644 --- a/docs/high-availability.md +++ b/docs/high-availability.md @@ -1,41 +1,43 @@ # High-Availability (HA) +By default, the Workflow Controller Pod(s) and the Argo Server Pod(s) do not have resource requests or limits configured. +Set resource requests to guarantee a resource allocation appropriate for your workloads. + +When you use multiple replicas of the same deployment, spread the Pods across multiple availability zones. +At a minimum, ensure that the Pods are not scheduled on the same node. + +Use a [Pod Disruption Budget](https://kubernetes.io/docs/concepts/workloads/pods/disruptions/#pod-disruption-budgets) to prevent all replicas from being replaced simultaneously. + + ## Workflow Controller -In the event of a Workflow Controller pod failure, the replacement Controller pod will continue running Workflows when it is created. +In the event of a Workflow Controller Pod failure, the replacement Controller Pod will continue running Workflows when it is created. In most cases, this short loss of Workflow Controller service may be acceptable. -If you run a single replica of the Workflow Controller, ensure that the [environment variable](environment-variables.md#controller) `LEADER_ELECTION_DISABLE` is set to `true` and that the pod uses the `workflow-controller` [Priority Class](https://kubernetes.io/docs/concepts/scheduling-eviction/pod-priority-preemption/) included in the installation manifests. +If you run a single replica of the Workflow Controller, ensure that the [environment variable](environment-variables.md#controller) `LEADER_ELECTION_DISABLE` is set to `true` and that the Pod uses the `workflow-controller` [Priority Class](https://kubernetes.io/docs/concepts/scheduling-eviction/pod-priority-preemption/) included in the installation manifests. By disabling the leader election process, you can avoid unnecessary communication with the Kubernetes API, which may become unresponsive when running Workflows at scale. -By using the `PriorityClass`, you can ensure that the Workflow Controller pod is scheduled before other pods in the cluster. +By using the `PriorityClass`, you can ensure that the Workflow Controller Pod is scheduled before other Pods in the cluster. ### Multiple Workflow Controller Replicas It is possible to run multiple replicas of the Workflow Controller to provide high-availability. Ensure that leader election is enabled (either by omitting the `LEADER_ELECTION_DISABLE` or setting it to `false`). -Only one replica of the Workflow Controller will actively manage workflows at any given time. +Only one replica of the Workflow Controller will actively manage Workflows at any given time. 
The other replicas will be on standby, ready to take over if the active replica fails. -This means that you are guaranteeing resource allocations for replicas that are not actively contributing to the running of workflows. +This means that you are guaranteeing resource allocations for replicas that are not actively contributing to the running of Workflows. The leader election process requires frequent communication with the Kubernetes API. -When running workflows at scale, the Kubernetes API may become unresponsive, causing the leader election to take longer than 10 seconds (`LEADER_ELECTION_RENEW_DEADLINE`) to respond, which will disrupt the controller. - -Even with multiple replicas, a voluntary pod disruption can cause both replicas to be replaced simultaneously. -Use a [Pod Disruption Budget](https://kubernetes.io/docs/concepts/workloads/pods/disruptions/#pod-disruption-budgets) to prevent this. +When running Workflows at scale, the Kubernetes API may become unresponsive, causing the leader election to take longer than 10 seconds (`LEADER_ELECTION_RENEW_DEADLINE`) to respond, which will disrupt the controller. ### Considerations A single replica of the Workflow Controller is recommended for most use cases due to: - -- The time taken to re-provision the controller pod often being faster than the time for an existing pod to win a leader election, especially when the cluster is under load. +- The time taken to re-provision the controller Pod often being faster than the time for an existing Pod to win a leader election, especially when the cluster is under load. - Saving on the cost of extra Kubernetes resource allocations that aren't being used. ## Argo Server -Run a minimum of two replicas, typically three, should be run, otherwise it may be possible that API and webhook requests are dropped. - -!!! Tip - Consider [spreading Pods across multiple availability zones](https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/). +Run a minimum of two replicas, typically three, to avoid dropping API and webhook requests. From 2c88277633fc90aae9d6cc6379c188e090e9bab6 Mon Sep 17 00:00:00 2001 From: Tim Collins Date: Mon, 10 Feb 2025 16:15:41 +0000 Subject: [PATCH 6/6] make docs Signed-off-by: Tim Collins --- docs/high-availability.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/high-availability.md b/docs/high-availability.md index 39ab12f834a5..5f48e098ed8e 100644 --- a/docs/high-availability.md +++ b/docs/high-availability.md @@ -8,7 +8,6 @@ At a minimum, ensure that the Pods are not scheduled on the same node. Use a [Pod Disruption Budget](https://kubernetes.io/docs/concepts/workloads/pods/disruptions/#pod-disruption-budgets) to prevent all replicas from being replaced simultaneously. - ## Workflow Controller In the event of a Workflow Controller Pod failure, the replacement Controller Pod will continue running Workflows when it is created. @@ -35,6 +34,7 @@ When running Workflows at scale, the Kubernetes API may become unresponsive, cau ### Considerations A single replica of the Workflow Controller is recommended for most use cases due to: + - The time taken to re-provision the controller Pod often being faster than the time for an existing Pod to win a leader election, especially when the cluster is under load. - Saving on the cost of extra Kubernetes resource allocations that aren't being used.
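
The settings this page describes (disabling leader election for a single controller replica, the `workflow-controller` Priority Class, explicit resource requests, spreading Argo Server replicas across zones, and a Pod Disruption Budget) can be expressed as a small overlay on the installation manifests. The sketch below assumes the upstream kustomize base, an `argo` namespace, `app` labels matching the install manifests, and illustrative resource values; none of these assumptions come from the docs themselves, so align them with your actual installation before use.

```yaml
# kustomization.yaml — a sketch of an overlay applying the HA settings described above.
# The base path/ref, the `argo` namespace, the `app` labels, and the resource values are
# assumptions for illustration; match them to your actual installation manifests.
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: argo
resources:
  - https://github.com/argoproj/argo-workflows/manifests/cluster-install?ref=v3.6.2  # assumed base and version
  - argo-server-pdb.yaml
patches:
  # Single-replica Workflow Controller: disable leader election, use the
  # `workflow-controller` Priority Class from the installation manifests, and set
  # explicit resource requests (none are configured by default).
  - patch: |-
      apiVersion: apps/v1
      kind: Deployment
      metadata:
        name: workflow-controller
      spec:
        replicas: 1
        template:
          spec:
            priorityClassName: workflow-controller
            containers:
              - name: workflow-controller
                env:
                  - name: LEADER_ELECTION_DISABLE
                    value: "true"
                resources:
                  requests:        # illustrative sizing; adjust for your workloads
                    cpu: 500m
                    memory: 512Mi
  # Argo Server: three replicas spread across availability zones so a single zone or
  # node failure cannot drop API and webhook traffic.
  - patch: |-
      apiVersion: apps/v1
      kind: Deployment
      metadata:
        name: argo-server
      spec:
        replicas: 3
        template:
          spec:
            topologySpreadConstraints:
              - maxSkew: 1
                topologyKey: topology.kubernetes.io/zone
                whenUnsatisfiable: ScheduleAnyway
                labelSelector:
                  matchLabels:
                    app: argo-server   # assumed label
---
# argo-server-pdb.yaml — keep at least one Argo Server available during voluntary disruptions.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: argo-server
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: argo-server   # assumed label
```

Applying the overlay with `kubectl apply -k .` would then layer these changes on top of whichever base you point the `resources` entry at; the same fields can equally be set directly in a Helm values file or by editing the installed Deployments.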