Fence Agents Remediation (FAR)

fence-agents-remediation (FAR) is a Kubernetes operator that uses well-known agents to fence and remediate unhealthy nodes. Remediation consists of fencing the unhealthy node with a fence agent, using an action such as reboot or off, and then evicting workloads from that node. The operator is recommended when a node becomes unhealthy and you want to remediate it by completely isolating the node from the cluster and helping to recover its workloads. Isolation is needed because the unhealthy node cannot be trusted and must be prevented from accessing shared resources such as RWO volumes. Recovering the workloads quickly reduces their downtime.

FAR is one of the Medik8s remediator operators, alongside Self Node Remediation and Machine Deletion Remediation, which are designed to run with the Node HealthCheck Operator (NHC). NHC detects an unhealthy node and creates a remediation Custom Resource (CR). Using FAR with NHC is recommended because it fully automates the remediation process, but more experienced users can also run FAR as a standalone remediator. Like other Medik8s operators, FAR was generated using the operator-sdk and supports Operator Lifecycle Manager (OLM).

About Fence Agents

FAR uses a fence agent to fence a Kubernetes node. Generally, fencing is the process of bringing an unresponsive or unhealthy computer into a safe state and isolating it. A fence agent is a software "driver" that prevents nodes from destroying data on shared storage by isolating corrupted nodes. Fencing with FAR is mostly power-based, which means power-cycling, resetting, or turning off the computer.

FAR uses some of the fence agents from the upstream repository maintained by the ClusterLabs group, for example fence_ipmilan for Intelligent Platform Management Interface (IPMI) environments or fence_aws for the Amazon Web Services (AWS) platform. These upstream fence agents are Python scripts that isolate a corrupted node from the rest of the cluster using power-based fencing. When a node is switched off, it cannot corrupt any data on shared storage. The fence agents take command-line arguments rather than configuration files. To better understand an agent's parameters, view its metadata (e.g., fence_ipmilan -o metadata).

FAR supports an --action parameter, which defines how the node should be remediated. FAR supports both reboot and off actions:

reboot: Restarts the node, allowing automatic recovery.
off: Powers off the node, allowing manual recovery.

Advantages

Robustness - FAR gets direct feedback about the result of the fence action from the agent's management Application Programming Interface (API) call (e.g., IPMI), without using the Kubernetes API.
Speed - FAR is fast because it can reboot a node and receive an acknowledgment from the API call, while other remediators might need to wait for a safe period before they can assume the node has rebooted.
Availability - FAR is highly available because it runs with two replicas of its pod. When the leader of these two pods is evicted, the other one takes control, which reduces FAR downtime.
Diversity - FAR includes several fence agents from a large, well-known set of upstream fencing agents for bare metal servers, virtual machines, cloud platforms, etc.
Adjustability - FAR lets you set different parameters for the API call that remediates the node.
Flexibility - FAR supports multiple remediation actions (reboot and off), allowing users to choose between faster recovery or stronger isolation depending on their cluster needs.

How does FAR work?

The operator watches for new or deleted CRs called FenceAgentsRemediation (or far). A far CR triggers remediation for the node whose name matches the CR's name. If the CR name doesn't match any node in the cluster, FAR does not trigger remediation for that CR. Remediation consists of 3 steps: cordoning the node, fencing the node, and accelerating rescheduling of the workloads:

Cordoning the Node

FAR adds a unique taint with the NoSchedule effect, remediation.medik8s.io/fence-agents-remediation:NoSchedule, to the unhealthy node. As long as the taint remains, no pods are scheduled on the node unless they tolerate the taint. The taint is removed when the far CR is deleted.
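To confirm that the taint is present, you can list the node's taints and look for the FAR key. The following is a minimal sketch: the helper name has_far_taint is hypothetical, and it only inspects text in the key:effect form produced by the kubectl command shown in the comment.

```shell
# Taint key and effect that FAR adds to the node (taken from the text above).
FAR_TAINT="remediation.medik8s.io/fence-agents-remediation:NoSchedule"

# has_far_taint (hypothetical helper): reads one "key:effect" entry per line
# on stdin and succeeds if the FAR taint is among them.
has_far_taint() {
  grep -Fqx "$FAR_TAINT"
}

# On a live cluster the input could come from kubectl, for example:
#   kubectl get node worker-1 \
#     -o jsonpath='{range .spec.taints[*]}{.key}{":"}{.effect}{"\n"}{end}' \
#     | has_far_taint

# Self-contained demonstration with sample input:
printf '%s\n' "node.kubernetes.io/unreachable:NoExecute" "$FAR_TAINT" \
  | has_far_taint && echo "FAR taint present"
```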

Fencing the Node

FAR remediates by executing a fence agent action, reboot or off, on the unhealthy node, and safely reschedules any remaining workloads to other nodes, so they can continue running and be isolated from the unhealthy node.

Reboot-based Remediation

With the reboot action (the default behavior), FAR reboots the unhealthy node by executing the fence agent.

Power-off-based Remediation

If you choose the off action, then FAR powers off the unhealthy node using the fence agent. Unlike reboot, the node does not automatically return to service. It remains powered off until manually recovered by an administrator.

Accelerating Rescheduling of Workloads

After the fence agent has been executed, the operator tries to accelerate Kubernetes rescheduling of the remaining workloads (mostly stateful pods) using one of two remediation strategies, selected by the remediationStrategy field: ResourceDeletion or OutOfServiceTaint.

ResourceDeletion

The default strategy, which forcefully deletes the workloads on the node.

OutOfServiceTaint

This strategy adds the well-known taint node.kubernetes.io/out-of-service=nodeshutdown:NoExecute to the node. The taint is supported on clusters with Kubernetes version 1.26+ or OCP/OKD version 4.13+. It causes workloads that don't tolerate the out-of-service taint to be forcefully deleted, and the volumes of those pods to be detached immediately.

FenceAgentsRemediationTemplate (fartemplate)

FAR includes the FenceAgentsRemediationTemplate (or fartemplate) Custom Resource Definition (CRD), which describes how to create a FenceAgentsRemediation CR. The template has the same fields as the far CR (e.g., agent name), and another operator or mechanism (e.g., NHC) uses it to create remediation CRs automatically. That operator is responsible for creating (and eventually deleting) the FenceAgentsRemediation CR with the name of the unhealthy node. FAR can also be used manually, without a fartemplate or an additional operator (see standalone FAR).

Operator Workflow

Prerequisites

FAR and NHC are installed on the cluster.
One of the nodes fails (it has become unhealthy), and NHC detects this node as unhealthy and decides to create a remediation CR, FenceAgentsRemediation CR, based on the external remediator template (e.g., fartemplate).

Workflow

1. FAR adds the NoSchedule taint to the failed node => New workloads are not scheduled on the failed node (this includes any workloads that were on the node before it was rebooted or powered off).
2. FAR executes the configured fence agent action on the failed node => Depending on the action (reboot or off), the node is either restarted or powered off => After the action, there are no workloads running on the failed node.
3. FAR forcefully deletes the pods on the failed node => The scheduler understands that it can schedule the failed pods on a different node.
4. After the failed node becomes healthy, NHC deletes the FenceAgentsRemediation CR, the NoSchedule taint from Step 1 is removed, and the node becomes schedulable again.

FenceAgentsRemediation CR Status

The FenceAgentsRemediation CR status includes three conditions: Processing, FenceAgentActionSucceeded, and Succeeded. Each condition has a status (true/false/unknown), a message, and a reason that indicates the state of the condition until it is met. These conditions make it easier to understand the state of the CR and whether an error occurred. For example, the FenceAgentsRemediation CR status below shows the state of the conditions after a successful remediation.

apiVersion: fence-agents-remediation.medik8s.io/v1alpha1
kind: FenceAgentsRemediation
metadata:
  name: NODE_NAME
spec:
  # ...
status:
  conditions:
    - type: Processing
      message: >-
        The unhealthy node was fully remediated (it was tainted, fenced using
        the fence agent and all the node resources have been deleted)
      reason: RemediationFinishedSuccessfully
      status: 'False'
    - type: FenceAgentActionSucceeded
      message: >-
        FAR taint was added and the fence agent command has been created and
        executed successfully
      reason: FenceAgentSucceeded
      status: 'True'   
    - type: Succeeded
      message: >-
        The unhealthy node was fully remediated (it was tainted, fenced using
        the fence agent and all the node resources have been deleted)
      reason: RemediationFinishedSuccessfully
      status: 'True'
  lastUpdateTime: '2024-01-30T10:49:46Z'
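Because Succeeded is reported as a regular status condition, kubectl wait can block until remediation has finished. The following is a sketch under stated assumptions: the wrapper wait_for_remediation and the KUBECTL override variable are hypothetical conveniences, and the namespace and timeout values are placeholders for your own.

```shell
# KUBECTL can be overridden (for example with "echo") to preview the command
# without contacting a cluster. This variable is a convenience of this sketch.
KUBECTL="${KUBECTL:-kubectl}"

# wait_for_remediation (hypothetical helper): block until the Succeeded
# condition of the FenceAgentsRemediation CR named after the node is True.
#   $1 = node name (also the CR name)
#   $2 = namespace that holds the CR
#   $3 = timeout (optional, defaults to 300s)
wait_for_remediation() {
  "$KUBECTL" wait --for=condition=Succeeded \
    "fenceagentsremediation/$1" -n "$2" --timeout="${3:-300s}"
}

# Preview the command that would run for node worker-1:
KUBECTL=echo wait_for_remediation worker-1 openshift-workload-availability
```

Dropping the KUBECTL=echo prefix runs the command against the cluster selected by the current kubeconfig.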

FAR Remediation Events

The operator emits remediation events on the node and on the remediation CR to make the remediation process easier to follow. Two important events are FenceAgentSucceeded, which signifies that the fence agent command succeeded, and RemediationFinished, which signifies that the remediation was completed. Every remediation event of FAR (and of the other Medik8s operators) has a message that begins with [remediation]. To list all remediation events, run oc get events -A | awk '/\[remediation\]/ || NR==1'. To list only the remediation events for the node and CR named worker-1, run oc get events -A | awk '/\[remediation\]/ && /worker-1/ || NR==1'.
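The awk filters can be wrapped in a small function that takes an optional node or CR name. The following sketch uses a hypothetical helper name, filter_remediation_events, and the sample lines are made up purely to demonstrate the filter; they are not real cluster output.

```shell
# filter_remediation_events (hypothetical helper): keep the header line and
# every line containing "[remediation]". When a name is passed as $1, the
# line must also contain that name.
filter_remediation_events() {
  awk -v name="${1:-}" \
    'NR == 1 || (/\[remediation\]/ && (name == "" || index($0, name) > 0))'
}

# Made-up sample input, only to show which lines pass the filter.
sample_events='NAMESPACE LAST-SEEN TYPE REASON OBJECT MESSAGE
default 2m Normal FenceAgentSucceeded node/worker-1 [remediation] sample message
default 5m Normal Scheduled pod/demo unrelated sample message
default 1m Normal RemediationFinished node/worker-2 [remediation] sample message'

# Prints the header and the worker-1 remediation line.
printf '%s\n' "$sample_events" | filter_remediation_events worker-1

# On a live cluster:
#   oc get events -A | filter_remediation_events worker-1
```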

Installation

There are three ways to install the operator:

Deploy the latest version, which was built from the main branch, to a running Kubernetes/OpenShift cluster.
Deploy the latest release version from the Kubernetes community, OperatorHub.io, to a running Kubernetes cluster.
Build and deploy from sources to a running Kubernetes/OpenShift cluster.

Deploy the latest version

After every PR is merged to the main branch, the images are built and pushed to quay.io (by the post-submit job). To deploy FAR using these images you need:

Install operator-sdk binary from their official website.
A running Kubernetes cluster, or an OpenShift (OCP) cluster with OLM installed. To install it on Kubernetes cluster run operator-sdk olm install.
A valid $KUBECONFIG is configured to access your cluster.
Run operator-sdk run bundle quay.io/medik8s/fence-agents-remediation-operator-bundle:latest to deploy FAR's latest version in the current namespace. Alternatively, run BUNDLE_RUN_NAMESPACE=<INSTALLED_NAMESPACE> make bundle-run to install FAR in the <INSTALLED_NAMESPACE> namespace.

Note: Installing FAR in a new namespace requires setting some labels on that namespace before installing FAR. The commands below label a namespace named olm; substitute your own namespace name:
kubectl label --overwrite ns olm security.openshift.io/scc.podSecurityLabelSync=false
kubectl label --overwrite ns olm pod-security.kubernetes.io/enforce=privileged

Deploy from the Kubernetes community

Go to OperatorHub, click on Install, and follow the instructions on how to install the operator on Kubernetes.

Build and deploy from sources

Clone FAR repository.
Follow OLM's instructions on how to configure the operator's image registry (build and push the operator container).
Run FAR in your cluster using its bundle container (similar to the above installation, and also see OLM's instructions).

Usage

FAR is recommended for use with NHC to automate high availability for unhealthy nodes: NHC detects unhealthy nodes and can create an external remediation CR, e.g., a FenceAgentsRemediation CR, for them. In this automated mode, NHC is responsible for creating and deleting the FenceAgentsRemediation CRs. FAR can also act as a standalone remediator, but then an advanced administrator must identify the nodes' health and create (and eventually delete) these CRs manually.

Either way, you must be familiar with the fence agent to be used: know its parameters and any other requirements it places on the cluster (e.g., fence_ipmilan needs machines that support IPMI).

FAR with NHC

Install NHC, and FAR using one of the above options (Installation).
Create the fartemplate CR (see below example).
Create a NodeHealthCheck CR that uses fartemplate as its external remediator in RemediationTemplate or EscalatingRemediations.

Example FenceAgentsRemediationTemplate CR

The fartemplate CR is created by the administrator, and NHC can use it for creating a remediation CR, e.g. FenceAgentsRemediation. For a better understanding please see the below example of a dummy fartemplate object:

apiVersion: fence-agents-remediation.medik8s.io/v1alpha1
kind: FenceAgentsRemediationTemplate
metadata:
  name: fenceagentsremediationtemplate-default
  namespace: default
spec:
  template: {}

Note: The FenceAgentsRemediationTemplate CR must be created in the same namespace in which the FAR operator is installed.

The following NodeHealthCheck CR is configured to use the example fenceagentsremediationtemplate-default template above:

apiVersion: remediation.medik8s.io/v1alpha1
kind: NodeHealthCheck
metadata:
  name: nodehealthcheck-sample
spec:
  remediationTemplate:
    apiVersion: fence-agents-remediation.medik8s.io/v1alpha1
    kind: FenceAgentsRemediationTemplate
    name: fenceagentsremediationtemplate-default
    namespace: default

NHC creates FenceAgentsRemediation CR using fartemplate after it detects an unhealthy node (according to NHC's unhealthy conditions). FenceAgentsRemediation CRs are deleted by NHC after it detects the node is healthy again.

Standalone FAR

Install FAR using one of the above options (Installation).
Create FenceAgentsRemediation CR with the name of the node to be remediated, the fence agent name, and its parameters.

Example FenceAgentsRemediation CR

The FAR CR, FenceAgentsRemediation, is created by the admin and is used to trigger the fence agent on a specific node. The CR includes the following parameters:

agent - fence agent name. The file name is validated (by kubebuilder and a webhook) against a list of supported agents in the FAR pod.
sharedparameters - cluster wide parameters for executing the fence agent.
nodeparameters - node specific parameters for executing the fence agent.
retrycount - number of times to retry the fence agent in case of failure. The default is 5.
retryinterval - interval between retries in seconds. The default is "5s".
timeout - timeout for the fence agent in seconds. The default is "60s".
remediationStrategy - either OutOfServiceTaint or ResourceDeletion:
- OutOfServiceTaint: This remediation strategy implicitly causes the deletion of the pods and the detachment of the associated volumes on the node. It achieves this by placing the OutOfServiceTaint taint on the node.
- ResourceDeletion: This remediation strategy deletes the pods on the node.
sharedSecretName - the name of the Secret containing cluster-wide parameters. Defaults to "fence-agents-credentials-shared", but can be overridden by the user.
nodeSecretNames - a mapping from each node name to the name of the Secret that contains the parameters relevant for that node.

For a better understanding, see the example below of a FenceAgentsRemediation CR for node worker-1 (also available as the sample FAR):

apiVersion: fence-agents-remediation.medik8s.io/v1alpha1
kind: FenceAgentsRemediation
metadata:
  name: worker-1
spec:
  agent: fence_ipmilan
  retrycount: 5
  retryinterval: "5s"
  timeout: "60s"
  sharedparameters:
    --username: "admin"
    --lanplus: ""
    # You can set --action to either "reboot" or "off" depending on your remediation strategy
    --action: "reboot"
    --ip: "192.168.111.1"
  nodeparameters:
    --ipport:
      master-0: "6230"
      master-1: "6231"
      master-2: "6232"
      worker-0: "6233"
      worker-1: "6234"
      worker-2: "6235"
  remediationStrategy: OutOfServiceTaint
  sharedSecretName: fence-agents-credentials-shared
  nodeSecretNames:
    worker-0: fence-agents-credentials-worker0
    worker-1: fence-agents-credentials-worker1
    master-0: fence-agents-credentials-master0
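As a rough illustration of how the CR fields relate to the fence agent's command-line arguments, the sketch below assembles an argument list for node worker-1 from the example values. It assumes that each parameter is passed as name=value and that a parameter with an empty value (such as --lanplus) is passed as a bare flag; both are assumptions of this sketch, not a description of FAR's internals. The helper name fence_arg is hypothetical.

```shell
# fence_arg (hypothetical helper): print "name=value", or just "name" when
# the value is empty (assumed handling for flags such as --lanplus).
fence_arg() {
  if [ -n "$2" ]; then
    printf '%s=%s\n' "$1" "$2"
  else
    printf '%s\n' "$1"
  fi
}

# Shared parameters from the example, plus the --ipport value for worker-1.
{
  echo fence_ipmilan
  fence_arg --username admin
  fence_arg --lanplus ""
  fence_arg --action reboot
  fence_arg --ip 192.168.111.1
  fence_arg --ipport 6234
} | paste -sd ' ' -
```

Parameters that come from Secrets (for example --password) are left out here on purpose.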

Secret Support:

You can define:
- A shared Secret containing parameters used across all nodes.
- Node-specific Secrets. Use nodeSecretNames to map each node to the Secret it should consume.
If a parameter exists in both a shared and a node Secret, the node Secret value takes precedence.

If a parameter is defined in both a Secret and in the sharedparameters or nodeparameters fields of the CR, a validation error will occur to prevent ambiguity.
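The two rules above can be illustrated with a small sketch that works on plain name=value text files instead of real Secrets. The helper names merge_params and find_conflicts, the file layout, and all sample values are hypothetical; the sketch only mirrors the described precedence and conflict rules and does not reproduce FAR's validation code.

```shell
# merge_params (hypothetical helper): $1 = shared Secret parameters,
# $2 = node Secret parameters, one "name=value" per line. Later input wins,
# so node values override shared ones. Output is sorted by line.
merge_params() {
  awk '{ eq = index($0, "="); key = substr($0, 1, eq - 1); merged[key] = $0 }
       END { for (key in merged) print merged[key] }' "$1" "$2" | sort
}

# find_conflicts (hypothetical helper): print parameter names that appear
# both in the Secret parameters ($1) and in the CR parameters ($2).
find_conflicts() {
  awk -F= 'NR == FNR { seen[$1] = 1; next } ($1 in seen) { print $1 }' "$1" "$2"
}

# Made-up sample data.
workdir="$(mktemp -d)"
printf '%s\n' '--password=shared-pass' '--username=admin' > "$workdir/shared"
printf '%s\n' '--password=node-pass' > "$workdir/node"
printf '%s\n' '--ip=192.168.111.1' '--username=root' > "$workdir/cr"

merge_params "$workdir/shared" "$workdir/node" > "$workdir/merged"
cat "$workdir/merged"                            # --password comes from the node file
find_conflicts "$workdir/merged" "$workdir/cr"   # prints --username
```

In this sample, --username is defined both in a Secret and in the CR parameters, which is the situation the validation error is meant to catch.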

NodeName Template Support:

Both sharedparameters and shared Secret parameters support Go template syntax for dynamic node name substitution:

Use {{.NodeName}} in parameter values to substitute the actual node name at runtime.
Templates are processed for both shared parameters and shared/node secret parameters.
Invalid template syntax will prevent fence agent execution and log appropriate error messages.

Examples:

For shared parameters:

spec:
  sharedparameters:
    --systems-uri: "/redfish/v1/Systems/{{.NodeName}}"
    --hostname: "{{.NodeName}}.example.com"

For shared secrets:

apiVersion: v1
kind: Secret
metadata:
  name: fence-agents-credentials-shared
  namespace: openshift-workload-availability
type: Opaque
data:
  --systems-uri: L3JlZGZpc2gvdjEvU3lzdGVtcy97ey5Ob2RlTmFtZX19  # "/redfish/v1/Systems/{{.NodeName}}" base64 encoded
  --hostname: e3suTm9kZU5hbWV9fS5leGFtcGxlLmNvbQ==  # "{{.NodeName}}.example.com" base64 encoded
  --password: eXl5eQ==  # "yyyy" base64 encoded

When a FenceAgentsRemediation is processed for node worker-1, the templates will be resolved to:

--systems-uri=/redfish/v1/Systems/worker-1
--hostname=worker-1.example.com
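To preview how a value resolves for a given node, the substitution can be approximated with sed. This is a stand-in, not Go's template engine: it handles only the literal {{.NodeName}} placeholder, and the helper name render_node_name is hypothetical.

```shell
# render_node_name (hypothetical helper): replace every literal
# {{.NodeName}} in the value ($1) with the node name ($2).
# Real Go templates support much more than this single placeholder.
render_node_name() {
  printf '%s\n' "$1" | sed "s/{{\.NodeName}}/$2/g"
}

render_node_name '/redfish/v1/Systems/{{.NodeName}}' worker-1  # /redfish/v1/Systems/worker-1
render_node_name '{{.NodeName}}.example.com' worker-1          # worker-1.example.com
```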

Here is an example for a Secret without templates:

apiVersion: v1
kind: Secret
metadata:
  name: fence-agents-credentials-worker0
  namespace: openshift-workload-availability
type: Opaque
data:
  --password: eXl5eQ== # "yyyy" base64 encoded
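Values under data in a Secret must be base64 encoded. A minimal sketch for producing them follows; the helper name encode_secret_value is hypothetical. It uses printf '%s' so that no trailing newline ends up inside the encoded value.

```shell
# encode_secret_value (hypothetical helper): base64-encode the value exactly
# as given (no trailing newline) and print the result on a single line.
encode_secret_value() {
  printf '%s' "$1" | base64 | tr -d '\n'
  echo
}

encode_secret_value 'yyyy'                       # eXl5eQ==
encode_secret_value '{{.NodeName}}.example.com'  # e3suTm9kZU5hbWV9fS5leGFtcGxlLmNvbQ==

# Decoding an existing value to check it:
printf '%s' 'eXl5eQ==' | base64 --decode; echo   # yyyy
```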

Tests

Run code checks and unit tests

Run make test

Run e2e tests

Deploy the operator as explained above
(Only for AWS platforms) Run make ocp-aws-credentials to add sufficient CredentialsRequest.
Export the operator installed namespace (e.g., openshift-workload-availability) before running the e2e test: export OPERATOR_NS=openshift-workload-availability && make test-e2e

Run Scorecard tests

Run make test-scorecard on a running Kubernetes cluster to statically validate the operator bundle directory using Scorecard.

Troubleshooting

Watch the FenceAgentsRemediation CR status conditions value, message, and reason for better understanding whether the fence agent action succeeded and the remediation completed.
Watch for the emitted remediation events at FenceAgentsRemediation CR or the remediated node for easier identification of the remediation process.
Investigate FAR’s pod logs in the container manager (kubectl logs -n <INSTALLED_NAMESPACE> --selector='app.kubernetes.io/name=fence-agents-remediation-operator' -c manager).
Use the Medik8s team's must-gather (for OCP only) by running oc adm must-gather --image=quay.io/medik8s/must-gather. It collects debug information related to FAR and the other Medik8s operators.

Help

Feel free to join our Google group to get more info - Medik8s Google Group.
