[Question] Fence_kubevirt STONITH failure when storage LAN fails.

Hi All,

I have a question about how fence_kubevirt behaves.

I have built a Pacemaker cluster on OpenShift Virtualization (OCP-V).
The active (ACT) and standby (STB) virtual machines run on different compute nodes, and their disks are stored on NFS storage reached over a storage LAN that is separate from the management LAN.

Step 1) Disconnect the NFS LAN of the compute node where the ACT virtual machine is running.

Step 2) The STB node detects the ACT failure and runs STONITH through fence_kubevirt.
On the web console, the target virtual machine transitions from Stopping to Stopped and appears to have stopped correctly.

Step 3) However, the fence_kubevirt action does not succeed, because the virtual machine instance remains (oc get vmi shows its phase as Failed).

```
[root@rh95-02 ~]# crm_mon -rfA1
Cluster Summary:
  * Stack: corosync (Pacemaker is running)
  * Current DC: rh95-02 (version 2.1.8-3.el9-3980678f0) - partition with quorum
  * Last updated: Thu May 29 12:03:03 2025 on rh95-02
  * Last change:  Thu May 29 12:01:11 2025 by hacluster via hacluster on rh95-02
  * 2 nodes configured
  * 10 resource instances configured

Node List:
  * Node rh95-01: UNCLEAN (offline)
  * Online: [ rh95-02 ]

Full List of Resources:
  * Clone Set: pgsql-clone [pgsql] (promotable):
    * pgsql     (ocf:linuxhajp:pgsql):   Unpromoted rh95-01 (UNCLEAN)
    * Unpromoted: [ rh95-02 ]
  * Resource Group: primary-group:
    * ipaddr-primary    (ocf:heartbeat:IPaddr2):         Started rh95-01 (UNCLEAN)
    * ipaddr-replication        (ocf:heartbeat:IPaddr2):         Started rh95-01 (UNCLEAN)
  * Clone Set: ping-clone [ping]:
    * Started: [ rh95-02 ]
    * Stopped: [ rh95-01 ]
  * Clone Set: storage-mon-clone [storage-mon]:
    * Started: [ rh95-02 ]
    * Stopped: [ rh95-01 ]
  * fence1-kubevirt     (stonith:fence_kubevirt):        Started rh95-02
  * fence2-kubevirt     (stonith:fence_kubevirt):        Stopped

Node Attributes:
  * Node: rh95-02:
    * master-pgsql                      : 100
    * pgsql-data-status                 : STREAMING|SYNC
    * pgsql-master-baseline             : 0000000015000060
    * pgsql-status                      : HS:alone
    * ping-status                       : 1

Migration Summary:

Failed Fencing Actions:
  * reboot of rh95-01 by rh95-02 for pacemaker-controld.69801@rh95-02 last failed at 2025-05-29 12:03:03.038906 +09:00
  * reboot of rh95-01 by rh95-02 for pacemaker-controld.69801@rh95-02 last failed at 2025-05-29 12:03:01.356906 +09:00
  * reboot of rh95-01 by rh95-02 for pacemaker-controld.69801@rh95-02 last failed at 2025-05-29 12:02:59.685906 +09:00
(snip)
Pending Fencing Actions:
  * reboot of rh95-01 for pacemaker-controld.69801@rh95-02 last pending
[root@rh95-02 ~]#

[nttossc@dn-ocp-bastion ~]$ oc get vmi
NAME            AGE   PHASE     IP            NODENAME       READY
ocp-rh95-01     26h   Failed    10.130.2.24   dn-ocp-cmp01   False
ocp-rh95-02     26h   Running   10.130.0.86   dn-ocp-cmp02   True
ocp-rh95-ping   8d    Running   10.131.0.42   dn-ocp-cmp03   True
[nttossc@dn-ocp-bastion ~]$
```

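My understanding is that this STONITH failure is by design: the fence action fails because the virtual machine instance still exists. Is that correct?
I also believe that deletion of the virtual machine instance is controlled on the OCP side and cannot be handled by fence_kubevirt.

In the fence_kubevirt code quoted below, get_power_status returns "off" only when the VirtualMachineInstance lookup returns 404, and translate_status maps every phase other than Running (including Failed) to "unknown".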

```
(snip)
def get_power_status(conn, options):
    logging.debug("Starting get status operation")
    try:
        apiversion = options.get("--apiversion")
        namespace = _get_namespace(options)
        name = options.get("--plug")
        vmi_api = conn.resources.get(api_version=apiversion,
                                              kind='VirtualMachineInstance')
        vmi = vmi_api.get(name=name, namespace=namespace)
        return translate_status(vmi.status.phase)
    except ApiException as e:
        if e.status == 404:
            try:
                vm_api = conn.resources.get(api_version=apiversion, kind='VirtualMachine')
                vm = vm_api.get(name=name, namespace=namespace)
            except ApiException as e:
                logging.error("VM %s doesn't exist", name)
                fail(EC_FETCH_VM_UUID)
            return "off"
        logging.error("Failed to get power status, with API Exception: %s", e)
        fail(EC_STATUS)
    except Exception as e:
        logging.error("Failed to get power status, with Exception: %s", e)
        fail(EC_STATUS)

def translate_status(instance_status):
    if instance_status == "Running":
        return "on"
    return "unknown"
(snip)
```
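To make the question concrete, here is a minimal self-contained sketch of the status mapping as I read it. It is not the agent's actual code path: translate_status is copied from the snippet above, while power_status_for and off_confirmed are hypothetical helpers, None stands for the 404 case (the VMI is gone but the VM still exists), and the rule that power-off is confirmed only when a poll reports "off" is my assumption.

```python
def translate_status(instance_status):
    # Copied from the fence_kubevirt snippet quoted above.
    if instance_status == "Running":
        return "on"
    return "unknown"


def power_status_for(vmi_phase):
    """Hypothetical stand-in for get_power_status (no API calls).

    vmi_phase is the VMI's status.phase, or None when the VMI lookup
    would return 404 while the VM object itself still exists.
    """
    if vmi_phase is None:
        return "off"
    return translate_status(vmi_phase)


def off_confirmed(observed_phases):
    # Assumption: the power-off counts as confirmed only once
    # some status poll reports "off".
    return any(power_status_for(p) == "off" for p in observed_phases)


if __name__ == "__main__":
    for phase in ("Running", "Failed", None):
        print(phase, "->", power_status_for(phase))
    # VMI stays in the Failed phase: "off" is never reported.
    print(off_confirmed(["Running", "Failed", "Failed"]))
    # VMI object is eventually deleted: "off" is reported.
    print(off_confirmed(["Running", None]))
```

Under these assumptions, a VMI that stays in the Failed phase is reported as "unknown" on every poll, so the power-off is never confirmed. That matches the repeated reboot failures in the crm_mon output above.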

Best Regards,
Hideo Yamauchi.
