[Question] Fence_kubevirt STONITH failure when storage LAN fails. #629

@HideoYamauchi

Hi All,

I have a question about how fence_kubevirt behaves.

I have built a Pacemaker cluster on OCP-V.
The active (ACT) and standby (STB) virtual machines run on different compute nodes, and the virtual machines are stored on NFS storage reached over a storage LAN that is separate from the management LAN.

Step 1) Disconnect the NFS LAN (storage LAN) on the compute node where the ACT virtual machine is running.

Step 2) The STB node detects the ACT failure and runs STONITH through fence_kubevirt.
In the web console, the target virtual machine transitions from Stopping to Stopped and appears to have stopped correctly.

Step 3) However, fence_kubevirt does not succeed because the virtual machine instance still exists (oc get vmi shows it in the Failed phase).

[root@rh95-02 ~]# crm_mon -rfA1
Cluster Summary:
  * Stack: corosync (Pacemaker is running)
  * Current DC: rh95-02 (version 2.1.8-3.el9-3980678f0) - partition with quorum
  * Last updated: Thu May 29 12:03:03 2025 on rh95-02
  * Last change:  Thu May 29 12:01:11 2025 by hacluster via hacluster on rh95-02
  * 2 nodes configured
  * 10 resource instances configured

Node List:
  * Node rh95-01: UNCLEAN (offline)
  * Online: [ rh95-02 ]

Full List of Resources:
  * Clone Set: pgsql-clone [pgsql] (promotable):
    * pgsql     (ocf:linuxhajp:pgsql):   Unpromoted rh95-01 (UNCLEAN)
    * Unpromoted: [ rh95-02 ]
  * Resource Group: primary-group:
    * ipaddr-primary    (ocf:heartbeat:IPaddr2):         Started rh95-01 (UNCLEAN)
    * ipaddr-replication        (ocf:heartbeat:IPaddr2):         Started rh95-01 (UNCLEAN)
  * Clone Set: ping-clone [ping]:
    * Started: [ rh95-02 ]
    * Stopped: [ rh95-01 ]
  * Clone Set: storage-mon-clone [storage-mon]:
    * Started: [ rh95-02 ]
    * Stopped: [ rh95-01 ]
  * fence1-kubevirt     (stonith:fence_kubevirt):        Started rh95-02
  * fence2-kubevirt     (stonith:fence_kubevirt):        Stopped

Node Attributes:
  * Node: rh95-02:
    * master-pgsql                      : 100
    * pgsql-data-status                 : STREAMING|SYNC
    * pgsql-master-baseline             : 0000000015000060
    * pgsql-status                      : HS:alone
    * ping-status                       : 1

Migration Summary:

Failed Fencing Actions:
  * reboot of rh95-01 by rh95-02 for pacemaker-controld.69801@rh95-02 last failed at 2025-05-29 12:03:03.038906 +09:00
  * reboot of rh95-01 by rh95-02 for pacemaker-controld.69801@rh95-02 last failed at 2025-05-29 12:03:01.356906 +09:00
  * reboot of rh95-01 by rh95-02 for pacemaker-controld.69801@rh95-02 last failed at 2025-05-29 12:02:59.685906 +09:00
(snip)
Pending Fencing Actions:
  * reboot of rh95-01 for pacemaker-controld.69801@rh95-02 last pending
[root@rh95-02 ~]#

[nttossc@dn-ocp-bastion ~]$ oc get vmi
NAME            AGE   PHASE     IP            NODENAME       READY
ocp-rh95-01     26h   Failed    10.130.2.24   dn-ocp-cmp01   False
ocp-rh95-02     26h   Running   10.130.0.86   dn-ocp-cmp02   True
ocp-rh95-ping   8d    Running   10.131.0.42   dn-ocp-cmp03   True
[nttossc@dn-ocp-bastion ~]$

My understanding is that this STONITH failure is expected behavior: the action fails because the virtual machine instance still exists. Is that correct?
I also believe that deleting the virtual machine instance is controlled by OCP and cannot be handled by fence_kubevirt.

(snip)
def get_power_status(conn, options):
    logging.debug("Starting get status operation")
    try:
        apiversion = options.get("--apiversion")
        namespace = _get_namespace(options)
        name = options.get("--plug")
        vmi_api = conn.resources.get(api_version=apiversion,
                                              kind='VirtualMachineInstance')
        vmi = vmi_api.get(name=name, namespace=namespace)
        return translate_status(vmi.status.phase)
    except ApiException as e:
        if e.status == 404:
            try:
                vm_api = conn.resources.get(api_version=apiversion, kind='VirtualMachine')
                vm = vm_api.get(name=name, namespace=namespace)
            except ApiException as e:
                logging.error("VM %s doesn't exist", name)
                fail(EC_FETCH_VM_UUID)
            return "off"
        logging.error("Failed to get power status, with API Exception: %s", e)
        fail(EC_STATUS)
    except Exception as e:
        logging.error("Failed to get power status, with Exception: %s", e)
        fail(EC_STATUS)

def translate_status(instance_status):
    if instance_status == "Running":
        return "on"
    return "unknown"
(snip)
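
To make the question concrete, here is a minimal sketch of how I read the excerpt above. translate_status is copied from the excerpt. wait_for_off is a hypothetical stand-in that I wrote for illustration; the agent's real wait-for-off polling is not part of the excerpt, so the loop is an assumption about its general shape, not the actual implementation. A phase of None is also my own convention for "VMI not found (404) while the VM object still exists", which the excerpt reports as "off".

```python
def translate_status(instance_status):
    # Same mapping as the quoted excerpt: only "Running" is recognized.
    if instance_status == "Running":
        return "on"
    return "unknown"


def wait_for_off(phases):
    """Hypothetical simplification of polling until the target reports "off".

    `phases` is the sequence of VMI phases seen on successive polls.
    None means the VMI lookup returned 404 but the VM object exists,
    which the quoted get_power_status reports as "off".
    """
    for phase in phases:
        status = "off" if phase is None else translate_status(phase)
        if status == "off":
            return True
    # Polls exhausted without ever seeing "off": the fence action fails.
    return False


# Scenario in this report: the VMI stays in the Failed phase.
print(wait_for_off(["Running", "Failed", "Failed"]))  # False

# Scenario where the VMI object is eventually deleted.
print(wait_for_off(["Running", None]))  # True
```

If this reading is right, a VMI that stays in the Failed phase is always reported as "unknown" and never as "off", which would match the repeated fencing failures shown in crm_mon above.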

Best Regards,
Hideo Yamauchi.
