Description
Hi All,
Please tell me how fence_kubevirt works.
I have built a Pacemaker cluster on OCP-V.
The ACT and STB virtual machines are placed on different Compute nodes, and their disks reside on NFS storage that is reached over a storage LAN separate from the management LAN.
Step 1) Disconnect the NFS LAN of the Compute node where the ACT virtual machine is running.
Step 2) STB detects the ACT failure, and STONITH via fence_kubevirt is executed. On the Web console, the target virtual machine transitions from Stopping to Stopped and appears to have stopped correctly.
Step 3) However, fence_kubevirt does not report success, because the virtual machine instance remains (Failed in oc get vmi).
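To see exactly what fence_kubevirt observes at Step 3, the status query the agent performs can be reproduced from the bastion with the Kubernetes dynamic client. The following is only a sketch: the kubeconfig handling and the namespace ("rh95") are assumptions, since they are not shown in this report.

# Sketch: reproduce the VMI phase lookup that fence_kubevirt performs.
# Assumptions: a usable kubeconfig for the current user, the VMs live in a
# placeholder namespace "rh95", and KubeVirt serves kubevirt.io/v1.
from kubernetes import config, dynamic
from kubernetes.client import api_client

client = dynamic.DynamicClient(
    api_client.ApiClient(configuration=config.load_kube_config())
)
vmi_api = client.resources.get(api_version="kubevirt.io/v1",
                               kind="VirtualMachineInstance")

# Same lookup the agent does with --plug and the namespace option.
vmi = vmi_api.get(name="ocp-rh95-01", namespace="rh95")
print(vmi.status.phase)  # prints "Failed" in the situation described above

The cluster state and the VMI list at that point were as follows.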
[root@rh95-02 ~]# crm_mon -rfA1
Cluster Summary:
  * Stack: corosync (Pacemaker is running)
  * Current DC: rh95-02 (version 2.1.8-3.el9-3980678f0) - partition with quorum
  * Last updated: Thu May 29 12:03:03 2025 on rh95-02
  * Last change: Thu May 29 12:01:11 2025 by hacluster via hacluster on rh95-02
  * 2 nodes configured
  * 10 resource instances configured

Node List:
  * Node rh95-01: UNCLEAN (offline)
  * Online: [ rh95-02 ]

Full List of Resources:
  * Clone Set: pgsql-clone [pgsql] (promotable):
    * pgsql (ocf:linuxhajp:pgsql): Unpromoted rh95-01 (UNCLEAN)
    * Unpromoted: [ rh95-02 ]
  * Resource Group: primary-group:
    * ipaddr-primary (ocf:heartbeat:IPaddr2): Started rh95-01 (UNCLEAN)
    * ipaddr-replication (ocf:heartbeat:IPaddr2): Started rh95-01 (UNCLEAN)
  * Clone Set: ping-clone [ping]:
    * Started: [ rh95-02 ]
    * Stopped: [ rh95-01 ]
  * Clone Set: storage-mon-clone [storage-mon]:
    * Started: [ rh95-02 ]
    * Stopped: [ rh95-01 ]
  * fence1-kubevirt (stonith:fence_kubevirt): Started rh95-02
  * fence2-kubevirt (stonith:fence_kubevirt): Stopped

Node Attributes:
  * Node: rh95-02:
    * master-pgsql : 100
    * pgsql-data-status : STREAMING|SYNC
    * pgsql-master-baseline : 0000000015000060
    * pgsql-status : HS:alone
    * ping-status : 1

Migration Summary:

Failed Fencing Actions:
  * reboot of rh95-01 by rh95-02 for pacemaker-controld.69801@rh95-02 last failed at 2025-05-29 12:03:03.038906 +09:00
  * reboot of rh95-01 by rh95-02 for pacemaker-controld.69801@rh95-02 last failed at 2025-05-29 12:03:01.356906 +09:00
  * reboot of rh95-01 by rh95-02 for pacemaker-controld.69801@rh95-02 last failed at 2025-05-29 12:02:59.685906 +09:00
(snip)

Pending Fencing Actions:
  * reboot of rh95-01 for pacemaker-controld.69801@rh95-02 last pending
[root@rh95-02 ~]#
[nttossc@dn-ocp-bastion ~]$ oc get vmi
NAME            AGE   PHASE     IP            NODENAME       READY
ocp-rh95-01     26h   Failed    10.130.2.24   dn-ocp-cmp01   False
ocp-rh95-02     26h   Running   10.130.0.86   dn-ocp-cmp02   True
ocp-rh95-ping   8d    Running   10.131.0.42   dn-ocp-cmp03   True
[nttossc@dn-ocp-bastion ~]$
Is my understanding correct that the STONITH failure above is the specified behavior, i.e. it fails because the virtual machine instance still remains?
I also believe that this cleanup (deletion of the virtual machine instance) is handled on the OCP side and cannot be performed by fence_kubevirt itself.
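To illustrate that point (not as a recommended workaround): the leftover VMI object would have to be removed from outside the agent, for example with "oc delete vmi ocp-rh95-01" or with the same kind of dynamic-client call the agent makes, and only once the object is gone does the 404 branch in the code below report "off". The namespace is again a placeholder.

# Illustration only: remove the leftover Failed VMI object so that the next
# get_power_status() call takes the 404 branch and returns "off".
# Assumptions: kubeconfig and namespace as in the earlier sketch.
from kubernetes import config, dynamic
from kubernetes.client import api_client

client = dynamic.DynamicClient(
    api_client.ApiClient(configuration=config.load_kube_config())
)
vmi_api = client.resources.get(api_version="kubevirt.io/v1",
                               kind="VirtualMachineInstance")
vmi_api.delete(name="ocp-rh95-01", namespace="rh95")

For reference, here is the relevant part of fence_kubevirt's get_power_status():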
(snip)
def get_power_status(conn, options):
    logging.debug("Starting get status operation")
    try:
        apiversion = options.get("--apiversion")
        namespace = _get_namespace(options)
        name = options.get("--plug")
        vmi_api = conn.resources.get(api_version=apiversion,
                                     kind='VirtualMachineInstance')
        vmi = vmi_api.get(name=name, namespace=namespace)
        return translate_status(vmi.status.phase)
    except ApiException as e:
        if e.status == 404:
            try:
                vm_api = conn.resources.get(api_version=apiversion, kind='VirtualMachine')
                vm = vm_api.get(name=name, namespace=namespace)
            except ApiException as e:
                logging.error("VM %s doesn't exist", name)
                fail(EC_FETCH_VM_UUID)
            return "off"
        logging.error("Failed to get power status, with API Exception: %s", e)
        fail(EC_STATUS)
    except Exception as e:
        logging.error("Failed to get power status, with Exception: %s", e)
        fail(EC_STATUS)

def translate_status(instance_status):
    if instance_status == "Running":
        return "on"
    return "unknown"
(snip)
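Reading the excerpt above: translate_status() maps only the Running phase to "on", so a VMI left in the Failed phase is reported as "unknown" rather than "off", and "off" is returned only on the 404 path, i.e. once the VMI object itself no longer exists. A quick standalone check (simply a copy of the helper above):

# Standalone copy of translate_status() from the excerpt above, showing how
# the phases listed by "oc get vmi" are mapped by the agent.
def translate_status(instance_status):
    if instance_status == "Running":
        return "on"
    return "unknown"

print(translate_status("Running"))  # "on"      -> ocp-rh95-02
print(translate_status("Failed"))   # "unknown" -> ocp-rh95-01, so the off
                                    # action is never confirmed; only a 404
                                    # (VMI object gone) yields "off"

This seems to match the Failed Fencing Actions shown by crm_mon above.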
Best Regards,
Hideo Yamauchi.