OCI Compute Instance: Instance Principal Authentication Failure Triggering Pacemaker VIP Failover #2020

9roomdive · 2025-02-04T03:48:25Z

Issue Description
We are operating a Pacemaker-based Virtual IP (VIP) failover configuration on OCI Compute instances.
During the VIP management process, Instance Principal authentication intermittently fails, triggering an unintended VIP failover.
This issue occurs when Pacemaker periodically monitors the VIP status and detects a "not running" state due to an authentication failure, leading to an unnecessary failover event.
Environment Details
Cloud Infrastructure: OCI (Oracle Cloud Infrastructure)
OCI Compute Instance OS: Red Hat Enterprise Linux 8.10
Pacemaker Version: (To be provided if needed)
OCI CLI Version: (To be provided if needed)
OCI IAM Policies: No specific issues observed
OCI Dynamic Group Configuration: No specific issues observed
Error Logs
At the time of failover, Pacemaker logs indicate Instance Principal authentication failure:

Jan 29 19:51:38 sdildgsmdbp1 pacemaker-controld[2763401]: notice: Result of monitor operation for OCIVIP on sdildgsmdbp1: not running
Jan 29 19:51:38 sdildgsmdbp1 pacemaker-controld[2763401]: notice: OCIVIP_monitor_60000@sdildgsmdbp1 output [ ERROR: Failed retrieving certificates from localhost.
Instance principal auth is only possible from OCI compute instances.
Exception: {'status': 401, 'code': 'NotAuthenticated', 'message': 'The required information to complete authentication was not provided or was incorrect.'}
Self-Analysis of the Issue
Pacemaker's monitor operation uses OCI CLI to check the VIP status, but Instance Principal authentication intermittently fails.
Running oci os ns get returns a 401 NotAuthenticated error.
Pacemaker detects the VIP resource as "not running" and triggers a failover event.
After the failover, OCI CLI authentication resumes normal operation.
Based on this, we suspect that the intermittent failure of Instance Principal authentication causes Pacemaker to incorrectly identify a failure, leading to an unnecessary failover event.
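
The chain described above depends on the monitor treating an authentication error the same way as a missing VIP. A minimal Python sketch of separating the two cases follows. The names `CliResult` and `classify_cli_result` are hypothetical, and the marker strings are simply copied from the log excerpt in this issue; this is not how the actual resource agent is written.

```python
from enum import Enum

# Substrings copied from the log excerpt in this issue. Treating them as
# reliable indicators of an auth failure is an assumption of this sketch.
AUTH_FAILURE_MARKERS = (
    "NotAuthenticated",
    "Failed retrieving certificates",
)


class CliResult(Enum):
    OK = "ok"                        # CLI call succeeded; its output can be trusted
    AUTH_FAILURE = "auth_failure"    # CLI could not authenticate; VIP state is unknown
    OTHER_FAILURE = "other_failure"  # any other non-zero exit


def classify_cli_result(exit_code: int, output: str) -> CliResult:
    """Classify one OCI CLI invocation from its exit code and captured output."""
    if exit_code == 0:
        return CliResult.OK
    if any(marker in output for marker in AUTH_FAILURE_MARKERS):
        return CliResult.AUTH_FAILURE
    return CliResult.OTHER_FAILURE
```

With a distinction like this, a monitor could reserve "not running" (OCF exit code 7) for the case where the CLI call succeeded and the VIP was genuinely absent, and handle `AUTH_FAILURE` separately, for example by retrying.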

Troubleshooting Steps Taken
1️⃣ Verified OCI CLI authentication status

Ran oci os ns get as root, the user under which Pacemaker runs.
❌ Result: Authentication failed (401 NotAuthenticated).
2️⃣ Manually set the environment variable (OCI_CLI_AUTH=instance_principal) and rechecked authentication

Set OCI_CLI_AUTH=instance_principal and reran oci os ns get.
✅ Result: Authentication succeeded.
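
Since the call succeeded once OCI_CLI_AUTH was set, one thing worth ruling out is that the CLI is sometimes invoked without that setting. The sketch below pins the variable explicitly in the child process environment instead of relying on whatever the caller inherited. The helper names `build_cli_env` and `run_ns_get` are hypothetical, and the sketch assumes the `oci` executable is on PATH.

```python
import os
import subprocess
from typing import Mapping, Optional


def build_cli_env(base_env: Optional[Mapping[str, str]] = None) -> dict:
    """Return a copy of the environment with instance principal auth forced on."""
    env = dict(os.environ if base_env is None else base_env)
    env["OCI_CLI_AUTH"] = "instance_principal"
    return env


def run_ns_get(timeout_s: float = 30.0) -> subprocess.CompletedProcess:
    """Run `oci os ns get` with the pinned environment and capture its output."""
    return subprocess.run(
        ["oci", "os", "ns", "get"],
        env=build_cli_env(),
        capture_output=True,
        text=True,
        timeout=timeout_s,
        check=False,  # the caller inspects returncode and stderr itself
    )
```

The same effect is available on the command line with the CLI's `--auth instance_principal` option.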
Additional Investigations & Questions
We seek to determine the root cause of this intermittent Instance Principal authentication failure within Pacemaker.
To further analyze the issue, we would like clarification on the following:

- Is it possible that Instance Principal authentication can intermittently fail on OCI Compute Instances?
  - Could the Instance Metadata Service (http://169.254.169.254/opc/v2/instance/) experience delays or temporary failures in providing authentication details?
  - Is there any known issue with OCI's internal services that provide Instance Principal authentication?
- Can OCI's back-end services confirm if there were authentication failures for this Instance Principal?
  - Are there OCI internal logs that can verify whether authentication failures occurred for this specific instance?
  - Do the authentication failures correlate with specific time periods or potential network-related issues?
- What are the recommended best practices to ensure stable authentication when OCI CLI is executed within Pacemaker?
  - Are there any OCI-recommended configurations for HA (High Availability) environments like Pacemaker to prevent Instance Principal authentication failures?
  - Can OCI CLI's Instance Principal authentication be cached or retried to prevent temporary failures from triggering failovers?
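
To make the retry question concrete, here is a minimal Python sketch of the kind of retry we have in mind: the check is repeated a few times with exponential backoff before a failure is reported. `run_with_retry` and its parameters are hypothetical, and `check` stands in for whatever call the monitor makes. We do not know whether OCI recommends this approach.

```python
import time
from typing import Callable, Tuple


def run_with_retry(
    check: Callable[[], Tuple[bool, str]],
    attempts: int = 3,
    base_delay_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[bool, str]:
    """Call `check` up to `attempts` times, doubling the delay after each failure.

    `check` returns (ok, output). The result of the first successful call is
    returned, or the result of the last attempt if every call fails.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    output = ""
    for attempt in range(attempts):
        ok, output = check()
        if ok:
            return True, output
        if attempt < attempts - 1:
            # Delays grow as base, 2*base, 4*base, ...
            sleep(base_delay_s * (2 ** attempt))
    return False, output
```

The total time spent retrying would need to stay below the timeout of the Pacemaker monitor operation, otherwise the operation itself times out.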
We would appreciate any insights from OCI regarding this issue.
Are there additional factors we should investigate?
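
One factor we could check ourselves is whether the metadata endpoint mentioned above is slow or unreachable around the time of a failure. The following Python sketch probes it and records the status and latency. The function names and the 2-second timeout are assumptions; the `Authorization: Bearer Oracle` header is the one the v2 metadata endpoints require.

```python
import time
import urllib.error
import urllib.request
from typing import Optional, Tuple

IMDS_URL = "http://169.254.169.254/opc/v2/instance/"


def build_imds_request(url: str = IMDS_URL) -> urllib.request.Request:
    """Build a GET request carrying the header the v2 metadata endpoints expect."""
    return urllib.request.Request(url, headers={"Authorization": "Bearer Oracle"})


def probe_imds(url: str = IMDS_URL, timeout_s: float = 2.0) -> Tuple[Optional[int], float, str]:
    """Return (http_status or None, elapsed_seconds, error_text) for one probe."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(build_imds_request(url), timeout=timeout_s) as resp:
            status = resp.status
    except urllib.error.HTTPError as exc:
        # The service answered, but with an error status.
        status = exc.code
    except (urllib.error.URLError, OSError) as exc:
        # No usable answer: connection refused, timeout, and similar.
        return None, time.monotonic() - start, repr(exc)
    return status, time.monotonic() - start, ""
```

Running `probe_imds()` on a schedule (for example from cron) and logging the result with a timestamp would let us compare slow or failed probes against the Pacemaker failover times.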

## About the resources Pacemaker uses by default to manage the OCI floating IP
https://docs.oracle.com/en/learn/oci-ip-failover/index.html#task-2-configure-the-cluster-and-the-floating-ip
