OCI Compute Instance: Instance Principal Authentication Failure Triggering Pacemaker VIP Failover #2020

Open
9roomdive opened this issue Feb 4, 2025 · 0 comments

OCI Compute Instance: Instance Principal Authentication Failure Triggering Pacemaker VIP Failover

  1. Issue Description
    We are operating a Pacemaker-based Virtual IP (VIP) failover configuration on OCI Compute instances.
    During the VIP management process, Instance Principal authentication intermittently fails, triggering an unintended VIP failover.
    This issue occurs when Pacemaker periodically monitors the VIP status and detects a "not running" state due to an authentication failure, leading to an unnecessary failover event.

  2. Environment Details
    Cloud Infrastructure: OCI (Oracle Cloud Infrastructure)
    OCI Compute Instance OS: Red Hat Enterprise Linux 8.10
    Pacemaker Version: (To be provided if needed)
    OCI CLI Version: (To be provided if needed)
    OCI IAM Policies: No specific issues observed
    OCI Dynamic Group Configuration: No specific issues observed

  3. Error Logs
    At the time of failover, the Pacemaker logs show an Instance Principal authentication failure:

Jan 29 19:51:38 sdildgsmdbp1 pacemaker-controld[2763401]: notice: Result of monitor operation for OCIVIP on sdildgsmdbp1: not running
Jan 29 19:51:38 sdildgsmdbp1 pacemaker-controld[2763401]: notice: OCIVIP_monitor_60000@sdildgsmdbp1 output [ ERROR: Failed retrieving certificates from localhost.
Instance principal auth is only possible from OCI compute instances.
Exception: {'status': 401, 'code': 'NotAuthenticated', 'message': 'The required information to complete authentication was not provided or was incorrect.'}

  4. Self-Analysis of the Issue
    Pacemaker's monitor operation uses OCI CLI to check the VIP status, but Instance Principal authentication intermittently fails.
    Running oci os ns get returns a 401 NotAuthenticated error.
    Pacemaker detects the VIP resource as "not running" and triggers a failover event.
    After the failover, OCI CLI authentication resumes normal operation.
    Based on this, we suspect that the intermittent failure of Instance Principal authentication causes Pacemaker to incorrectly identify a failure, leading to an unnecessary failover event.
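
The suspected chain above (a transient authentication error being reported as "not running") could be softened by retrying inside the monitor before giving up. The following is only a minimal sketch under stated assumptions: `check` is a hypothetical callable standing in for whatever OCI CLI call the resource agent makes, assumed to return exit code 0 when the VIP is confirmed on this node; the marker strings are copied from the log excerpt in this report; the attempt count and delay are illustrative. The OCF exit codes are the standard ones.

```python
import time
from typing import Callable, Tuple

# Standard OCF resource agent exit codes understood by Pacemaker.
OCF_SUCCESS = 0
OCF_ERR_GENERIC = 1
OCF_NOT_RUNNING = 7

# Substrings taken from the error output quoted in this report.
AUTH_ERROR_MARKERS = ("NotAuthenticated", "Failed retrieving certificates")


def is_auth_failure(output: str) -> bool:
    """Return True if the CLI output looks like an authentication error."""
    return any(marker in output for marker in AUTH_ERROR_MARKERS)


def monitor_with_retry(
    check: Callable[[], Tuple[int, str]],
    attempts: int = 3,
    delay_s: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Run a hypothetical VIP check, retrying only on authentication errors.

    check() is assumed to return (exit_code, combined_output), with
    exit_code 0 meaning the VIP was confirmed on this node.
    """
    for attempt in range(1, attempts + 1):
        code, output = check()
        if code == 0:
            return OCF_SUCCESS
        if not is_auth_failure(output):
            # A non-auth failure is reported as before.
            return OCF_NOT_RUNNING
        if attempt < attempts:
            sleep(delay_s)
    # Authentication never recovered: report an error rather than
    # claiming the VIP is not running.
    return OCF_ERR_GENERIC
```

The total retry time would have to stay below the monitor operation's timeout. Pacemaker still treats a non-success monitor result as a failure, so this only helps if the authentication error clears within the configured attempts.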

  5. Troubleshooting Steps Taken
    1️⃣ Verified OCI CLI authentication status
      Executed oci os ns get as the root user running Pacemaker.
      ✅ Result: Authentication failed (401 NotAuthenticated).

    2️⃣ Manually set the environment variable (OCI_CLI_AUTH=instance_principal) and rechecked authentication
      Set OCI_CLI_AUTH=instance_principal and reran oci os ns get.
      ✅ Result: Authentication succeeded.
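
The two checks above differ only in whether the authentication mode was set. A small sketch of how the same check could be run with the mode stated explicitly, so that it does not depend on the calling shell's environment, is shown below. It assumes the `oci` binary is on `PATH`; the helper names and the timeout value are hypothetical. It uses both the `--auth instance_principal` global option and the `OCI_CLI_AUTH` variable from step 2.

```python
import os
import subprocess
from typing import Dict, List, Optional


def build_oci_command(args: List[str]) -> List[str]:
    """Prefix an OCI CLI subcommand with an explicit auth mode."""
    return ["oci", "--auth", "instance_principal", *args]


def build_oci_env(base: Optional[Dict[str, str]] = None) -> Dict[str, str]:
    """Copy the environment and set OCI_CLI_AUTH as in step 2."""
    env = dict(os.environ if base is None else base)
    env["OCI_CLI_AUTH"] = "instance_principal"
    return env


def run_namespace_check(timeout_s: float = 30.0) -> subprocess.CompletedProcess:
    """Run `oci os ns get` with explicit instance principal auth.

    The caller inspects returncode, stdout and stderr; nothing is
    raised on a non-zero exit code.
    """
    return subprocess.run(
        build_oci_command(["os", "ns", "get"]),
        env=build_oci_env(),
        capture_output=True,
        text=True,
        timeout=timeout_s,
    )
```

Running `run_namespace_check()` from the same context as the Pacemaker resource agent, and logging `returncode` and `stderr` with a timestamp, may help show whether failures cluster at particular times.
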

  6. Additional Investigations & Questions
    We seek to determine the root cause of this intermittent Instance Principal authentication failure within Pacemaker.
    To further analyze the issue, we would like clarification on the following:

    Can Instance Principal authentication intermittently fail on OCI Compute instances?
      Could the Instance Metadata Service (http://169.254.169.254/opc/v2/instance/) experience delays or temporary failures in providing authentication details?
      Is there any known issue with OCI's internal services that provide Instance Principal authentication?

    Can OCI's back-end services confirm whether there were authentication failures for this Instance Principal?
      Are there OCI internal logs that can verify whether authentication failures occurred for this specific instance?
      Do the authentication failures correlate with specific time periods or potential network-related issues?

    What are the recommended best practices to ensure stable authentication when OCI CLI is executed within Pacemaker?
      Are there any OCI-recommended configurations for HA (High Availability) environments like Pacemaker to prevent Instance Principal authentication failures?
      Can OCI CLI's Instance Principal authentication be cached or retried to prevent temporary failures from triggering failovers?

    We would appreciate any insights from OCI regarding this issue.
    Are there additional factors we should investigate?
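
To gather evidence for the Instance Metadata Service question, a probe along the following lines could be run periodically on the affected node and its results logged. This is a sketch, not a diagnostic endorsed by OCI: the function names and timeout are hypothetical, and the URL is the one quoted above. The `Authorization: Bearer Oracle` header is what the v2 metadata endpoints expect.

```python
import time
import urllib.error
import urllib.request

# Endpoint quoted in this report.
IMDS_URL = "http://169.254.169.254/opc/v2/instance/"


def build_imds_request(url: str = IMDS_URL) -> urllib.request.Request:
    """Build a request carrying the fixed header required by IMDS v2."""
    return urllib.request.Request(url, headers={"Authorization": "Bearer Oracle"})


def probe_imds(timeout_s: float = 5.0, opener=urllib.request.urlopen) -> dict:
    """Time a single metadata request and record status or error."""
    started = time.monotonic()
    status = None
    error = None
    try:
        with opener(build_imds_request(), timeout=timeout_s) as resp:
            status = resp.status
    except OSError as exc:
        # URLError and HTTPError are OSError subclasses; HTTPError has .code.
        status = getattr(exc, "code", None)
        error = str(exc)
    return {"status": status, "error": error, "elapsed_s": time.monotonic() - started}
```

Comparing the timestamps of slow or failed probes with the monitor failures in the Pacemaker log would show whether the two correlate.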

## About the resources Pacemaker will use to manage OCI Floating IP by default
https://docs.oracle.com/en/learn/oci-ip-failover/index.html#task-2-configure-the-cluster-and-the-floating-ip
