
Submariner Route Agent fails to populate custom PBR tables created by AWS CNI, causing intermittent cross-cluster connectivity loss #3697

@jakkwis

Description


What happened:

I am experiencing intermittent, one-way cross-cluster connectivity failures in an EKS multi-cluster environment. Connections to specific pods fail (time out), while other pods on the same node remain reachable.

The root cause is a Policy-Based Routing (PBR) integration failure between the Submariner Route Agent and the AWS CNI.

  1. The AWS CNI creates dedicated PBR rules and custom routing tables (e.g., table 2) for pods assigned IPs from secondary ENIs (e.g., ens6).
  2. The Submariner Route Agent correctly populates the main routing table with the necessary vx-submariner routes for remote clusters.
  3. However, the Route Agent fails to detect or populate these custom, CNI-created tables (e.g., table 2).
  4. As a result, any egress traffic (including replies to ping or traceroute) from a pod forced to use this custom table is not routed to the vx-submariner tunnel. Instead, it follows the table's default route (the VPC gateway via ens6), causing the packet to be blackholed.
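The blackholing mechanism in steps 1–4 can be sketched as a toy model of the Linux routing decision (the table number, CIDRs, and interface names below are illustrative placeholders, not values from a real node):

```python
# Toy model of policy-based routing on the node. All IPs, CIDRs, table
# numbers, and interface names are illustrative assumptions.

REMOTE_CIDR = "10.1.0.0/16"  # a remote cluster's pod CIDR (assumed)

# Routes per table. The Route Agent installs the vx-submariner route
# only in main; the AWS CNI's custom table (here "2") never gets it.
tables = {
    "main": {REMOTE_CIDR: "vx-submariner", "default": "ens5"},
    "2":    {"default": "ens6"},  # custom table: no tunnel route
}

# PBR rules: pod IPs on the secondary ENI are pinned to the custom table.
rules = {"10.0.2.15": "2"}  # pod IP -> lookup table

def egress_interface(src_ip: str, dst_cidr: str) -> str:
    """Pick the table via the PBR rules, then match the destination
    (longest-prefix match reduced to exact-CIDR-or-default here)."""
    table = tables[rules.get(src_ip, "main")]
    return table.get(dst_cidr, table["default"])

# Pod on the primary ENI: reply to the remote cluster uses the tunnel.
assert egress_interface("10.0.1.10", REMOTE_CIDR) == "vx-submariner"
# Pod pinned to table 2: reply leaks out the VPC-facing ENI, blackholed.
assert egress_interface("10.0.2.15", REMOTE_CIDR) == "ens6"
```

The second assertion is the bug: the reply packet's source address selects table 2, which has no route for the remote CIDR, so the table's default route wins.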

Recreating the failing pod temporarily resolves the issue because the PBR rule is torn down, and the new pod instance (by chance) often uses the main table.

What you expected to happen:

The Submariner Route Agent should detect all active routing tables used by routable pods, including custom PBR tables dynamically created by the AWS CNI.

It should ensure that all necessary vx-submariner routes (for remote cluster/service CIDRs) are replicated to all relevant tables (i.e., table main and any custom tables like table 2) to guarantee consistent cross-cluster egress routing for all pods, regardless of which ENI or routing table they use.
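The expected behavior amounts to a route-replication pass. This is a hypothetical sketch of that logic, not Submariner's actual code: copy every remote-CIDR route from the main table into each custom table, without disturbing routes the CNI already installed.

```python
# Hypothetical sketch of the requested reconciliation, using the same
# toy table model as above. Names and values are illustrative.

def replicate_remote_routes(tables: dict, remote_cidrs: list) -> dict:
    """Ensure every non-main table carries the same remote-cluster
    routes as the main table."""
    for name, table in tables.items():
        if name == "main":
            continue
        for cidr in remote_cidrs:
            # setdefault leaves any route the CNI already installed alone
            table.setdefault(cidr, tables["main"][cidr])
    return tables

tables = {
    "main": {"10.1.0.0/16": "vx-submariner", "default": "ens5"},
    "2":    {"default": "ens6"},  # AWS CNI custom table, missing tunnel route
}
replicate_remote_routes(tables, ["10.1.0.0/16"])

assert tables["2"]["10.1.0.0/16"] == "vx-submariner"  # remote CIDR now reachable
assert tables["2"]["default"] == "ens6"               # CNI's default route preserved
```

In iproute2 terms, the real agent would enumerate tables referenced by `ip rule show` and issue the equivalent of `ip route add <remote_cidr> dev vx-submariner table <N>` for each missing entry.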

How to reproduce it (as minimally and precisely as possible):

  1. Deploy Submariner in an EKS cluster using AWS CNI in its default PBR-enabled mode.
  2. Create enough pods on a single node to force the AWS CNI to allocate IPs from a secondary ENI (e.g., ens6).
  3. Identify a pod that has been assigned an IP on this secondary ENI.
  4. Confirm this by logging into the node and finding a specific PBR rule for the pod's IP (e.g., ip rule show | grep <pod_ip>).
  5. Verify that the custom table (e.g., ip route show table 2) is missing the vx-submariner routes, while ip route show table main contains them.
  6. Attempt to ping or traceroute this specific pod from a remote (Submariner-connected) cluster.
  7. Observe the traceroute failing (timing out) after reaching the destination cluster's gateway.

Anything else we need to know?:

Environment:

  • Subctl version: 0.20
  • Cloud provider or hardware configuration: AWS EKS 1.32

Metadata

Assignees: no one assigned
Labels: bug (Something isn't working), help wanted (Looking for someone to work on this)
Status: Backlog
Milestone: none