Description
What happened:
I am experiencing intermittent, one-way cross-cluster connectivity failures in an EKS multi-cluster environment. Connections to specific pods fail (time out), while other pods on the same node remain reachable.
The root cause is a Policy-Based Routing (PBR) integration failure between the Submariner Route Agent and the AWS CNI.
- The AWS CNI creates dedicated PBR rules and custom routing tables (e.g., `table 2`) for pods assigned IPs from secondary ENIs (e.g., `ens6`).
- The Submariner Route Agent correctly populates the `main` routing table with the necessary `vx-submariner` routes for remote clusters.
- However, the Route Agent fails to detect or populate these custom, CNI-created tables (e.g., `table 2`).
- As a result, any egress traffic (including replies to `ping` or `traceroute`) from a pod forced to use this custom table is not routed to the `vx-submariner` tunnel. Instead, it follows the table's default route (the VPC gateway via `ens6`), causing the packet to be blackholed. An example of the node state is shown below.
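For illustration, this is roughly what an affected node looks like in my environment; the pod IP, table number, VTEP address, and remote CIDR below are placeholders, not values copied from the actual cluster:

```console
# PBR rule installed by the AWS CNI for a pod whose IP comes from the secondary ENI
$ ip rule show | grep 10.0.45.12
1536:   from 10.0.45.12 lookup 2

# The main table has the remote-cluster route programmed by the Route Agent
$ ip route show table main | grep vx-submariner
10.1.0.0/16 via 240.18.0.5 dev vx-submariner

# The CNI-created table only has the VPC default route, so replies from the pod
# leave via ens6 and are blackholed instead of entering the tunnel
$ ip route show table 2
default via 10.0.32.1 dev ens6
10.0.32.1 dev ens6 scope link
```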
Recreating the failing pod temporarily resolves the issue because the PBR rule is torn down, and the new pod instance (by chance) often uses the main table.
What you expected to happen:
The Submariner Route Agent should detect all active routing tables used by routable pods, including custom PBR tables dynamically created by the AWS CNI.
It should ensure that all necessary `vx-submariner` routes (for remote cluster/service CIDRs) are replicated to all relevant tables (i.e., the `main` table and any custom tables such as `table 2`) to guarantee consistent cross-cluster egress routing for all pods, regardless of which ENI or routing table they use.
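As a stopgap on an affected node, manually copying the `vx-submariner` routes from the `main` table into the CNI-created table restores connectivity for the impacted pods until the PBR rules change again. This is only a sketch of the workaround I used, assuming `table 2` is the custom table in question:

```sh
# Copy every vx-submariner route from the main table into the CNI-created table
# (table 2 is an example; adjust to the table referenced by the pod's ip rule).
ip route show table main | grep vx-submariner | while read -r route; do
  ip route replace $route table 2   # "replace" so re-running is idempotent
done
```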
How to reproduce it (as minimally and precisely as possible):
- Deploy Submariner in an EKS cluster using the AWS CNI in its default PBR-enabled mode.
- Create enough pods on a single node to force the AWS CNI to allocate IPs from a secondary ENI (e.g., `ens6`).
- Identify a pod that has been assigned an IP on this secondary ENI.
- Confirm this by logging into the node and finding a specific PBR rule for the pod's IP (e.g., `ip rule show | grep <pod_ip>`).
- Verify that the custom table (e.g., `ip route show table 2`) is missing the `vx-submariner` routes, while `ip route show table main` contains them (a diagnostic sketch for this check follows the list).
- Attempt to `ping` or `traceroute` this specific pod from a remote (Submariner-connected) cluster.
- Observe the `traceroute` failing (timing out) after reaching the destination cluster's gateway.
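To make the verification step easier to repeat across many pods, I used a loop along these lines on the node (a rough sketch; it assumes the CNI-created tables appear as numeric `lookup` targets in `ip rule` output):

```sh
#!/bin/sh
# Print every numeric routing table referenced by an ip rule that is missing
# vx-submariner routes; pods pinned to these tables will blackhole replies.
for table in $(ip rule show | awk '/lookup [0-9]+$/ {print $NF}' | sort -u); do
  if ! ip route show table "$table" | grep -q vx-submariner; then
    echo "table $table is missing vx-submariner routes:"
    ip route show table "$table" | sed 's/^/  /'
  fi
done
```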
Anything else we need to know?:
Environment:
- Subctl version: 0.20
- Cloud provider or hardware configuration: AWS EKS 1.32