Cilium with Maglev: long delay managing services with several ports #32391
Labels
- `area/loadbalancing`: Impacts load-balancing and Kubernetes service implementations.
- `feature/lb-only`: Impacts cilium running in lb-only datapath mode.
- `info-completed`: The GH issue has received a reply from the author.
- `kind/bug`: This is a bug in the Cilium logic.
- `kind/community-report`: This was reported by a user in the Cilium community, e.g. via Slack.
- `needs/triage`: This issue requires triaging to establish severity and next steps.
- `sig/datapath`: Impacts bpf/ or low-level forwarding details, including map management and monitor messages.
What happened?
Cilium with Maglev load balancing enabled takes a very long time to update its internal structures when there are endpoint changes for services with a high number of ports (in the hundreds).
This is reproducible deterministically by creating a simple deployment with 4 replicas behind a Service that defines 400+ ports. It is sufficient to execute a rolling restart of the deployment to have connections going through the service fail for several minutes (while connections directly to the pods are working fine).
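For reference, a minimal repro sketch along these lines. The service name `maglev-repro`, the `app=repro` selector, target port 8080, and the 10000+ port range are arbitrary placeholders, not taken from the original report; the script just emits a many-port Service manifest as JSON, which kubectl also accepts:

```python
# Hypothetical repro helper: print a Service manifest with several hundred ports.
# All names, selectors and port numbers below are made up for illustration.
import json

NUM_PORTS = 400

service = {
    "apiVersion": "v1",
    "kind": "Service",
    "metadata": {"name": "maglev-repro"},
    "spec": {
        "type": "LoadBalancer",
        "selector": {"app": "repro"},  # assumes a 4-replica deployment with this label
        "ports": [
            {"name": f"p{p}", "port": p, "targetPort": 8080, "protocol": "TCP"}
            for p in range(10000, 10000 + NUM_PORTS)
        ],
    },
}

print(json.dumps(service, indent=2))
```

Piping the output to `kubectl apply -f -` and then running `kubectl rollout restart deployment/<name>` should trigger the stall, assuming Maglev is enabled.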
This problem disappears if the load balancing algorithm is switched to `random`, in which case the update is instantaneous.
Cilium Version
1.15.2
Kernel Version
5.4
Kubernetes Version
1.27
Regression
No response
Sysdump
No response
Relevant log output
Anything else?
The logs above are continuous during the minutes of unreachability of the service and stop right when the service becomes available again. It looks like Cilium processes one of these updates per port and per IP, so a LoadBalancer service, which gets an external IP and one NodePort allocated per service port in addition to the ClusterIP, multiplies this work even further.
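To make the scale of that work concrete, here is a rough back-of-envelope sketch. The table size of 16381 and the assumption that every backend add/remove triggers a full table rebuild per frontend are my own assumptions for illustration, not something confirmed from Cilium's implementation:

```python
# Back-of-envelope estimate of Maglev table work during the rollout described above.
# All numbers are assumptions, not measurements.
ports = 400                 # service ports
frontends_per_port = 3      # ClusterIP + LoadBalancer external IP + NodePort
table_size = 16381          # assumed default Maglev lookup table size (prime)
backend_events = 8          # 4 old pods removed + 4 new pods added during the rollout

table_rebuilds = ports * frontends_per_port * backend_events
slot_writes = table_rebuilds * table_size

print(f"{table_rebuilds} table rebuilds, ~{slot_writes / 1e6:.0f}M slot writes")
# -> 9600 table rebuilds, ~157M slot writes
```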
It also seems like Cilium is processing the same port and IP combination (same id in the JSON above) over and over again because of the rollout restart. Assuming the 4 pods backing the service before the restart are x.x.x.1, x.x.x.2, x.x.x.3 and x.x.x.4, and after the rollout they are y.y.y.1, y.y.y.2, y.y.y.3, y.y.y.4, Cilium seems to process all the updates in order:
Processing takes so long that the rollout is technically finished while Cilium is still handling update 2 (so it could in principle skip ahead and apply update 7 directly), yet from the logs it appears to still process updates for IPs that have already rolled off.
I realize that Maglev maintains several internal maps to allow consistent-hashing backend selection, and I imagine updating them is the reason for the slowdown, though several minutes of unavailability are still quite baffling. If this is expected behavior rather than a bug, it might be worth documenting it as a downside of Maglev for services exposing several hundred ports. Would it also be a potential optimization (maybe not enabled by default) to have Maglev maintain the consistent-selection maps per service instead of per service port? Two connections from the same source to two different ports of the same service would then land on the same backend, but the consistent-hashing benefits would be preserved and this overhead likely strongly reduced.
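For context on why per-frontend tables are expensive to rebuild, here is a minimal sketch of the table-population step from the original Maglev paper. This is not Cilium's actual Go implementation; the hash function and default table size are simplified assumptions:

```python
import hashlib

def _h(name: str, salt: str) -> int:
    """Stand-in hash; the real implementation uses different hash functions."""
    return int.from_bytes(hashlib.sha256(f"{salt}:{name}".encode()).digest()[:8], "big")

def maglev_table(backends: list[str], m: int = 16381) -> list[int]:
    """Populate a lookup table of m slots mapping slot -> backend index."""
    n = len(backends)
    offset = [_h(b, "offset") % m for b in backends]
    skip = [_h(b, "skip") % (m - 1) + 1 for b in backends]
    nxt = [0] * n
    table = [-1] * m
    filled = 0
    while filled < m:
        for i in range(n):
            # Walk backend i's preference list until a free slot is found.
            while True:
                slot = (offset[i] + nxt[i] * skip[i]) % m
                nxt[i] += 1
                if table[slot] == -1:
                    table[slot] = i
                    filled += 1
                    break
            if filled == m:
                break
    return table
```

Each rebuild touches all 16381 slots, and with one table per (IP, port) frontend a 400-port LoadBalancer service ends up with over a thousand tables to rebuild on each backend change; a per-service table as suggested above would cut that by roughly the number of ports.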
Thanks!