Cilium with Maglev: long delay managing services with several ports #32391
Labels
- `area/loadbalancing`: Impacts load-balancing and Kubernetes service implementations.
- `feature/lb-only`: Impacts cilium running in lb-only datapath mode.
- `info-completed`: The GH issue has received a reply from the author.
- `kind/bug`: This is a bug in the Cilium logic.
- `kind/community-report`: This was reported by a user in the Cilium community, e.g. via Slack.
- `needs/triage`: This issue requires triaging to establish severity and next steps.
- `sig/datapath`: Impacts bpf/ or low-level forwarding details, including map management and monitor messages.
What happened?
Cilium with Maglev load balancing enabled takes a very long time to update its internal structures when there are endpoint changes for services with a high number of ports (in the hundreds).
This is reproducible deterministically by creating a simple deployment with 4 replicas behind a Service that defines 400+ ports. It is sufficient to execute a rolling restart of the deployment to have connections going through the service fail for several minutes (while connections directly to the pods are working fine).
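For reference, a minimal repro sketch along these lines. The service name `maglev-repro`, the `app=repro` selector, target port 8080, and the 10000+ port range are arbitrary placeholders, not taken from the original report; the script just emits a many-port Service manifest as JSON, which kubectl also accepts:

```python
# Hypothetical repro helper: print a Service manifest with several hundred ports.
# All names, selectors and port numbers below are made up for illustration.
import json

NUM_PORTS = 400

service = {
    "apiVersion": "v1",
    "kind": "Service",
    "metadata": {"name": "maglev-repro"},
    "spec": {
        "type": "LoadBalancer",
        "selector": {"app": "repro"},  # assumes a 4-replica deployment with this label
        "ports": [
            {"name": f"p{p}", "port": p, "targetPort": 8080, "protocol": "TCP"}
            for p in range(10000, 10000 + NUM_PORTS)
        ],
    },
}

print(json.dumps(service, indent=2))
```

Piping the output to `kubectl apply -f -` and then running `kubectl rollout restart deployment/<name>` should trigger the stall, assuming Maglev is enabled.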
This problem disappears if the load balancing algorithm is switched to `random`, in which case the update is instantaneous.
Cilium Version
1.15.2
Kernel Version
5.4
Kubernetes Version
1.27
Regression
No response
Sysdump
No response
Relevant log output
Anything else?
The logs above are continuous during the minutes of unreachability of the service and stop right when the service becomes available again. It looks like Cilium processes one of these updates per port and per IP, so a LoadBalancer service, which gets an external IP and one NodePort allocated per service port in addition to the ClusterIP, multiplies this work even further.
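To make the scale of that work concrete, here is a rough back-of-envelope sketch. The table size of 16381 and the assumption that every backend add/remove triggers a full table rebuild per frontend are my own assumptions for illustration, not something confirmed from Cilium's implementation:

```python
# Back-of-envelope estimate of Maglev table work during the rollout described above.
# All numbers are assumptions, not measurements.
ports = 400                 # service ports
frontends_per_port = 3      # ClusterIP + LoadBalancer external IP + NodePort
table_size = 16381          # assumed default Maglev lookup table size (prime)
backend_events = 8          # 4 old pods removed + 4 new pods added during the rollout

table_rebuilds = ports * frontends_per_port * backend_events
slot_writes = table_rebuilds * table_size

print(f"{table_rebuilds} table rebuilds, ~{slot_writes / 1e6:.0f}M slot writes")
# -> 9600 table rebuilds, ~157M slot writes
```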
It also seems like Cilium is processing the same port and IP combination (same id in the JSON above) over and over again because of the rollout restart. Assuming the 4 pods backing the service before the restart are x.x.x.1, x.x.x.2, x.x.x.3 and x.x.x.4, and after the rollout they are y.y.y.1, y.y.y.2, y.y.y.3, y.y.y.4, Cilium seems to process all the updates in order:
Processing takes so long that the rollout is technically finished while Cilium is still handling update 2 (so it could in principle skip ahead and apply update 7 directly), yet from the logs it appears to still process updates for IPs that have already rolled off.
I realize that Maglev maintains several internal maps to allow consistent-hashing backend selection, and I imagine updating them is the reason for the slowdown, though several minutes of unavailability are still quite baffling. If this is expected behavior rather than a bug, it might be worth documenting it as a downside of Maglev for services exposing several hundred ports. Would it also be a potential optimization (maybe not enabled by default) to have Maglev maintain the consistent-selection maps per service instead of per service port? Two connections from the same source to two different ports of the same service would then land on the same backend, but the consistent-hashing benefits would be preserved and this overhead likely strongly reduced.
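For context on why per-frontend tables are expensive to rebuild, here is a minimal sketch of the table-population step from the original Maglev paper. This is not Cilium's actual Go implementation; the hash function and default table size are simplified assumptions:

```python
import hashlib

def _h(name: str, salt: str) -> int:
    """Stand-in hash; the real implementation uses different hash functions."""
    return int.from_bytes(hashlib.sha256(f"{salt}:{name}".encode()).digest()[:8], "big")

def maglev_table(backends: list[str], m: int = 16381) -> list[int]:
    """Populate a lookup table of m slots mapping slot -> backend index."""
    n = len(backends)
    offset = [_h(b, "offset") % m for b in backends]
    skip = [_h(b, "skip") % (m - 1) + 1 for b in backends]
    nxt = [0] * n
    table = [-1] * m
    filled = 0
    while filled < m:
        for i in range(n):
            # Walk backend i's preference list until a free slot is found.
            while True:
                slot = (offset[i] + nxt[i] * skip[i]) % m
                nxt[i] += 1
                if table[slot] == -1:
                    table[slot] = i
                    filled += 1
                    break
            if filled == m:
                break
    return table
```

Each rebuild touches all 16381 slots, and with one table per (IP, port) frontend a 400-port LoadBalancer service ends up with over a thousand tables to rebuild on each backend change; a per-service table as suggested above would cut that by roughly the number of ports.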
Thanks!