
Cilium with maglev long delay managing services with several ports #32391

Open
2 of 3 tasks
tommasopozzetti opened this issue May 7, 2024 · 4 comments
Labels
  • area/loadbalancing: Impacts load-balancing and Kubernetes service implementations
  • feature/lb-only: Impacts cilium running in lb-only datapath mode
  • info-completed: The GH issue has received a reply from the author
  • kind/bug: This is a bug in the Cilium logic.
  • kind/community-report: This was reported by a user in the Cilium community, eg via Slack.
  • needs/triage: This issue requires triaging to establish severity and next steps.
  • sig/datapath: Impacts bpf/ or low-level forwarding details, including map management and monitor messages.

Comments

@tommasopozzetti

Is there an existing issue for this?

  • I have searched the existing issues

What happened?

Cilium with maglev load balancing enabled takes a very long time to update its internal structures when there are endpoint changes for services with a high number of ports (in the hundreds).
This is reproducible deterministically by creating a simple deployment with 4 replicas behind a Service that defines 400+ ports. It is sufficient to execute a rolling restart of the deployment to have connections going through the service fail for several minutes (while connections directly to the pods are working fine).
This problem disappears if the load balancing algorithm is switched to random, in which case the update is instantaneous.
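
For reference, a Service like the following is enough to reproduce (a sketch of a generator; the name, selector, and port numbers are arbitrary placeholders, not the exact manifest from my cluster):

```go
// genservice.go: emit a Service manifest with 400 ports for the repro.
// Name, selector, and port range are placeholders; adjust them to match
// the labels of your 4-replica test deployment.
package main

import "fmt"

func main() {
	fmt.Print(`apiVersion: v1
kind: Service
metadata:
  name: many-ports
spec:
  type: LoadBalancer
  selector:
    app: many-ports
  ports:
`)
	for i := 0; i < 400; i++ {
		p := 10000 + i
		fmt.Printf("  - name: p%d\n    port: %d\n    targetPort: 8080\n", p, p)
	}
}
```

Piping the output to kubectl apply -f - and then rolling-restarting the backing deployment triggers the outage.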

Cilium Version

1.15.2

Kernel Version

5.4

Kubernetes Version

1.27

Regression

No response

Sysdump

No response

Relevant log output

--- Many minutes of continuous lines of the following from cilium-dbg monitor ---
>> Service upserted: {...}
>> Service upserted: {...}
>> Service upserted: {...}

Anything else?

The logs above are continuous during the minutes of unreachability of the service and stop right when the service becomes available again. It looks like Cilium is processing one of these updates per port and per frontend IP, so a LoadBalancer service with an external IP plus one NodePort allocated per port multiplies the number of updates.
It also seems like Cilium is processing the same port and IP combination (same id in the JSON above) over and over again because of the rolling restart. Assuming the 4 pods backing the service before the restart are x.x.x.1, x.x.x.2, x.x.x.3 and x.x.x.4, and after the rollout they are y.y.y.1, y.y.y.2, y.y.y.3 and y.y.y.4, Cilium seems to process all of the intermediate backend sets in order:

  1. { x.x.x.1, x.x.x.2, x.x.x.3, x.x.x.4, y.y.y.1 }
  2. { x.x.x.2, x.x.x.3, x.x.x.4, y.y.y.1 }
  3. { x.x.x.2, x.x.x.3, x.x.x.4, y.y.y.1, y.y.y.2 }
  4. { x.x.x.3, x.x.x.4, y.y.y.1, y.y.y.2 }
  5. { x.x.x.4, y.y.y.1, y.y.y.2, y.y.y.3 }
  6. { x.x.x.4, y.y.y.1, y.y.y.2, y.y.y.3, y.y.y.4 }
  7. { y.y.y.1, y.y.y.2, y.y.y.3, y.y.y.4 }

The processing is so slow that the rollout has technically finished while Cilium is still working on update 2 (so it could in principle skip straight to 7), yet from the logs it appears to keep processing updates for IPs that have already rolled off.
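
As a rough model of the fan-out described above (these counts are my assumptions, not values measured from Cilium's internals):

```go
// Rough fan-out model for the repro above: one "Service upserted" event
// per (frontend, port) pair for every endpoints change of the rollout.
// All counts are assumptions, not values measured from Cilium.
package main

import "fmt"

func main() {
	const (
		ports     = 400 // ports defined on the Service
		frontends = 2   // LoadBalancer IP + NodePort per port
		changes   = 8   // approx. endpoint add/remove events for 4 replicas
	)
	fmt.Printf("upserts to process: %d\n", ports*frontends*changes) // 6400
}
```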

I realize that maglev maintains several internal maps to allow the consistent-hashing selection, and I imagine updating them is the reason for the slowdown, though several minutes of unavailability are still quite baffling. If this is expected behavior rather than a bug, it might be worth documenting it as a downside of maglev for services exposing several hundred ports.

Would it also be a potential optimization (perhaps not enabled by default) to have maglev maintain its maps per service instead of per port of the service? Two connections from the same source to two different ports of the same service would then land on the same backend, but the consistent-hashing benefits would be preserved and this overhead likely reduced substantially.
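
To illustrate why each of those upserts is expensive, here is a sketch of the standard Maglev table population from the original paper (not Cilium's actual implementation); any backend change forces the whole M-entry table to be refilled:

```go
// Sketch of Maglev lookup-table population per the Maglev paper, not
// Cilium's code. Every backend-set change rebuilds an M-entry table; if
// there is one table per (frontend, port), that work is repeated for each.
package main

import (
	"fmt"
	"hash/fnv"
)

const m = 16381 // table size (prime); Cilium's default

// hashOf is a stand-in for the paper's two hash functions, seeded by a byte.
func hashOf(s string, seed byte) uint64 {
	h := fnv.New64a()
	h.Write([]byte{seed})
	h.Write([]byte(s))
	return h.Sum64()
}

// buildTable fills all M slots; the cost grows with M and the backend count.
func buildTable(backends []string) []int {
	n := len(backends)
	offset := make([]uint64, n)
	skip := make([]uint64, n)
	for i, b := range backends {
		offset[i] = hashOf(b, 0) % m
		skip[i] = hashOf(b, 1)%(m-1) + 1 // nonzero; coprime with prime m
	}
	next := make([]uint64, n)
	table := make([]int, m)
	for j := range table {
		table[j] = -1 // empty
	}
	for filled := 0; filled < m; {
		for i := 0; i < n; i++ {
			c := (offset[i] + next[i]*skip[i]) % m
			for table[c] >= 0 { // probe past occupied slots
				next[i]++
				c = (offset[i] + next[i]*skip[i]) % m
			}
			table[c] = i
			next[i]++
			filled++
			if filled == m {
				break
			}
		}
	}
	return table
}

func main() {
	t := buildTable([]string{"y.y.y.1", "y.y.y.2", "y.y.y.3", "y.y.y.4"})
	fmt.Println("first 8 slots:", t[:8])
}
```

Refilling the whole table on any backend change is the core Maglev trade-off: cheap, consistent lookups at the cost of expensive updates, and per-port tables multiply that update cost.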
Thanks!

Cilium Users Document

  • Are you a user of Cilium? Please add yourself to the Users doc

Code of Conduct

  • I agree to follow this project's Code of Conduct
tommasopozzetti added the kind/bug, kind/community-report, and needs/triage labels on May 7, 2024
@squeed
Contributor

squeed commented May 14, 2024

@tommasopozzetti good find.

Would you be able to capture a cpu pprof during an update? We have developer-focused instructions, but I can help out if those aren't useful enough.

squeed added the need-more-info, feature/lb-only, and area/loadbalancing labels on May 14, 2024
@tommasopozzetti
Author

Hi @squeed, thanks for taking a look.
Here is a 10-second CPU pprof captured while the service is inaccessible and the cilium agent is stuck in the loop of Service upserted events. Let me know if 10s is enough or if I can provide any further info!
pprof-cpu.zip

github-actions bot added the info-completed label and removed the need-more-info label on May 14, 2024
@squeed
Contributor

squeed commented May 15, 2024

Indeed, looking at the pprof, it seems a lot of time is spent in GetLookupTable(). What's your maglev table size?

@tommasopozzetti
Author

@squeed I am running it with the default value of 16381.
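
For scale, a rough back-of-the-envelope with that default, reusing the assumed counts from my fan-out model above:

```go
// Back-of-the-envelope for the repro, assuming one M-entry table rebuild
// per (frontend, port) pair per endpoint change. All counts are my
// assumptions, not figures measured from Cilium.
package main

import "fmt"

func main() {
	const (
		m       = 16381 // default maglev table size
		tables  = 800   // 400 ports x (1 LoadBalancer IP + 1 NodePort)
		changes = 8     // approx. endpoint events for a 4-replica rollout
	)
	fmt.Printf("slot writes per endpoint change: %d\n", m*tables)          // ~13.1M
	fmt.Printf("slot writes for the whole rollout: %d\n", m*tables*changes) // ~105M
}
```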

squeed added the sig/datapath label on May 16, 2024