Conversation
Initial comment: could you please follow https://github.com/cilium/design-cfps?tab=readme-ov-file#how-to-create-cfps? In particular, could you file an issue at https://github.com/cilium/cilium to get a CFP number and embed it here, then also follow the standard CFP template? The CFP template is intended to structure the discussion to ensure the "why" is covered, as well as to identify the impact of the proposal and any other key questions that you would like to draw attention to.
* It is complex to write and hard to debug if something goes wrong.
**The big secret**: Most of the memory savings (99%) come from letting pods share the *same set of rules* (e.g. "All web servers use Policy Set 1"). We can do this with standard BPF features, without needing Arena.
Currently each endpoint has a separate bpf policy map, tied to the endpoint's bpf program, even if the endpoints happen to have the same security identity. This used to be necessary due to endpoint-specific config options. If the intent is only to share the policy map between multiple endpoints, this should be achievable without any datapath changes, by sharing the same bpf map between the endpoints.
Policy map sharing could be based on the endpoints already sharing the same security identity. Since it may still be possible for endpoint-specific config to have an effect here, this should be an opt-in via a command line/helm option. This should also be rather straightforward to implement.
It is also possible that endpoints with different security identities generate the same policy map, possibly from the same or different rules. Detecting that the selector policy is the same, even if generated for a different security identity, could work as the driver for deduplication. Since each endpoint already points to a specific bpf policy map, this option should also be possible without any bpf datapath changes, by simply reusing the same bpf map for all of these endpoints.
One unknown in the above cases is the effect of named ports, which can be endpoint specific.
Where bpf datapath work would be useful is in deduplicating and sharing LPM tries without the remote identity in the LPM key. This would work by first looking up the remote identity in a separate (unique or shared) bpf hash map, resulting in an LPM identifier that would then be used as part of the LPM trie key instead of the remote identity, as today. The first hash map could be unique or shared (along the lines of the three paragraphs above), while the LPM map holding all the LPM tries would be shared among all the endpoints. Here the deduplication challenge is harder and must integrate with the algorithms in pkg/policy/mapstate.go. If deduplication is based on computed MapStateEntries, then named ports are not an issue either, as at the MapState level those have already been resolved; but this would mean that the policy map computation would be performed for each endpoint separately and then deduplicated after the fact. Agent code churn is much higher in this case.
Thanks for the great and detailed feedback, I truly appreciate it. Yes, you are right: I see how mapstate.go works and why named ports could be a trap. Here is why I think the shared LPM + overlay approach is better, and how we can do it without a massive rewrite of the Go code (I could be wrong though 😅).
If we just reuse the old maps, it is easy for the kernel but hard for Go. I propose using exactly two maps for the whole node, created once at startup and never deleted. Sharing the old maps still means creating new maps for different apps (frontend, backend, etc.), so in big clusters we could still hit Linux file limits, I guess (though I agree the limit is probably high enough). It would also make updates easier: if Pod A and Pod B share a map, and then you add a unique rule to Pod A, Go has to build a new map and swap it in live. Instead, we can just change one number in the Overlay map (Pod A points to Set 7 instead of Set 5). No swapping or reloading needed.
And I think we don't need to touch the "math" inside mapstate.go. We can just change what happens after it is done:
- Go finishes calculating the rules for Pod A.
- Instead of pushing to a private map, we send the rules to a new manager.
- This manager checks if another pod is already using these exact rules.
- If Yes: It gets the ID for those rules.
- If No: It saves the rules to the Shared LPM map and gets a new ID.
- We update the Overlay map so Pod A points to that ID.
And we avoid the named-port trap because we do our sharing after Go converts the names to numbers. I'll take another look tomorrow in case I missed something. What do you think?
Thank you again.
Signed-off-by: Tsotne Chakhvadze <tsotne@google.com>
Force-pushed 0ab9370 to 4df11fa
Thank you, I filed an issue in cilium and updated the CFP based on the template.
Shared policy LPM trie