extend on summary and motivation #1
kkourt wants to merge 1 commit into Andreagit97:tetragon-workload-policies
Conversation
Signed-off-by: Kornilios Kourtis <kornilios@isovalent.com>
@Andreagit97 👋🏼 Found some time to work on this. Here's a first PR with some (relatively small) additions.
The current eBPF implementation (one program + many maps per `TracingPolicy`) scales poorly in clusters with many per-workload policies. First, certain hook types have an attachment limit of `BPF_MAX_TRAMP_LINKS` ([38 on x86](https://elixir.bootlin.com/linux/v6.14.11/source/include/linux/bpf.h#L1138)). This means that we cannot load more than 38 policies on the same hook of this type. Second, for each loaded program we [check](https://github.com/cilium/tetragon/blob/fdd7f014e4172d09f4fcc250f8a5790e764428f8/bpf/process/policy_filter.h#L51-L54) whether the policy applies to the given workload. This wastes a lot of CPU cycles, especially in cases where processes match a small subset of the existing policies.
- P1 (Scaling): The current eBPF implementation (one program + many maps per `TracingPolicy`) scales poorly in clusters with many per-workload policies (attachment limits, redundant evaluations, memory growth).
  (Note(kkourt): is there an argument to be made for reducing memory footprint on the eBPF maps?)
---
I would say yes. Memory is the main blocker we have right now.
Even with all the optimizations we contributed upstream, on nodes with many CPUs (e.g., 96) there are some per-CPU maps (e.g., `process_call_heap`, `string_maps_heap`, `data_heap`) that bring memory usage to roughly 9 MB per policy. This is not ideal for our use case, where we want to create a TracingPolicy for each container inside each Pod.
---
I see.
In my mind, the way to address the scalability issues of one program per policy is to have a single program per hook (for all policies) and have it access per-policy state (BPF maps). To reduce the memory footprint, we would need to take this approach one step further and have the different policies share maps (somehow).
Is that the general idea?
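To make the "single program per hook, shared state" idea concrete, here is a minimal userspace sketch. This is not the actual Tetragon implementation: the names (`policies_on_hook`, `policy_filter`) are hypothetical, and Python dicts stand in for BPF maps. The point is that one handler per hook consults state keyed by policy ID, instead of attaching one program per policy:

```python
# Sketch: one program per hook; per-policy state lives in shared maps
# keyed by policy ID. Dicts stand in for BPF maps (hypothetical names).

# Which policy IDs are attached to each hook.
policies_on_hook = {"security_bprm_creds_for_exec": [7, 9]}

# Shared filter map: (policyID, workloadID) -> policy applies to workload.
policy_filter = {(7, "pod-a"): True, (9, "pod-b"): True}

def matching_policies(hook: str, workload: str) -> list[int]:
    """Single handler for all policies on a hook: return the IDs that match."""
    return [pid for pid in policies_on_hook.get(hook, [])
            if policy_filter.get((pid, workload))]

print(matching_policies("security_bprm_creds_for_exec", "pod-a"))  # [7]
```

This still iterates the policies attached to the hook, but one program attachment serves all of them, sidestepping the `BPF_MAX_TRAMP_LINKS` limit.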
---
> In my mind, the approach to address the scalability issues of having one program per policy is to have a single program per hook (for all policies) and have them access different state (bpf maps).
Yep, that's exactly what we ended up doing. We have a small agent that does exactly this. Of course, our case is easier, since we hook a single point (`security_bprm_creds_for_exec`) and support just two operators (`Equal`, `In`).
---
But this does not address the memory footprint issue, correct? Both approaches (one hooked program per policy, and one program per hook with different maps) use the same amount of memory in BPF maps. Or am I missing something?
---
To solve the memory footprint issue, we used a strategy very similar to what we did in the POC cilium/tetragon#4279.
We have a single eBPF program with 2 maps:
- hash_map (key: cgroupID, value: policyID)
- hash of maps (key: policyID, value: hash_map(key: string, value: 0/1)) - the hash set of allowed values for each policy
So, from the cgroup we find the associated policy, and then we check whether the current binary is present in the hash set.
It is probably possible to achieve the same memory footprint with either approach (one program per hook, or one hooked program per policy). We chose the one-program-per-hook approach because it is enough for us and lets us use a single eBPF program for all the policies.
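The two-map lookup described above can be sketched in userspace Python, with dicts standing in for the BPF hash maps (the names `cgroup_to_policy` and `policy_values` are illustrative, not taken from the actual implementation):

```python
# Map 1 (hash_map): cgroupID -> policyID.
cgroup_to_policy = {1001: 7, 1002: 7, 1003: 9}

# Map 2 (hash of maps): policyID -> {binary -> 1}, the per-policy hash set.
policy_values = {
    7: {"/usr/bin/curl": 1, "/usr/bin/wget": 1},
    9: {"/bin/sh": 1},
}

def policy_allows(cgroup_id: int, binary: str) -> bool:
    """Two O(1) lookups: resolve the cgroup's policy, then test membership."""
    policy_id = cgroup_to_policy.get(cgroup_id)
    if policy_id is None:
        return False  # no policy associated with this cgroup
    return binary in policy_values.get(policy_id, {})

print(policy_allows(1001, "/usr/bin/curl"))  # True: policy 7 lists curl
print(policy_allows(1003, "/usr/bin/curl"))  # False: policy 9 does not
```

Note the built-in assumption this thread goes on to discuss: each cgroup maps to exactly one policy, so the check is two constant-time lookups per event instead of one evaluation per loaded policy.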
---
OK, but this does not work for cases where policies match multiple workloads (which is a common enough use case that we cannot exclude it). How would the above work for the generic case?
---
Moreover, I can see how we save memory with:
> hash_map (key: cgroupID, value: policyID)

But it's not clear to me how we save memory with:
> hash of maps (key: policyID, value: hash_map(key: string, value: 0/1))

Isn't the above the same, in terms of memory footprint, as what we have now (where we hold one map per policy)? Or am I missing something?
---
> OK, but this does not work for cases where policies match multiple workloads

That's true, it doesn't cover this case.
In our use case, a cgroup can be associated with one and only one policy. This cannot work with today's generic Tetragon `TracingPolicy` concept, unless we introduce a specific policy type that enforces this constraint by default.
> Isn't the above the same, in terms of memory footprint, as what we have now (where we hold one map per policy)?

You are right, I should correct my previous statement:

> To solve the memory footprint issue, we used a strategy very similar to what we did in the POC cilium/tetragon#4279. We have a unique ebpf prog with 2 maps: ...

What we actually did to solve the memory issue was to get rid of all the maps we don't need for our use case, ending up with just the 2 maps reported above. I reported the most memory-consuming maps here: cilium/tetragon#4191 (comment). So yes, the memory saving doesn't come from that map usage, but from not using all the other maps that are unnecessary for us.