extend on summary and motivation #1

Open

kkourt wants to merge 1 commit into Andreagit97:tetragon-workload-policies from kkourt:main

Conversation

@kkourt kkourt commented Dec 18, 2025

No description provided.

Signed-off-by: Kornilios Kourtis <kornilios@isovalent.com>
kkourt (Author) commented Dec 18, 2025

@Andreagit97 👋🏼

Found some time to work on this. Here's a first PR with some (relatively small) additions.

Andreagit97 (Owner) left a comment


thank you for this!

The current eBPF implementation (one program + many maps per `TracingPolicy`) scales poorly in clusters with many per-workload policies. First, certain hook types have an attachment limit of `BPF_MAX_TRAMP_LINKS` ([38 on x86](https://elixir.bootlin.com/linux/v6.14.11/source/include/linux/bpf.h#L1138)). This means that we cannot load more than 38 policies on the same hook of this type. Second, for each loaded program we [check](https://github.com/cilium/tetragon/blob/fdd7f014e4172d09f4fcc250f8a5790e764428f8/bpf/process/policy_filter.h#L51-L54) whether the policy applies to the given workload. This wastes a lot of CPU cycles, especially in cases where processes match a small subset of the existing policies.

- P1 (Scaling): The current eBPF implementation (one program + many maps per `TracingPolicy`) scales poorly in clusters with many per-workload policies (attachment limits, redundant evaluations, memory growth).
(Note(kkourt): is there an argument to be made for reducing memory footprint on the eBPF maps?)
Andreagit97 (Owner)

I would say yes. Memory is the main blocker we have right now.
Even with all the optimizations we opened upstream, on nodes with many CPUs (e.g., 96), some per-CPU maps (e.g., `process_call_heap`, `string_maps_heap`, `data_heap`) bring the memory usage to something like 9 MB per policy. This is not ideal for our use case, where we want to create a TracingPolicy for each container inside each Pod.

kkourt (Author) commented Dec 19, 2025

I see.

In my mind, the approach to address the scalability issues of having one program per policy is to have a single program per hook (for all policies) and have them access different state (bpf maps). To reduce memory footprint, we need to take this approach one step further and have the different policies share maps (somehow).

Is that the general idea?

Andreagit97 (Owner)

> In my mind, the approach to address the scalability issues of having one program per policy is to have a single program per hook (for all policies) and have them access different state (bpf maps).

Yep, that's exactly what we ended up doing. We have a little agent that does exactly this; of course, it is easier since in our use case we hook just one single point (`security_bprm_creds_for_exec`) and we support just two operators (Equal, In).

kkourt (Author)

But this does not address the memory footprint issue, correct? Both approaches (one hooked program per policy, and one program per hook with different maps) use the same amount of memory in BPF maps. Or am I missing something?

Andreagit97 (Owner)

To solve the memory footprint issue, we used a strategy very similar to what we did in the POC cilium/tetragon#4279.
We have a unique eBPF prog with 2 maps:

  1. hash_map (key: cgroupID, value: policyID)
  2. hash of maps (key: policyID, value: hash_map(key: string, value: 0/1)). This is the hashset of values for each policy.

So, from the cgroup, we understand the associated policy, and then we check if the current binary is present in the hashset.

It is probably possible to achieve the same memory footprint with both approaches (one prog per hook / one hooked prog per policy). We chose the one-prog-per-hook approach because it is enough for us and allows us to use just one unique eBPF program for all the policies.

kkourt (Author)

OK, but this does not work for cases where policies match multiple workloads (which is a common enough use case that we cannot exclude it). How would the above work for the generic case?

kkourt (Author)

Moreover, I can see how we save memory with:

> hash_map (key: cgroupID, value: policyID)

But it's not clear to me how we save memory with:

> hash of maps (key: policyID, value: hash_map(key: string, value: 0/1))

Isn't the above the same in terms of memory footprint as what we have now (where we hold one map per policy)? Or am I missing something?

Andreagit97 (Owner)

> OK, but this does not work for cases where policies match multiple workloads

That's true, it doesn't cover this case.
In our use case, a cgroup can be associated with one and only one policy. This is something that cannot work with today's Tetragon generic TracingPolicy concept, unless we want to introduce a specific policy that enforces this constraint by default.

> Isn't the above the same in terms of memory footprint as what we have now (where we hold one map per policy)?

You are right, I should probably correct my previous statement:

> "To solve the memory footprint issue, we used a strategy very similar to what we did in the POC cilium/tetragon#4279. We have a unique eBPF prog with 2 maps: ..."

What we actually did to solve the memory issue was to get rid of all the maps we don't need for our use case, ending up with just the 2 maps reported above. I reported the most memory-consuming maps here: cilium/tetragon#4191 (comment). So yes, the memory saving doesn't come from that map structure, but from not using all the other maps that are unnecessary for our use case.
