Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

.eh_frame: Read user stacks from kernel context #1342

Merged
merged 2 commits into from
Feb 15, 2023

Conversation

javierhonduco
Copy link
Contributor

@javierhonduco javierhonduco commented Feb 14, 2023

Context

If a userspace process was compiled with frame pointers, we can rely on
bpf_get_stackid with the BPF_F_USER_STACK flag to fetch user stacks.
It does all the hard work, including making sure we can walk user stacks
from kernel context, such as when we are executing a system call.

When this happens, the kernel saves the userspace registers in memory. The
registers we get through the perf events are the current registers. As we
are in kernel context, these are registers that point to kernel data and
code. For this reason, we need a way to recover said stored registers.

If we were building a kernel extension we would use the task_pt_regs
macro. There's a new helper 1 to achieve exactly this, bpf_task_pt_regs,
but unfortunately we can't always rely on it's included only in v5.15 or
greater.

Proposed solution

We walk the internal kernel datastructures until we can read the
x86_64 registers we need. In case we are in kernel context, and current
thread is not a kernel process (kworker), we read the saved task's
registers and continue the normal unwinding process.

Notes

For some reason, not really sure about why; needs more investigation, writing this code has not been a walk in the park. The verifier didn't allow me to do direct pointer dereferences, that's why the code is littered with manual ones. Mostly complained with

avoid R0 invalid mem access 'scalar'

The other interesting bit is that I can't make the verifier pass when checking for the return value of set_initial_state and returning if there is an issue.

dereference of modified ctx ptr R6 off=32 disallowed

Additionally, the full vmlinux.h has been dumped as we need the definition of a bunch of structs.

Future changes / areas of improvement

  • Once we have automatic feature calibration we can use the new helper, if available, which will be more reliable;
  • This code is pretty fragile. Any changes in the running box, such as running with KASAN=on, will make it fail, resulting in missing kernel stacks;

Test Plan

Ran it for 1h without issues. The kernel stacks now appear and look reasonable.

image

kernel tests

=============
Test results:
=============
- ✅ 5.4
- ✅ 5.10
- ✅ 5.18
- ✅ 5.19

Signed-off-by: Francisco Javier Honduvilla Coto [email protected]

@javierhonduco javierhonduco requested a review from a team as a code owner February 14, 2023 17:40
@javierhonduco javierhonduco force-pushed the kstacks-with-no-fp-in-userspace branch 5 times, most recently from 9d5cac3 to 4740546 Compare February 14, 2023 17:47
@v-thakkar v-thakkar assigned v-thakkar and unassigned v-thakkar Feb 14, 2023
@v-thakkar v-thakkar self-requested a review February 14, 2023 18:05
Context
=======

If a userspace process was compiled with frame pointers, we can rely on
`bpf_get_stackid` with the `BPF_F_USER_STACK` flag to fetch user stacks.
It does all the hard work, including making sure we can walk user stacks
from kernel context, such as when we are executing a system call.

When this happens, the kernel saves the userspace registers in memory. The
registers we get through the perf events are the current registers. As we
are in kernel context, these are registers that point to kernel data and
code. For this reason, we need a way to recover said stored registers.

If we were building a kernel extension we would use the `task_pt_regs`
macro. There's a new helper [1] to achieve exactly this, `bpf_task_pt_regs`,
but unfortunately we can't always rely on it's included only in v5.15 or
greater.

Proposed solution
=================

We walk the internal kernel datastructures until we can read the
x86_64 registers we need. In case we are in kernel context, and current
thread is not a kernel process (kworker), we read the saved task's
registers and continue the normal unwinding process.

Notes
=====

For some reason, not really sure about why; needs more investigation,
writing this code has not been a walk in the park. The verifier didn't
allow me to do direct pointer dereferences, that's why the code is
littered with manual ones. Mostly complained with

> `avoid R0 invalid mem access 'scalar'`

The other interesting bit is that I can't make the verifier pass when
checking for the return value of `set_initial_state` and returning if
there is an issue.

> dereference of modified ctx ptr R6 off=32 disallowed

Additonally, the full `vmlinux.h` has been dumped as we need the
definition of a bunch of structs.

Future changes / areas of improvement
=====================================

- Once we have automatic feature calibration we can use the new helper, if
available, which will be more reliable;
- This code is pretty fragile. Any changes in the running box, such as
  running with KASAN=on, will make it fail, resulting in missing kernel
stacks;

Test Plan
=========

Ran it for 1h without issues. The kernel stacks now appear and look
reasonable. See PR for more.

**kernel tests**
```
=============
Test results:
=============
- ✅ 5.4
- ✅ 5.10
- ✅ 5.18
- ✅ 5.19
```

- [0]: https://github.com/torvalds/linux/blob/3d7cb6b04c3f3115719235cc6866b10326de34cd/arch/x86/include/asm/processor.h#L758-L763
- [1]: torvalds/linux@dd6e10f

Signed-off-by: Francisco Javier Honduvilla Coto <[email protected]>
Copy link
Member

@kakkoyun kakkoyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM ❤️

I need to understand in-depth the implications of using bpf_probe_read_kernel and bpf_get_current_task for #1131

Copy link
Member

@kakkoyun kakkoyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shall we add some sort of documentation to denote which version of the Kernel that vmlinux.h generated?

Have you ever seen a guideline around which version should be used? Minimum supported version? Maximum supported version?

@kakkoyun
Copy link
Member

LGTM ❤️

I need to understand in-depth the implications of using bpf_probe_read_kernel and bpf_get_current_task for #1131

Never mind. bpf_probe_read_kernel CO:RE compatible already. But now I wonder what actually the bpf_get_current_task_btf variation does? 🤔

Copy link
Contributor

@v-thakkar v-thakkar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, great work! 🎉

Not entirely related to this review but it's probably a good time to start thinking about having unit tests for our BPF program. While manual e2e tests are helping us right now, I think it'll be nice to make sure that some of the BPF code is behaving the way we intend it to. Cilium has the unit testing framework for their BPF datapath. Maybe we can have something similar: https://docs.cilium.io/en/latest/contributing/testing/bpf/#bpf-testing? WDYT?

Copy link
Member

@brancz brancz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Super clear and easy to understand, nice job! lgtm

@javierhonduco
Copy link
Contributor Author

I need to understand in-depth the implications of using bpf_probe_read_kernel and bpf_get_current_task for #1131

Yes, let's discuss this. Your work to have BTF is super relevant now that access kernel structures.

Have you ever seen a guideline around which version should be used? Minimum supported version? Maximum supported version?

Good question! As far as I know, as long as the architecture is matched, we can use any version we want that has everything we need. There's a lot of details here, so will gather them in the next week and get back to you.

But now I wonder what actually the bpf_get_current_task_btf variation does? 🤔

It seems to be mostly the same, but with extra metadata for BTF. I think it got introduced in [v5.11-rc1](https://github.com/torvalds/linux/releases/tag/v5.11-rc1), so it's best to use the other helper to not lock us in even more in modern kernels.

Not entirely related to this review but it's probably a good time to start thinking about having unit tests for our BPF program. While manual e2e tests are helping us right now, I think it'll be nice to make sure that some of the BPF code is behaving the way we intend it to. Cilium has the unit testing framework for their BPF datapath. Maybe we can have something similar: https://docs.cilium.io/en/latest/contributing/testing/bpf/#bpf-testing? WDYT?

I totally agree! This is something we've been discussing over the next couple of weeks. We need tests to check fo the correctness of the stacks. This is something that both Kemal and I are working on.

@javierhonduco
Copy link
Contributor Author

Thanks for your comments, folks! Really appreciate them. I'll merge and continue iterating on this, let's discuss the testing and BTF + CO-RE in the next few days 😄

@javierhonduco javierhonduco merged commit f327e97 into main Feb 15, 2023
@javierhonduco javierhonduco deleted the kstacks-with-no-fp-in-userspace branch February 15, 2023 10:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants