Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Interpreted frames: UNREPORTED / UNRESOLVED frames due to reporter frames cache expiry #248

Open
Gandem opened this issue Nov 20, 2024 · 1 comment

Comments

@Gandem
Copy link
Contributor

Gandem commented Nov 20, 2024

What happened?

Initially spotted in #244 (comment), opening a separate issue since the fix might be more involved.

Running the profiler next to a Java process for more than an hour and noticed two things:
1. After one hour of runtime, stub frames have as function name UNRESOLVED, since their metadata is submitted only once since the addrToStubNameID never expires see: https://github.com/DataDog/opentelemetry-ebpf-profiler/blob/0599c7c681ee15f6c394c9b3acdf2dd45c7d8279/interpreter/hotspot/instance.go#L240
2. (not very critical) Once per hour, in a single reporting cycle, some of the frames would have as function name UNREPORTED, since the expiry logic expired an actively used file ID. The next reporting cycle fixes this, so this only affects one profile every hour.

Expected behavior: We shouldn't see UNRESOLVED / UNREPORTED frames after one hour.

Reproducing

On latest main (4f405f2):

  • Run the profiler via sudo ./ebpf-profiler -collection-agent=127.0.0.1:11000 -disable-tls -reporter-interval=1m
  • Wait for one hour (1h) after interpreted frames are seen for a Java service
  • Stub frames consistently have as function name UNRESOLVED (1.)
  • For a single profile every hour, some frames have UNREPORTED as their function name. (2.)

Potential fixes

The frames cache is an LRU cache of fileID -> map[libpf.AddressOrLineno]sourceInfo. The internal map has no limit on size.

We could use the newly added GetAndRefreshKeys() in go-freelru so that actively used entry in the frames cache are not expired after an hour. This would fix both issues, with the tradeoff that, for actively used files, entries in the inner map[libpf.AddressOrLineno]sourceInfo would never get GCed. This sounds reasonable since if a file is actively used, it would make sense to keep all potential line numbers in cache (I wouldn't expect the memory overhead to be significant due to rarely used line numbers).

Alternatively, if we want to ensure that the size of the inner map[libpf.AddressOrLineno]sourceInfo is capped, we could turn it into an LRU.

@rockdaboot
Copy link
Contributor

rockdaboot commented Nov 20, 2024

Thanks again for spotting this!

Using GetAndRefreshKeys() for outer cache is a good idea.

What concerns me a bit is that for long/forever running large executables there is currently the potential of increasing the inner map without limit. Leaf frames can be seen at any opcode address (a matter of probability).

Using an LRU instead of map for the inner caches to have a cap sound reasonable.
The problem I see is the fixed size of the LRU. The LRU is not able to grow and so we pre-allocate a relatively large amount of possibly never used memory for every fileid.

To mitigate that we could write a thin wrapper around the LRU (ResizableLRU) that has a cap but starts with e.g. 32 pre-allocated entries (starting size needs to be determined). Adding an element on a full LRU would lead to a resize until the cap is reached.

We also need to extend the eviction of expired items for the inner caches (search for r.frames.PurgeExpired() in otlp_reporter.go.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants