WIP: Proposal: Support Timestamped Profiling Events #594

felixge · 2024-10-31T13:24:08Z

🚧 WORK IN PROGRESS 🚧: This proposal is still in the early stages. I'm mostly looking for feedback from active participants in the profiling SIG before soliciting feedback from the wider community.

aalexand · 2024-10-31T15:44:11Z

opentelemetry/proto/profiles/v1development/eventprofile.proto

+  optional uint32 link_index = 3;
+
+  // The indices of the attributes associated with this event.
+  repeated uint32 attribute_indices = 4;


Attribute value units need to be specified somehow. In the current profiling format there is a sidecar field for that.

You mean the AttributeUnit messages? Yeah, we can add that - I don't recall deleting it intentionally.

Yes, that one.

aalexand · 2024-10-31T15:45:20Z

opentelemetry/proto/profiles/v1development/eventprofile.proto

+  // wire is the number of clock cycles since the start of the associated Clock.
+  // For subsequent Event messages for the same Clock, the time value is the
+  // number of clock cycles since the previous message on the wire.
+  int64 time = 5;


Maybe name this "timestamp" or some other name that makes it clearer that this is a point in time? Just "time" can be read as duration or a single event or aggregated duration.

I'm open to alternative naming. My worry with timestamp is that people would miss the delta encoding requirement. But that could happen with the current name as well I guess. I don't feel strongly about it.

Some ideas, with their pros and cons:

- An explicit name like timestamp_or_delta or time_or_delta.
- A oneof with timestamp and time_delta fields.

aalexand · 2024-10-31T15:46:20Z

opentelemetry/proto/profiles/v1development/eventprofile.proto

+  // The default value for the Event.value field. This is used when the value
+  // field is the same for all events of this type. For example CPU samples may
+  // typically represent the same value of 10ms.
+  uint64 default_value = 4;


This feels more like weight than "default value" to me.

I took the term value from pprof's Sample.value field. weight would also work for me. I don't feel strongly about it.

I see. No strong opinion, we can discuss later if anyone becomes more opinionated.

aalexand · 2024-10-31T15:52:56Z

opentelemetry/proto/profiles/v1development/eventprofile.proto

+
+// Clock describes the clock format used by the EventType. Clocks are expected
+// to be monotonic and include the time when a system is suspended.
+message Clock {


I like the idea of having the clock type to allow configurable and different resolution of timestamps in the same or different profiles, but one downside: does it make the simple, typical case more verbose? E.g. perhaps many collectors might end up just using Unix epoch microseconds with no attributes, but they will have to configure at least one clock.

Yeah, it's a tradeoff. I added the clocks for a few different reasons. If I had to put them in order, it would probably be:

1. Leap Seconds: IMO observability systems should avoid unix epoch timestamps until leap seconds are abolished in 2035. At least I would imagine that they make it significantly less fun to debug incidents that are caused by leap seconds. But unfortunately this is a problem for all other OpenTelemetry signals already.

2. Data volumes: Changing the resolution could save 1-2 bytes per event under varint encoding.

3. TSC Clocks: Those can run at frequencies above 1 GHz, so nanosecond resolution is not sufficient to store their values. OTOH, I can think of very few use cases that require sub-nanosecond resolution, especially for the stuff we target with OpenTelemetry.

That being said, the first problem is solvable without configurable clocks. If we decide 2 and 3 are not important enough, we should probably just require timestamps that originate from a non-leap unix epoch with nanosecond precision?

2 and 3 could be solved by having some way of specifying the resolution of the timestamp per event type? E.g. nanos, micros, or millis.

We should probably look if all other signals follow some specific approach with this. Being extra special is probably not worth it.
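To put rough numbers on the data-volume point in this thread, here is a minimal sketch of how the varint size of a time delta changes with the clock resolution. The 10ms interval is borrowed from the CPU sample example in the proposal; the `varint_len` helper is hypothetical and not part of any OpenTelemetry API.

```python
def varint_len(value: int) -> int:
    """Number of bytes a non-negative integer occupies as a protobuf varint."""
    if value < 0:
        raise ValueError("this sketch only covers non-negative values")
    # Each varint byte carries 7 bits of payload; zero still takes one byte.
    return max(1, (value.bit_length() + 6) // 7)


# A 10ms gap between two events, expressed at three resolutions.
INTERVAL_NS = 10_000_000
sizes = {
    "nanos": varint_len(INTERVAL_NS),
    "micros": varint_len(INTERVAL_NS // 1_000),
    "millis": varint_len(INTERVAL_NS // 1_000_000),
}
print(sizes)
```

For this particular interval the delta takes 4 bytes in nanoseconds, 2 in microseconds and 1 in milliseconds. Other intervals will land on different byte boundaries, so the saving depends on the actual event spacing.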

beaubelgrave · 2024-10-31T20:25:21Z

opentelemetry/proto/profiles/v1development/eventprofile.proto

+  uint32 event_type_index = 1;
+
+  // The index of the Stack associated with this event.
+  uint32 stack_index = 2;


It would be really nice if, in addition to the stack, we could also include the raw bytes of the event, along with some way to define the format of the Event (maybe in EventType). This goes back to the issue you and I discussed about storing JFR / DotNet-like events in this format in some way.

@beaubelgrave that's worth considering. protobuf's stdlib already has an Any type that can be used to embed raw protobuf messages. Maybe we could do something similar here?

Also: Link to the previous discussion for others reading here: open-telemetry/opentelemetry-specification#3979

Yeah, that Any type would work well, since it gives us a type url we can use to indicate format like we were kind of circling around in the issue discussion. I'd just want to ensure that we can put the payload in this Event object and also have an optional format specified for the payload in the EventType object.

The scenario is we dump raw bytes in the Event objects and label them as "perf_event" or "tracepoint" when we get our data from Linux tracepoints. Then we'd have the tracepoint format data within the EventType struct. That way we could have small payloads in the Event with a larger format description in the EventType that has a 1:N relationship with Event (if I read the patches right).
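A minimal sketch of the 1:N idea discussed in this thread: each Event carries only a small raw payload, and the format description lives once on the EventType. The `payload`, `payload_format` and `payload_fields` names are hypothetical, and a Python `struct` format string stands in for a real format description such as a Linux tracepoint format. None of this is in the proposed proto.

```python
import struct
from dataclasses import dataclass


@dataclass
class EventType:
    name: str
    payload_format: str    # hypothetical: struct format shared by all events
    payload_fields: tuple  # hypothetical: field names, in payload order


@dataclass
class Event:
    event_type_index: int
    payload: bytes         # hypothetical: raw bytes as captured


def decode(event: Event, event_types: list) -> dict:
    """Decode an event's raw payload using the format on its EventType."""
    event_type = event_types[event.event_type_index]
    values = struct.unpack(event_type.payload_format, event.payload)
    return dict(zip(event_type.payload_fields, values))


# One format description, many small events that reference it by index.
event_types = [EventType("tracepoint", "<ii", ("prev_pid", "next_pid"))]
events = [
    Event(0, struct.pack("<ii", 101, 202)),
    Event(0, struct.pack("<ii", 202, 303)),
]
decoded = [decode(e, event_types) for e in events]
print(decoded)
```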

florianl · 2024-11-04T07:49:52Z

opentelemetry/proto/profiles/v1development/eventprofile.md

+
+This will enable the following use cases.
+
+1. Richer analysis of On-CPU as well as Off-CPU data.


On- and Off-CPU profiling are just two profiling use cases. How about making it less specific:

Richer analysis of aggregated data, like On-CPU profiling, as well as event based data, like Off-CPU or memory allocation profiling.

> Richer analysis of aggregated data, like On-CPU profiling

Not sure I follow. The use case enabled by this proposal is event based CPU profiling rather than aggregation. We already have aggregation?

The OTel eBPF profiler could aggregate on-CPU profiling data. This is possible by setting Sample.value to a value other than 1 and providing a timestamp in Sample.timestamps_unix_nano for each Sample.value.
Currently, this aggregation is not done because it was rejected during the review process, so it is still low-hanging fruit for optimization.
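A minimal sketch of the aggregation described above, assuming samples are keyed by a stack identifier. `value` and `timestamps_unix_nano` are the field names mentioned in the comment; the `stack_index` key and the `aggregate` helper are hypothetical stand-ins for however the profiler identifies identical stacks.

```python
from dataclasses import dataclass, field


@dataclass
class Sample:
    stack_index: int  # hypothetical key for "same stack trace"
    value: int = 0
    timestamps_unix_nano: list = field(default_factory=list)


def aggregate(observations):
    """Fold (stack_index, timestamp) pairs into one Sample per stack.

    Each observation bumps value by one and appends its timestamp, so
    value always equals the number of recorded timestamps.
    """
    by_stack = {}
    for stack_index, timestamp in observations:
        sample = by_stack.setdefault(stack_index, Sample(stack_index))
        sample.value += 1
        sample.timestamps_unix_nano.append(timestamp)
    return list(by_stack.values())


samples = aggregate([(1, 100), (2, 150), (1, 200)])
print(samples)
```

The stack reference is stored once per distinct stack while every timestamp is still kept, which is where the size saving would come from.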

florianl · 2024-11-04T11:47:53Z

opentelemetry/proto/profiles/v1development/eventprofile.proto

+import "opentelemetry/proto/common/v1/common.proto";
+
+// Profile holds a collection of events that represent a profile.
+message Profile {


Did you consider having a single event_type for a message Profile?
For message Profile with a single event_type processing could look like this:

flowchart TD
    A[Shared Envelope] --> B{" "}
    B --> D[Profile for on-CPU profiling]
    B --> E[Profile for off-CPU profiling]
    B --> F[Profile for memory allocation profiling]

The string table, attribute table and other shared data should then be part of the Shared Envelope (I'm bad at naming things 🙏). This would allow faster handling of profiling data by processors based on event_type. With the suggested approach, processors have to walk all Profile.events to handle a specific event_type.

That's an interesting idea. I can see how this could simplify things. I'll think about this a bit more!
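The processing difference raised in this thread can be sketched as follows. The `Event` class only mirrors the two index fields from the diff, and both helper functions are hypothetical illustrations rather than proposed APIs.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Event:
    event_type_index: int
    stack_index: int


def select_from_mixed(events, wanted_type):
    """Mixed layout: a processor scans every event to find one type."""
    return [e for e in events if e.event_type_index == wanted_type]


def split_by_type(events):
    """Single-type layout: partition once, so each group maps to one Profile."""
    groups = defaultdict(list)
    for event in events:
        groups[event.event_type_index].append(event)
    return dict(groups)


events = [Event(0, 1), Event(1, 1), Event(0, 2), Event(1, 3)]
off_cpu_scan = select_from_mixed(events, 1)  # touches all four events
profiles = split_by_type(events)             # later lookups are direct
print(off_cpu_scan == profiles[1])
```

With one event type per Profile the partitioning cost is paid once by the producer, at the price of moving the shared tables into an outer message.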

tsint · 2024-11-05T03:48:09Z

opentelemetry/proto/profiles/v1development/eventprofile.md

+    {
+        event_type_index: 1             # thread_state
+        stack_index:      1             # main;foo;sleep
+        time:             5_000_000     # 5ms since previous event


What do you think about the stall profiling mentioned in Brendan Gregg's blog on AI flame graphs (https://www.brendangregg.com/blog/2024-10-29/ai-flame-graphs.html)? If this kind of flame graph needs to be supported, there are multiple samples taken in the sleep state, which might cause recording confusion or data inconsistencies.

IMHO, if we had a duration in addition to time, we could make off_cpu and stall events clearer, i.e. the stall started at time X and lasted for Y. This gives a start/stop. This could be stored in an attribute, but it feels somewhat wasteful to need an attribute per duration value.

Is the value of the event always considered a duration? Oftentimes I have events where the value is less time based (like memory allocations). But for off-CPU / stalls, the value is more duration based.

duration and value are separate fields in the current proposal, so I think the use cases brought up here should be covered? https://github.com/open-telemetry/opentelemetry-proto/pull/594/files/7d85bba21098f7950813d9717f06e86865da0726#diff-4a0777f946b5d445b17b64906a2ee5b121a070ee499189a656988ad0df350e11R81-R94

Yeah, sorry I missed that! Definitely covers what I was asking for, thanks!

tsint · 2024-11-05T04:14:59Z

opentelemetry/proto/profiles/v1development/eventprofile.md

+strings: [
+    "",              # index 0
+    "allocations",   # index 1
+    "liveheap",      # index 2


I think it's necessary to differentiate between virtual memory, physical memory, and situations like page faults. Typically, virtual memory consumed is more than physical memory because some memory regions might be zeroed and never referenced. Some might be 'copy-on-write'.

Yeah. This needs to be clarified, probably via semantic conventions. But the example given here is allocations profiled by the Go runtime. Those are always measured in terms of virtual memory. I wrote up more details on this here.

beaubelgrave · 2024-11-05T17:38:40Z

opentelemetry/proto/profiles/v1development/eventprofile.md

+    {
+        event_type_index: 0  # cpu_sample (can be omitted on the wire)
+        stack_index: 2       # main;foo
+        time:        300_000 # 0.3ms since previous event


Why is time based on previous event? This implies we have events in perfect time order. Often during collection, especially off-CPU (cswitch) events, the start of the event is not in-order of when it completes. This would cause collectors to sort potentially large sets of data (cswitches can be massive).

Can we just have an absolute time option vs the delta time encoding scheme?

I imagine delta encoding is used to reduce payload size

The purpose of delta encoding is to reduce payload sizes. time is an int64 right now, so negative values are possible. If we frequently need to deal with unordered events, we could define time as an sint64, which encodes negative numbers more efficiently. However, that comes at the expense of encoding positive numbers.

Anyway, I don't feel very strongly about delta encoding. I'm planning to do some benchmarks later to provide some evidence (or lack of evidence) that it's going to be beneficial.
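To make the int64 versus sint64 trade-off concrete, here is a minimal sketch of the two protobuf wire encodings. The helper names are hypothetical; the sizes follow from the protobuf varint and ZigZag rules, not from any measurement of real profiles.

```python
MASK64 = 0xFFFFFFFFFFFFFFFF


def varint_len(unsigned: int) -> int:
    """Bytes needed to encode an unsigned integer as a varint (7 bits per byte)."""
    return max(1, (unsigned.bit_length() + 6) // 7)


def int64_len(value: int) -> int:
    # int64 writes negatives as 64-bit two's complement, so they take 10 bytes.
    return varint_len(value & MASK64)


def zigzag(value: int) -> int:
    # sint64 maps 0, -1, 1, -2, ... to 0, 1, 2, 3, ... before varint encoding.
    return ((value << 1) ^ (value >> 63)) & MASK64


def sint64_len(value: int) -> int:
    return varint_len(zigzag(value))


for delta in (-300_000, 300_000, 100):
    print(delta, int64_len(delta), sint64_len(delta))
```

A negative delta of 0.3ms in nanoseconds costs 10 bytes as int64 but 3 bytes as sint64. The cost on the positive side is one fewer usable bit: 100 fits in 1 byte as int64 but needs 2 bytes as sint64.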

64-bit integers are already varint encoded, so if they were deltas off of a stable epoch (like the profile start time), they would be quite small even without deltas between each other. I get why we would want delta encoding, but it comes at a sorting cost that is likely really bad for off-CPU data sources (since they complete at random times compared to the start time).
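The two schemes compared in this thread can be sketched side by side. This is a minimal illustration with made-up timestamps; the function names are hypothetical and the event times are not taken from a real collector.

```python
def deltas_from_previous(times, clock_start):
    """Proposal-style encoding: each value is relative to the previous event."""
    encoded, previous = [], clock_start
    for t in times:
        encoded.append(t - previous)
        previous = t
    return encoded


def decode_deltas(deltas, clock_start):
    decoded, current = [], clock_start
    for d in deltas:
        current += d
        decoded.append(current)
    return decoded


def offsets_from_start(times, profile_start):
    """Alternative: each value is relative to a stable epoch."""
    return [t - profile_start for t in times]


start = 1_000_000_000  # made-up profile start, in nanoseconds
# Events in the order they complete, which is not timestamp order.
arrival = [start + 5_000_000, start + 2_000_000, start + 9_000_000, start + 7_000_000]

deltas = deltas_from_previous(arrival, start)
offsets = offsets_from_start(arrival, start)
sorted_deltas = deltas_from_previous(sorted(arrival), start)
print(deltas, offsets, sorted_deltas)
```

Unsorted input produces negative deltas, so the producer either sorts first or pays for a signed encoding. Offsets from the profile start stay non-negative in any order, but they grow with the length of the profile instead of with the gap between events.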

petethepig · 2024-11-01T16:09:23Z

opentelemetry/proto/profiles/v1development/eventprofile.md

+Most of the current OpenTelemetry profiling format, including the concept of aggregating stack traces, is inherited from pprof.
+However, given that we have [decided](https://github.com/open-telemetry/opentelemetry-proto/issues/567#issuecomment-2286565449) that strict pprof compatibility is not a goal for OpenTelemetry, we are now free to design a more flexible and extensible format that can be used for a wider range of profiling use cases than pprof.
+
+One use case that recently came up is the [collection of Off-CPU profiling data for the ebpf profiler](https://github.com/open-telemetry/opentelemetry-ebpf-profiler/pull/144). The attendees of the [Sep 5](https://docs.google.com/document/d/19UqPPPlGE83N37MhS93uRlxsP1_wGxQ33Qv6CDHaEp0/edit?tab=t.0#heading=h.lenvx4xd62c6) SIG meeting agreed that aggregated profiling data is not ideal for this use case, as it increases the difficulty to reason about the collected data and to correlate it with application behavior. This is especially true when it comes to the analysis of tail latency problems caused by outliers. So instead of aggregation, it is much more useful to record this data as individual events with timestamps as well as additional context such as thread id, trace id and span id. This becomes even more powerful when On-CPU samples are also recorded with timestamps, as it allows users to identify spikes and stalls of CPU activity as well as Off-CPU exemplars that explain the cause of the stalls and resulting latency (or lack of throughput).


> aggregated profiling data is not ideal for this use case

FWIW, I used aggregated off-CPU profiles in the past (offcputime from bcctools) and I found them quite useful for figuring out where applications are waiting for IO.

Ack. I'm not trying to suggest that aggregated Off-CPU profiles are useless. But they tend to be dominated by idle threads, so it can be tricky to navigate them. Additionally they can't really be used to analyze tail latencies since all they show are averages.

FWIW, we use a mix of both. Overlaying active CPU over aggregate wait CPU shows where the CPU is trying to execute and getting delayed. From there we allow zooming into individual time sections involved in the aggregate to debug thread interactions, etc. We need both within the same profile to achieve the same thing we have in our private formats.

petethepig · 2024-11-08T21:54:29Z

opentelemetry/proto/profiles/v1development/eventprofile.md

+    {
+        event_type_index: 0  # cpu_sample (can be omitted on the wire)
+        stack_index: 2       # main;foo
+        time:        300_000 # 0.3ms since previous event


I imagine delta encoding is used to reduce payload size

- Introduce a first-class Stack message type and lookup table.
- Replace location index range based stack trace encoding on Sample with a single stack_index reference.
- Remove the location_indices lookup table.

The primary motivation is laying the groundwork for [timestamp based profiling][timestamp proposal], where the same stack trace needs to be referenced much more frequently compared to aggregation based on low cardinality attributes.

Timestamp based profiling is also expected to be used with the upcoming [Off-CPU profiling][off-cpu pr] feature in the eBPF profiler. Off-CPU stack traces have a different distribution compared to CPU samples. In particular, stack traces are much more repetitive because they only occur at call sites such as syscalls. For the same reason it is also uncommon to see a stack trace that is a root-prefix of a previously observed stack trace. We might need to revisit the [previous benchmarks][benchmarks] to confirm these claims.

The secondary motivation is simplicity. Arguably the proposed change here will make it easier to write exporters, processors as well as receivers.

It seems like we had rough consensus around this change in previous SIG meetings, and it seems like a good incremental step to make progress on the timestamp proposal.

[timestamp proposal]: #594
[off-cpu pr]: open-telemetry/opentelemetry-ebpf-profiler#196
[benchmarks]: https://docs.google.com/spreadsheets/d/1Q-6MlegV8xLYdz5WD5iPxQU2tsfodX1-CDV1WeGzyQ0/edit?gid=2069300294#gid=2069300294

wip

7d85bba

aalexand reviewed Oct 31, 2024

beaubelgrave reviewed Oct 31, 2024

florianl reviewed Nov 4, 2024

tsint reviewed Nov 5, 2024

beaubelgrave reviewed Nov 5, 2024

petethepig reviewed Nov 8, 2024

Update opentelemetry/proto/profiles/v1development/eventprofile.md

d28eb12

Co-authored-by: Florian Lehner <[email protected]>

felixge mentioned this pull request Jan 9, 2025

Simplify profile stack trace representation #615

Closed

tigrannajaryan added the spec:profiles label Mar 26, 2025

felixge mentioned this pull request Apr 17, 2025

Profiles: simplify profile stack trace representation #645

Draft

