telemetry: record an API call's LLB op digests #7068

vito · 2024-04-09T21:00:01Z

This data allows the UI to track where lazy operations came from, so we can tie them back to the call site that installed them (i.e. `withExec`). This visualization is especially useful when you're trying to gauge the "cost" of API calls.

Something like this (note the "shadow bars"; UI subject to change): [screenshot not included]

NOTE: As a prerequisite, I moved the `ftp_proxy` hackery back to `Container.WithExec` instead of doing it "just-in-time" in the Buildkit client. Otherwise the span ends up with a different digest than the one we return, so the two can't be correlated at all. I believe the switch from Progrock to OpenTelemetry is what allows this: previously we had to pass the Progrock parent vertex ID ourselves, whereas the OpenTelemetry equivalent is handled for us by Buildkit itself.
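To illustrate why the op has to be finalized before its digest is recorded, here is a minimal sketch that uses only the Go standard library. The `op` type, its encoding, and the `callSites` index are hypothetical stand-ins, not Dagger or Buildkit APIs; real LLB digests are, to my understanding, computed over the marshaled protobuf rather than this toy encoding.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strings"
)

// op is a hypothetical stand-in for an LLB exec op.
type op struct {
	Args []string
	Env  []string
}

// digest hashes a deterministic encoding of the op.
// The encoding is illustrative only; it is not the LLB wire format.
func (o op) digest() string {
	enc := strings.Join(o.Args, "\x00") + "\x01" + strings.Join(o.Env, "\x00")
	sum := sha256.Sum256([]byte(enc))
	return "sha256:" + hex.EncodeToString(sum[:])
}

// callSites maps an op digest to the API call that installed the op.
type callSites map[string]string

func (c callSites) record(call string, o op) { c[o.digest()] = call }

func (c callSites) lookup(o op) (string, bool) {
	call, ok := c[o.digest()]
	return call, ok
}

func main() {
	sites := callSites{}

	// Finalize the op (including any proxy environment) before recording it.
	final := op{Args: []string{"go", "build"}, Env: []string{"ftp_proxy=placeholder"}}
	sites.record("Container.withExec", final)

	call, ok := sites.lookup(final)
	fmt.Println(call, ok)

	// Changing the op after recording changes its digest, so the lookup misses.
	late := op{Args: final.Args, Env: append([]string{"extra=1"}, final.Env...)}
	_, ok = sites.lookup(late)
	fmt.Println(ok)
}
```

The second lookup misses because the environment was changed after the digest was recorded, which is the mismatch described above when the proxy variables are injected "just-in-time".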

(edit - side note: it seems like we could potentially remove the ftp_proxy hack entirely if we replace ServerID with the OpenTelemetry trace + span ID.)

jedevc · 2024-04-10T13:03:25Z

engine/buildkit/client.go

- // If we already solved for this execop in this dagger session, use
- // it's uncached metadata to preserve the same vertex digest.
- // This results in buildkit's solver de-duping the op, instead of
- // the scheduler's edge merge.


So this was originally added because of #6505.

Because we're not using buildkit's progress stream, this shouldn't be an issue anymore, right?

We would still be doing more edge merging because of this, though. Since edge merges are likely what cause the inconsistent graph state issues, we might see more of those (though that would definitely help debugging it lol).

If there's an easy way to keep this hack in (I like the explicitness of it, instead of hiding the details in the solver), I'd prefer that, but not a blocker.

> (edit - side note: it seems like we could potentially remove the ftp_proxy hack entirely if we replace ServerID with the OpenTelemetry trace + span ID.)

I like this - this is the best of all worlds - we get this feature, and we get rid of our nasty hack.

vito · 2024-04-10

> I like this - this is the best of all worlds - we get this feature, and we get rid of our nasty hack.

Yeah, I really want to do this, but it probably makes the most sense to do it after #6914 and #6916 (cc @sipsma) since at that point we for sure don't need the ParentClientIDs value, and solely have the ServerID to figure out. At that point theoretically we just stop using a random ID for it and use something analogous to $traceparent instead. (Carefully pointing out we need trace ID + span ID, not just trace ID, to handle nesting.)

We'd be adding a hard dependency on OpenTelemetry at this layer, but that seems worth the trade-off to me.
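As a rough illustration of an ID "analogous to $traceparent", the sketch below formats and parses a value in the W3C Trace Context `traceparent` layout (version, trace ID, span ID, flags). The `serverID` type and `parseServerID` function are hypothetical names invented for this example; nothing here reflects how Dagger actually builds its ServerID.

```go
package main

import (
	"encoding/hex"
	"errors"
	"fmt"
	"strings"
)

// serverID is a hypothetical identifier that reuses the trace context.
type serverID struct {
	TraceID [16]byte
	SpanID  [8]byte
}

// String renders the ID in the traceparent layout. The version is fixed to
// "00" and the flags to "01" (sampled) to keep the sketch short.
func (s serverID) String() string {
	return fmt.Sprintf("00-%s-%s-01",
		hex.EncodeToString(s.TraceID[:]), hex.EncodeToString(s.SpanID[:]))
}

// parseServerID reads a traceparent-style value; the flags field is ignored.
func parseServerID(v string) (serverID, error) {
	var id serverID
	parts := strings.Split(v, "-")
	if len(parts) != 4 || parts[0] != "00" {
		return id, errors.New("unsupported traceparent value")
	}
	trace, err := hex.DecodeString(parts[1])
	if err != nil || len(trace) != len(id.TraceID) {
		return id, errors.New("invalid trace ID")
	}
	span, err := hex.DecodeString(parts[2])
	if err != nil || len(span) != len(id.SpanID) {
		return id, errors.New("invalid span ID")
	}
	copy(id.TraceID[:], trace)
	copy(id.SpanID[:], span)
	return id, nil
}

func main() {
	// Example value taken from the W3C Trace Context specification.
	parent, err := parseServerID("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
	if err != nil {
		panic(err)
	}

	// A nested call shares the trace ID but gets its own span ID.
	nested := parent
	nested.SpanID = [8]byte{1, 2, 3, 4, 5, 6, 7, 8}

	fmt.Println(parent.TraceID == nested.TraceID) // same trace
	fmt.Println(parent == nested)                 // different IDs overall
	fmt.Println(nested)
}
```

With only the trace ID, `parent` and `nested` would be indistinguishable, which is why the span ID is needed to handle nesting.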

sipsma · 2024-04-10

Have invested 0 time looking myself because I'm on mobile atm, but how is the otel trace ID propagated? Not necessarily opposed to using it as the server ID, but if we can just propagate the server ID the same way we propagate the trace ID, that might be even better (just in the interest of not creating too much entanglement).

Also, the approach I'm adding in the CA cert PR of using a custom Executor rather than the shim should in the long term let us migrate everything in the shim and also be another route of avoiding the ftp proxy hack. Just noting for completeness.

vito · 2024-04-11

@sipsma We can almost certainly use the new custom Executor if we want to, yeah. Right now Buildkit propagates it for us with its OpenTelemetry integration - somewhere around here I suppose (I never had to look).

FWIW, it seems worth considering making ServerID == TraceID+SpanID regardless. I briefly considered just doing it along the way, but had enough change going on and didn't bother. It doesn't seem like we gain a whole lot by making up a separate random string, vs. if we make it match, we can automatically correlate a ServerID to a trace for "free" which could come in handy when going through logs and such. Low conviction on that though, wanted to ask and now's my chance. 😁

vito added this to the v0.11.1 milestone Apr 9, 2024

vito requested review from jedevc and sipsma April 9, 2024 21:02

vito force-pushed the otel-llb branch from 0aca5f2 to a60f779 Compare April 10, 2024 01:46

jedevc reviewed Apr 10, 2024

jedevc approved these changes Apr 10, 2024

vito force-pushed the otel-llb branch from a60f779 to 51efe32 Compare April 11, 2024 15:20

telemetry: record an API call's LLB op digests

ebd264b

This data allows the UI to track where the "lazy" workloads came from so we can tie them back to the call site that installed them (i.e. `withExec`).

Signed-off-by: Alex Suraci <[email protected]>

vito force-pushed the otel-llb branch from 51efe32 to ebd264b Compare April 11, 2024 16:38

vito merged commit 2751226 into dagger:main Apr 11, 2024
41 of 43 checks passed

vito deleted the otel-llb branch April 11, 2024 22:28
