Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

telemetry: record an API call's LLB op digests #7068

Merged
merged 1 commit into from
Apr 11, 2024
Merged

Conversation

vito
Copy link
Contributor

@vito vito commented Apr 9, 2024

This data allows the UI to track where lazy operations came from so we can tie them back to the call site that installed them (i.e. withExec). This visualization angle is super useful when you're trying to gauge the "cost" of API calls.

Something like this (note the "shadow bars" - UI subject to change):

image

NOTE: As a prerequisite, I moved the ftp_proxy hackery back to Container.WithExec instead of doing it "just-in-time" in the Buildkit client. Otherwise the span ends up with a different digest than what we return, so they can't be correlated at all. The switch from Progrock => OpenTelemetry allows us to do this, I believe; we couldn't before because we had to pass the Progrock parent vertex ID, but the OpenTelemetry equivalent is handled for us by Buildkit itself.

(edit - side note: it seems like we could potentially remove the ftp_proxy hack entirely if we replace ServerID with the OpenTelemetry trace + span ID.)

@vito vito added this to the v0.11.1 milestone Apr 9, 2024
@vito vito requested review from jedevc and sipsma April 9, 2024 21:02
Comment on lines -309 to -312
// If we already solved for this execop in this dagger session, use
// it's uncached metadata to preserve the same vertex digest.
// This results in buildkit's solver de-duping the op, instead of
// the scheduler's edge merge.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So this was originally added because of #6505.

Because we're not using buildkit's progress stream this shouldn't be an issue anymore right?

We still would be doing more edge merging again because of this - since these are likely what cause inconsistent graph state issues, we might see more of those (though that would definitely help debugging it lol).

If there's an easy way to keep this hack in (I like the explicitness of it, instead of hiding the details in the solver), I'd prefer that, but not a blocker.

(edit - side note: it seems like we could potentially remove the ftp_proxy hack entirely if we replace ServerID with the OpenTelemetry trace + span ID.)

I like this - this is the best of all worlds - we get this feature, and we get rid of our nasty hack.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like this - this is the best of all worlds - we get this feature, and we get rid of our nasty hack.

Yeah, I really want to do this, but it probably makes the most sense to do it after #6914 and #6916 (cc @sipsma) since at that point we for sure don't need the ParentClientIDs value, and solely have the ServerID to figure out. At that point theoretically we just stop using a random ID for it and use something analogous to $traceparent instead. (Carefully pointing out we need trace ID + span ID, not just trace ID, to handle nesting.)

We'd be adding a hard dependency on OpenTelemetry at this layer, but that seems worth the trade-off to me.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have invested 0 time looking myself because on mobile atm, but how is the otel trace id propagated? Not necessarily opposed to using it as the server id but if we can just propagate the server id the same way we propagate the trace id that might be even better (just in the interest of not creating too much entanglement).

Also, the approach I'm adding in the CA cert PR of using a custom Executor rather than the shim should in the long term let us migrate everything in the shim and also be another route of avoiding the ftp proxy hack. Just noting for completeness.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sipsma We can almost certainly use the new custom Executor if we want to yeah. Right now Buildkit propagates it for us with its OpenTelemetry integration - somewhere around here I suppose (I never had to look).

FWIW, it seems worth considering making ServerID == TraceID+SpanID regardless. I briefly considered just doing it along the way, but had enough change going on and didn't bother. It doesn't seem like we gain a whole lot by making up a separate random string, vs. if we make it match, we can automatically correlate a ServerID to a trace for "free" which could come in handy when going through logs and such. Low conviction on that though, wanted to ask and now's my chance. 😁

This data allows the UI to track where the "lazy" workloads came from so
we can tie them back to the call site that installed them (i.e.
`withExec`).

Signed-off-by: Alex Suraci <[email protected]>
@vito vito merged commit 2751226 into dagger:main Apr 11, 2024
41 of 43 checks passed
@vito vito deleted the otel-llb branch April 11, 2024 22:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants