Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adding a means to denote preference for Link vs Span in given Context #4054

Open
hibachrach opened this issue May 19, 2024 · 10 comments
Open
Labels
spec:trace Related to the specification/trace directory triage:deciding:community-feedback Open to community discussion. If the community can provide sufficient reasoning, it may be accepted

Comments

@hibachrach
Copy link

hibachrach commented May 19, 2024

First time contributor here--please let me know if there's anything I need to correct procedurally!

What are you trying to achieve?

I'm hoping to extend the API description for Context as it pertains to Traces (or perhaps change the API description for Traces?)

Here's the problem I'm trying to solve:

Consider a user flow in which a file conversion service accepts an HTTP request. The user's intent is to upload a file to convert it from one file type to another (e.g. .pdf -> .png). For infrastructure performance reasons, this work is enqueued in a messaging system/background job processing system (e.g. Sidekiq in the Ruby ecosystem). As part of handling the same request, a similar message/background job is sent to the same system. This flow is depicted below:

graph TD
    A[Client] -->|Requests| B[Request Handler]
    B -->|Enqueues| C[Background Job for File Conversion]
    B -.->|Enqueues| D[Background Job for Analytics]
    style D stroke-dasharray: 5 5
Loading

From a user's perspective, only the solid-line nodes in the above graph are important. The dashed lines are merely an implementation detail. Similarly, only the Spans related to the solid-line nodes are, to me as the engineer monitoring this system, significant. If the analytics-related message is not complete for another hour, that does not matter to me very much.

This maps neatly to the notion of child Spans & Links: for a step in a flow where it is all really part of one high-level request, one would hope to have that represented as a child Span. It is truly just part of the overall Trace (or, at least, what I would expect a Trace to capture). On the other hand, the analytics message is relatively detached. While a Link could point me to its source, it is not something that is part of the user journey.

What did you expect to see?

Ideally, this is something that you'd be able to hint at via Context. Here's what the above would look like via pseudocode:

function requestHandler():
  Context.current.setPropagationStyleHintTo('Child'):
    BackgroundJobSystem.enqueue(FileConversionTask)
  BackgroundJobSystem.enqueue(AnalyticsTask)

This way, the Otel instrumentation for BackgroundJobSystem would know how to relate the next Span it creates to the current Span. This hint would reset within further nested spans (e.g. any tasks that FileConversionTask may enqueue would, by default, be connected via Links.

Some open questions that I don't have the answer to just yet:

  • What makes sense as the default propagation style? My guess is this would be defined by the instrumentation library. E.g. Sidekiq's uses Links by default.
  • Should an instrumentation library be permitted to disregard this hint?
    • I would think yes: in the Sidekiq example, there would necessarily be the creation of both the PRODUCER and CONSUMER Spans and you'd want it to apply to both. So the instrumentation library needs to be given some flexibility in interpretation. Also, making it a requirement would necessarily be a breaking change I would think?

Additional context.

I was directed here from open-telemetry/opentelemetry-ruby-contrib#991 as a contributor thought this may make sense more generally.

@hibachrach hibachrach added the spec:trace Related to the specification/trace directory label May 19, 2024
@dmathieu
Copy link
Member

dmathieu commented May 21, 2024

Something similar is already being done by instrumentation libraries.
For example, the WithPublicEndpoint option in Go:
https://github.com/open-telemetry/opentelemetry-go-contrib/blob/a91e60beace7c1d96302fe359050d705c4723e72/instrumentation/net/http/otelhttp/handler.go#L140-L146

Ruby has this proposal open for the Rack instrumentation: open-telemetry/opentelemetry-ruby-contrib#837

It seems your pseudo-code proposal would be the same as:

link = OpenTelemetry::Trace::Link.new(OpenTelemetry::Trace.current_span.context)
span = tracer.start_root_span(
  "FileConversion",
  with_parent: OpenTelemetry::Context.empty,
  links: [link]
)

It's a bit more verbose, but nothing a simple helper method can't resolve. Does it really need a spec change?

@hibachrach
Copy link
Author

Hmm, perhaps I'm misunderstanding, but I don't think your code snippet isn't equivalent to what I'm describing. Your code allows one to sever the parent-child relationship between Spans that is present by default, but it doesn't allow for the opposite (converting what would be a Link by default into a parent-child relationship).

@trask
Copy link
Member

trask commented May 21, 2024

cc @open-telemetry/semconv-messaging-approvers

@dyladan dyladan added the triage:deciding:community-feedback Open to community discussion. If the community can provide sufficient reasoning, it may be accepted label May 21, 2024
@pyohannes
Copy link
Contributor

@hibachrach If I understand correctly, you enqueue messages (or background jobs), and at enqueue time you want to specify whether the span for dequeuing the message (or running the background job) is created as a child or a link of the current span.

This effectively requires two parts:

  1. A standardized way for propagating this decision or hint. For W3C, this might be part of the tracestate header.
  2. On instrumentation side, producers need to populate this hint in the outgoing context, consumers need to extract and honor it.

This is definitely a valid use case, however, I'm wondering whether the need is widespread enough to merit a specification change and implementations in all languages. This issue is a great place to collect use cases and gauge interest.

Currently and without this capability, this needs to be solved on instrumentation side. In your case, the (sidekiq) instrumentation library could provide a configuration option to specify for which jobs you want to have parent/child relationships and for which you want to have links. I know of some instrumentation libraries (for example AWS Lambda for .NET) where such an option exists on a global level and allows switching between parent/child and link relationships.

@lmolkova
Copy link
Contributor

lmolkova commented May 22, 2024

Both static and context-based configuration for instrumentation are problematic
- Static configuration is not enough since some jobs are parented, others are linked.
- Context-based - there could be other spans nested under Context.current.setPropagationStyleHintTo('Child') which would be affected whether application author wants it or not

This sounds like a visualization concern rather than instrumentation/data collectopm issue.

@hibachrach could you please elaborate on

From a user's perspective, only the solid-line nodes in the above graph are important. The dashed lines are merely an implementation detail. Similarly, only the Spans related to the solid-line nodes are, to me as the engineer monitoring this system, significant. If the analytics-related message is not complete for another hour, that does not matter to me very much.

Could you please help me understand why is it important to use links (and not parent-child) for some of the background jobs:

  • is it a visual noise problem?
  • is it related to billing?
  • does it skew some additional analysis like e2e latency?
  • anything else?

@hibachrach
Copy link
Author

@pyohannes

Currently and without this capability, this needs to be solved on instrumentation side. In your case, the (sidekiq) instrumentation library could provide a configuration option to specify for which jobs you want to have parent/child relationships and for which you want to have links

Yes, this was my first thought as well--see open-telemetry/opentelemetry-ruby-contrib#991

@lmolkova

Both static and context-based configuration for instrumentation are problematic

  • Static configuration is not enough since some jobs are parented, others are linked.
  • Context-based - there could be other spans nested under Context.current.setPropagationStyleHintTo('Child') which would be affected whether application author wants it or not

So context based solution would not reset it for the whole subtree, but merely for the immediate child. At the discretion of the instrumentation library author for that child node, it could continue propagating. Otherwise, it would reset.

This sounds like a visualization concern rather than instrumentation/data collectopm issue.

[...]

Could you please help me understand why is it important to use links (and not parent-child) for some of the background jobs:

  • is it a visual noise problem?
  • is it related to billing?
  • does it skew some additional analysis like e2e latency?
  • anything else?

I would say it's all of those items you listed:

  • It does add a lot of noise to a trace, though I know that wouldn't be a big issue with our tracing SaaS product of choice (Honeycomb)
  • AFAIK, there's not a way generally to ensure both ends of a link get chosen if using sampling. This effectively makes links useless for us. This is the primary reason I think this makes sense to tackle at the instrumentation/collection level.
  • As of today, we aren't able to do analysis on total trace duration (merely span duration) but my hope is that we'd be able to in the future. And yes, when that day comes, having "extra" subtrees in the trace would hamper that analysis.

@dmathieu
Copy link
Member

So context based solution would not reset it for the whole subtree, but merely for the immediate child. At the discretion of the instrumentation library author for that child node, it could continue propagating. Otherwise, it would reset.

What if you want the setting to keep propagating, but the instrumentation author decided otherwise?

Yes, this was my first thought as well--see open-telemetry/opentelemetry-ruby-contrib#991

This issue is very relevant, and maybe it does entitle a specification change. But does it require an API change?
A specification change could be something that asks (official) instrumentation authors that their libraries may provide an option to propagate the parent span as a link rather than the default parent/child relationship.

cc @arielvalentin

@lmolkova
Copy link
Contributor

lmolkova commented May 23, 2024

thanks for the details @hibachrach

I do agree that all of these problems are good ones to solve, but I wonder if the proposed solution actually solves them in general case.

First, application need to customize parent/links relationships for specific spans in the system. It would improve the situation, in some case, but

  • visual noise can happen with parents or links
  • inconsistent sampling for links would force people to pick parent-child relationship and would increase the noise and skew e2e analysis
  • e2e latency would be a problem either way since only the specific application knows where the end is and which operations are fire-and-forget.

I'd suggest to tackle each of the problems individually and in a generic way (ask backends to improve links support/visualization, improve sampling with links, come up with ways to identify how to measure e2e latency, etc).

One more thing:

In many cases having links is not a choice, but a necessity - e.g. when pulling message from a queue, you don't know the context of the message(s) you're about to pull and can't use parent-child relationship. For the systems that don't have this problem, e.g. in-memory queues, parent-child may be a better default.

@github-actions github-actions bot added the triage:followup Needs follow up during triage label Sep 19, 2024
@svrnm svrnm added triage:deciding:needs-info Not enough information. Left open to provide the author with time to add more details and removed triage:deciding:community-feedback Open to community discussion. If the community can provide sufficient reasoning, it may be accepted triage:followup Needs follow up during triage labels Sep 30, 2024
@svrnm
Copy link
Member

svrnm commented Sep 30, 2024

@hibachrach is this still something you are interested in? please take a look at @lmolkova's comment

@hibachrach
Copy link
Author

I'm interested in the general problem but not married to any particular solution--if it would help to close this issue to focus on different manners of addressing it, feel free to do so.

Thanks for your checking in 😄

@github-actions github-actions bot added the triage:followup Needs follow up during triage label Oct 15, 2024
@trask trask added triage:deciding:community-feedback Open to community discussion. If the community can provide sufficient reasoning, it may be accepted and removed triage:followup Needs follow up during triage triage:deciding:needs-info Not enough information. Left open to provide the author with time to add more details labels Oct 15, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
spec:trace Related to the specification/trace directory triage:deciding:community-feedback Open to community discussion. If the community can provide sufficient reasoning, it may be accepted
Projects
None yet
Development

No branches or pull requests

7 participants