Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unable to propagate complete traceparent as a header when using CloudEvents #7699

Open
KrylixZA opened this issue Apr 15, 2024 · 9 comments
Open
Labels
kind/bug Something isn't working

Comments

@KrylixZA
Copy link

KrylixZA commented Apr 15, 2024

In what area(s)?

/area runtime
/area observability
/area pubsub

What version of Dapr?

1.13.2

Expected Behavior

When publishing CloudEvents through the Dapr client (in my case, .NET client), I expect Dapr to propagate my traceparent property through the CloudEvent over a PubSub broker to the next service in my processing flow. Ultimately (as I understand it), the parent-child relationship for traces should be such that it gets propagated all the way through and can trace all the way back to the original call. Specifically, I am using Kibana to visualise my traces. Looking at the W3C tracing standard, defined as:

{version}-{traceId}-{spanId}-{traceFlags}

It appears Zipkin and Application Insights are tying all traces together using the {traceId} portion of the standard. However, Kibana matches services together by tying the {spanId}'s together. Thus, for tracing to work on Kibana, the flow would need to look as follows:

Service 1:
  TraceId: 00-5f733a71c9421f23478dfd8e99d2fc78-d8e9c8238dd92306-01
  SpanId: d8e9c8238dd92306
  ParentId: (null) # let's say for arguments sake it's consuming a raw event from a legacy system.
  ParentSpanId: 0000000000000000

CloudEvent:
  traceId: 00-5f733a71c9421f23478dfd8e99d2fc78-91e795ad9725ad37-01
  traceparent: 00-5f733a71c9421f23478dfd8e99d2fc78-d8e9c8238dd92306-01

Service 2:
  TraceId: 00-5f733a71c9421f23478dfd8e99d2fc78-43ca06c0ca51efcc-01
  SpanId: 43ca06c0ca51efcc
  ParentId: 00-5f733a71c9421f23478dfd8e99d2fc78-d8e9c8238dd92306-01
  ParentSpanId: d8e9c8238dd92306

Full disclosure: I am not completely sure there is a bug here. I am opening this issue to get clarity of what the expected behaviour of the distributed tracing using Pub/Sub is. My gut feel says that each step creating it's own span but the {traceId} portion remains consistent is the right way for these things to work and thus is an issue with the distributed trace visualiser in Kibana.

Actual Behavior

To be clear, this is not an issue in .NET SDK but in the runtime where Dapr is automatically setting values for the traceparent and traceid fields in the CloudEvent payload and then subsequently passing them along as HTTP Headers when the "subscribing" application picks up an event off a topic.

There are 2 scenarios at play here. Potentially label them as 2 bugs:

Scenario 1: Vanilla CloudEvent

With no alterations to the standard CloudEvent class provided in the .NET SDK, Dapr sets a related but not exact match of the traceparent property. I am happy that the traceId may be set to that which the sidecar uses for it's particular span as this would theoretically create a richer distributed trace, but at the moment that's all it does. For example:

Service 1:
  TraceId: 00-5f733a71c9421f23478dfd8e99d2fc78-d8e9c8238dd92306-01
  SpanId: d8e9c8238dd92306
  ParentId: (null)
  ParentSpanId: 0000000000000000

CloudEvent:
  traceid: 00-5f733a71c9421f23478dfd8e99d2fc78-12b811d845546738-01
  traceparent: 00-5f733a71c9421f23478dfd8e99d2fc78-12b811d845546738-01 # Note this is not traceId of Service 1 but the same as the traceId suggesting it's the Dapr traceId not my application.

Service 2:
  TraceId: 00-5f733a71c9421f23478dfd8e99d2fc78-43ca06c0ca51efcc-01
  SpanId: 43ca06c0ca51efcc
  ParentId: 00-5f733a71c9421f23478dfd8e99d2fc78-729f7ada82548e3f-01 # Note this is not the traceId of Service 1, nor is it the traceparent from the CloudEvent - it got lost somewhere in the middle
  ParentSpanId: 729f7ada82548e3f

Scenario 2: Custom CloudEvent with traceparent explicitly set

In this scenario, I've created an extension of the CloudEvent class and explicitly added and set the traceparent property to that of my current traceId. This effectively forces the CloudEvent to correct traceparent value. However, it still does not make it to service 2. For example:

Service 1:
  TraceId: 00-5f733a71c9421f23478dfd8e99d2fc78-d8e9c8238dd92306-01
  SpanId: d8e9c8238dd92306
  ParentId: (null)
  ParentSpanId: 0000000000000000

CloudEvent:
  traceid: 00-5f733a71c9421f23478dfd8e99d2fc78-12b811d845546738-01
  traceparent: 00-5f733a71c9421f23478dfd8e99d2fc78-d8e9c8238dd92306-01 # Note this is now set to the traceId of Service 1

Service 2:
  TraceId: 00-5f733a71c9421f23478dfd8e99d2fc78-43ca06c0ca51efcc-01
  SpanId: 43ca06c0ca51efcc
  ParentId: 00-5f733a71c9421f23478dfd8e99d2fc78-729f7ada82548e3f-01 # Note this is not the traceId of Service 1, nor is it the traceparent from the CloudEvent - it got lost somewhere in the middle
  ParentSpanId: 729f7ada82548e3f

Steps to Reproduce the Problem

This can be easily repeated on any Pub/Sub implementation that uses CloudEvents. If using C#, be sure to add the following code to the subscriber controllers:

var traceId = Activity.Current?.Id;
var spanId = Activity.Current?.SpanId;
var parentTraceId = Activity.Current?.ParentId;
var parentSpanId = Activity.Current?.ParentSpanId;

_logger.LogDebug("Got TraceId: {TraceId}", traceId);
_logger.LogDebug("Got SpanId: {SpanId}", spanId);
_logger.LogDebug("Got ParentTraceId: {ParentTraceId}", parentTraceId);
_logger.LogDebug("Got ParentSpanId: {ParentSpanId}", parentSpanId);

Release Note

RELEASE NOTE:

@KrylixZA KrylixZA added the kind/bug Something isn't working label Apr 15, 2024
@KrylixZA
Copy link
Author

The more I think about this issue, the more it feels like a Kibana problem not a Dapr problem. The fact that it's trying to match the {spanId} of Service 1 to the {parentSpanId} of Service 2 seems insane given the {traceId} value is consistent across all services. It feels to me like Dapr is doing things exactly right.

@KrylixZA
Copy link
Author

This particular support ticket on the Elastic Support Discussions appears to zoom in on exactly what's going: Missing transactions in the traceview after a call to DAPR.

It's basically confirming my suspicion. Elastic is trying to connect spans together by looking at the parent.id field of Service 2 and trying to tie it Server 1's span.id field.

Given in any typical Pub/Sub scenario, there are 2 Dapr sidecar spans in the middle there, that would mean the {spanId} does not match and it cannot draw the full trace.

So if I understand it correctly, what happens is the following:

Service 1: defines an entirely new trace and uses that when calling the Dapr client to publish an event
Service 1 Dapr sidecar: defines a new span with the same {traceId} when publishing the CloudEvent

Service 2 Dapr sidecar: consumes the CloudEvent and defines a new span with the same {traceId} based on the CloudEvent and propagates that as a header to Service 2
Service 2: accepts the trace from the sidecar, defines a new span with the same {traceId} but different {spanId}

... repeat for each service in the processing flow

Because Kibana is not connecting the sidecar spans it's unable to connect the sidecars together to ultimately draw the bigger picture.

I wonder if we have some missing config that is not pushing our traces to Kibana. Still, it seems wrong that Kibana is trying to do a parent-child match on the {spanId} value rather than {traceId}.

@KrylixZA
Copy link
Author

I did a Google search. Turns out Kibana doesn't support Zipkin traces. I will need to try out OTLP on Dapr instead. Don't have high hopes though :(

@KrylixZA
Copy link
Author

I tried using the OTLP exporter. Definitely seeing more telemetry on my service traces now that I wasn't seeing before. However, still not seeing a distributed trace covering all spans.

In an ironic twist, when I click on the "View full trace" button, it takes me to a trace on Service 1, but doesn't render the complete trace. So it obviously has the ability to connect the spans together, it's just not doing it. Seems like a bug in Kibana.

@KrylixZA
Copy link
Author

Hey all. This is definitely not a bug in Dapr but a problem in Elastic's APM solution. Keeping this issue open as I've raised a support ticket with them and shared it with them - maybe we'll transfer it to their GitHub space in the near future.

@KrylixZA
Copy link
Author

KrylixZA commented May 7, 2024

We've dug further into this. We're seeing similar behaviours (although not quite as bad) on other APM tools. A somewhat related issue as surfaced by Zipkin is #5027.

Unfortunately this issue closed as it was stale but we are seeing very similar behaviours from our system.

@KrylixZA
Copy link
Author

KrylixZA commented May 7, 2024

So if I had to break it down, there are two problems at hand here:

  1. When using CloudEvents, it seems that some APM tools show the distributed traces as disconnected, distinct spans traces and are not linked together correctly.
  2. When using Dapr state management, the ordering of the spans in the distributed trace is lost.

@sleom
Copy link

sleom commented May 19, 2024

The behavior of Dapr is not as expected in my scenario. I’m utilizing RabbitMQ PubSub along with the .NET C# Dapr SDK. Despite overriding both the traceid and traceparent using metadata parameters (cloudevent.traceid and cloudevent.traceparent, respectively), the generated cloudevent contains identical values for both fields (equals to provided in 'cloudevent.traceid'), which should match the ones I provided.

@KrylixZA
Copy link
Author

Hi @sleom

Thanks for sharing. This is the behaviour I see too. Even if one explicitly sets the value of the traceparent field in the cloud event, what is ultimately sent to your subscriber app is not the same value.

I went down such a rabbit hole with this I may have now confused myself. To an extent, the behaviour is understandable. Each operation can have its own span. Thus it would be different between what was in the cloud event and what reaches your subscriber application. It's easy to explain that as the dapr sidecar's span as it consumes the event off the topic before making the HTTP request to your app.

That said, I'm not totally sure if that's what it should be doing. In my specific scenario with Elastic APM, the trace is broken because of it.

Maybe there is indeed a bug here that needs to be worked out. @yaron2 what do you think?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/bug Something isn't working
Projects
None yet
Development

No branches or pull requests

2 participants