Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

javaagent Kotlin coroutines usage (with ktor client), potential bug with delay and contexts. #12611

Open
amccague opened this issue Nov 11, 2024 · 1 comment
Labels
needs author feedback Waiting for additional feedback from the author

Comments

@amccague
Copy link

amccague commented Nov 11, 2024

Whilst trying to get to grips on the expected usage of OTEL using the agent and coroutines I felt it was ambiguous the intended usage. And I'm not clear how and when the out-of-the-box javaagent behaviour kicks in. It appears like it's hard to escape peppering in the OTEL additions which looks easy for people to accidentally miss (which is not ideal!).

Let's assume this is always how the parent and root span is defined:

val span = tracer.spanBuilder("demo-client")
    .setNoParent()
    .startSpan()

And the Ktor HttpClient is defined like this, I'll omit the request prepare statement as it's not relevant (as far as I know):

HttpClient(Java) {
    install(ContentNegotiation) {
        jackson()
    }
}

Both of the following leads to new trace IDs, not using my defined span as the parent:

try {
    span.makeCurrent().use { // Presuming the scope should be closed before the span is ended?
        withContext(Dispatchers.IO) { // Expecting the agent to add the OTEL context agent here?
            post.execute { response ->
                response.body<DummyDto>()
            }
            post.execute { response ->
                response.body<DummyDto>()
            }
        }
    }
} finally {
    span.end()
}
try {
    span.makeCurrent().use { // The `withContext` isn't strictly required anyway - the ThreadLocal appears to not be propagated into the ktor-client scopes out of the box.
        post.execute { response ->
            response.body<DummyDto>()
        }
        post.execute { response ->
            response.body<DummyDto>()
        }
    }
} finally {
    span.end()
}

If I'm not mistaken this looks like the intended use, which implies there's nothing out of the box with coroutines:
Works as expected, traceIds are correct with this example

try {
    span.makeCurrent().use { // You would still want to load the ThreadLocal context, e.g. for loggers reading ThreadLocal MDC?
        withContext(span.asContextElement()) { // Could have used the OTEL context because the span was made current.
            post.execute { response ->
                response.body<DummyDto>()
            }
            post.execute { response ->
                response.body<DummyDto>()
            }
        }
    }
} finally {
    span.end()
}

In the prior example where the withContext(Dispatchers.IO) { .. } was used, whenever withContext is used we must + the coroutine context e.g.:
Works as expected

try {
    span.makeCurrent().use {
        withContext(Dispatchers.IO + Context.current().asContextElement()) {
            post.execute { response ->
                response.body<DummyDto>()
            }
            post.execute { response ->
                response.body<DummyDto>()
            }
        }
    }
} finally {
    span.end()
}

And to throw a spanner in the works if I add a delay in here it goes wrong again:
Does not work as expected

try {
    span.makeCurrent().use {
        withContext(Context.current().asContextElement()) {
            delay(1000)
            post.execute { response ->
                response.body<DummyDto>()
            }
            post.execute { response ->
                response.body<DummyDto>()
            }
        }
    }
} finally {
    span.end()
}

However if I add the Dispatchers.IO it repairs it:
Works as expected

try {
    span.makeCurrent().use {
        withContext(Dispatchers.IO + Context.current().asContextElement()) {
            delay(1000)
            post.execute { response ->
                response.body<DummyDto>()
            }
            post.execute { response ->
                response.body<DummyDto>()
            }
        }
    }
} finally {
    span.end()
}

I can also isolate the delay but it's getting very noisy at this point:
Works as expected

try {
    span.makeCurrent().use {
        withContext(Context.current().asContextElement()) {
            withContext(Context.current().asContextElement()) {
                delay(1000)
            }
            post.execute { response ->
                response.body<DummyDto>()
            }
            post.execute { response ->
                response.body<DummyDto>()
            }
        }
    }
} finally {
    span.end()
}

I'm hoping some of this usability is the result of some subtle bug or something I've missed.

Using 2.9.0 agent with

implementation(platform("io.opentelemetry.instrumentation:opentelemetry-instrumentation-bom:2.9.0"))
implementation("io.opentelemetry:opentelemetry-api")
implementation("io.opentelemetry:opentelemetry-extension-kotlin")
@laurit
Copy link
Contributor

laurit commented Nov 18, 2024

I didn't try to run and debug the snippets you provided and I don't know much about kotlin so the following is just a guess. Firstly when you open a scope with makeCurrent (this is what updates the thread local context) you must ensure that the scope is closed and that the scope is closed on the same thread it was opened and that the scopes are closed in the correct order. Failure to do so will break things. This means that you should not use span.makeCurrent().use {...} inside an suspending method around another call to a suspending method.
Have a look at https://github.com/open-telemetry/opentelemetry-java/blob/main/extensions/kotlin/src/main/java/io/opentelemetry/extension/kotlin/KotlinContextElement.java this is what the agent uses to synchronize the otel context from kotlin coroutine context to thread local. My understanding is that as the synchronization happens from coroutine context to thread local you should update the otel context in coroutine context and not update the thread local directly.
If you believe you have found a bug then please provide a minimal application that reproduces the issue, this will increase the chances that someone will take a look at your issue.

@laurit laurit added the needs author feedback Waiting for additional feedback from the author label Nov 18, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
needs author feedback Waiting for additional feedback from the author
Projects
None yet
Development

No branches or pull requests

2 participants