Skip to content

SOLR-17432: Allow OTel Java Agent #2687

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

dsmiley
Copy link
Contributor

@dsmiley dsmiley commented Sep 2, 2024

https://issues.apache.org/jira/browse/SOLR-17432

Admittedly there's a lot going on here; the PR is more of a refactoring / minor improvements that also happens to support the Otel agent too, as it's a small aspect. I can separate this PR. A number of things were gnawing at me, which I started to deal with.

API Change: TracerConfigurator subclasses now create an OpenTelemetry not a Tracer. Since we depend on GlobalOpenTelemetry, if a configurator were to return some custom Tracer thing (independent of an OpenTelemetry), it's not clear how we would install it into a GlobalOpenTelemetry. Maybe possible (?) but doesn't seem a good direction. We could stop our use of GlobalOpenTelemetry easily as it's only used in one spot that could actually get the Tracer from elsewhere. Maybe we should do that anyway, but nonetheless installing it is something we should continue to do.

Refactorings / changes:

  • removed all tracing config from MiniSolrCloudCluster
  • SolrTestCase resets tracing now
  • Only TracerConfigurator should call GlobalOpenTelemetry.set (removed from SimplePropagator & OtelTracerConfigurator) and it does so in exactly one spot.
  • Tests examining spans now use OpenTelemetryRule. Removed CustomTestOtelTracerConfigurator.java

todo: test Custom TracerConfigurator ?
Do we even have a test actually using OtelTracerConfigurator.java?
Admittedly there's no test here for the Agent approach. Would need a bats test to do right, and it'd need to go download a 20MB jar. I tested it manually when running a test (to debug a test) and I have used it in Solr 9.

Refactorings:
* removed all tracing config from MiniSolrCloudCluster
* SolrTestCase resets tracing
* Only TracerConfigurator should call GlobalOpenTelemetry.set (removed from SimplePropagator & OtelTracerConfigurator)
* TracerConfigurator subclasses create a OpenTelemetry not Tracer.
* Tests examining spans now use OpenTelemetryRule. Removed CustomTestOtelTracerConfigurator.java

todo: test Custom TracerConfigurator ?
}

try {
GlobalOpenTelemetry.set(otel); // throws IllegalStateException if already set
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is the one spot where we call this. Previously it was in multiple places which was confusing and hard to reason about.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe out of scope but I feel like this whole setting of GlobalOpenTelemetry.set(otel) should sit in its own function called loadGlobalOpenTelemetry and be called as early as possible as per its documentation maybe inside the constructor of CoreContainer. Then on initialization of Solr core container, to set it's tracer just call TraceUtils.getGlobalTracer(). But also this whole file is called TracerConfigurator when a lot of work is configuring OTel then just returning a static global tracer. This same logic is still used for metrics as it uses the same Java Agent and OTLP exporter from the module not just Tracer loading.

// skips needless ExecutorUtil.addThreadLocalProvider below
if (otel == OpenTelemetry.noop()) return otel.getTracer("solr");
} catch (IllegalStateException e) {
log.info("GlobalOpenTelemetry was already initialized (a Java agent?); using that");
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that its common tests will log "OpenTelemetry loaded with simple propagation only." but then log immediately after "GlobalOpenTelemetry was already initialized (a Java agent?); using that" if the test uses more than one CoreContainer (Solr node). I so wish GlobalOpenTelemetry had a way to know if it's been set already.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A possible hacky work-around would be to get() initially and check for noop. If not noop, do nothing (keep current impl). If noop, create one according to our logic and then call resetForTest() then set(). Hacky because we'd be calling resetForTest in a production scenario.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm. if I remember correctly this is why I created CustomTestOtelTracerConfigurator so tests that race to set this would not fail. it felt awkward to have to catch IllegalStateException for problems that are triggered only via testing code.
I also remember there was an issue with checking for noop? there were some docs that were stating you should not check if the impl is noop, but I can't find a reference now.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

coming back to this. not sure what your plans are, but I will reshare this approach I had in the SimplePropagator
https://github.com/apache/solr/pull/2687/files#diff-e61b6fa689b156777ba2ecb37be366c4c312199302a2c90425acb460dbdb3f0eL57 - I think using a flag to keep track of status works to get around the issue of multiple init calls.

}
return TraceUtils.getGlobalTracer();

ExecutorUtil.addThreadLocalProvider(new ContextThreadLocalProvider());
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now called in exactly one spot, which I find easier to read than before

return autoConfigOTEL(loader);
/** Initializes {@link io.opentelemetry.api.GlobalOpenTelemetry} and returns a Tracer. */
public static synchronized Tracer loadTracer(SolrResourceLoader loader, PluginInfo info) {
// synchronized to avoid races in tests starting Solr concurrently setting GlobalOpenTelemetry
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't personally encounter an issue here but saw existing comments (from Alex?) that suggested to me making this synchronized would be helpful. I don't think there's a true issue; ultimately GlobalOpenTelemetry will be set exactly once so it doesn't much matter if others try and race to do so.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure synchronized is needed. the trouble I had seen were from tests trying to race to init this and failing because otel was already set. I think it was happening when setting up a cluster with a few nodes, but I need to revisit some of this to check.

}

protected abstract Tracer getTracer();
protected abstract OpenTelemetry createOtel();
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is the API change

@@ -49,14 +48,13 @@ public OtelTracerConfigurator() {
}

@Override
public Tracer getTracer() {
return TraceUtils.getGlobalTracer();
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A needless GlobalOpenTelemetry reference

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry, could you clarify what this means?

prepareConfiguration(args);
AutoConfiguredOpenTelemetrySdk.initialize();
public OpenTelemetry createOtel() {
return AutoConfiguredOpenTelemetrySdk.builder().build().getOpenTelemetrySdk();
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

notice the change to not call initialize. That method arranges for GlobalOpenTelemetry to be set. I only want that set in TracerConfigurator.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this diff would have been easier to read if the 2 methods were not swapped.

@@ -34,8 +34,6 @@
<int name="connTimeout">${connTimeout:15000}</int>
</shardHandlerFactory>

<tracerConfig name="tracerConfig" class="org.apache.solr.opentelemetry.CustomTestOtelTracerConfigurator" />
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps this class could return but only for a test explicitly testing that tracerConfig works and in a separate solr.xml for exactly that test as well (not for the other tracing tests). Not to be used for other tests or trying to collect spans for examination.

import org.junit.Test;

public class BasicAuthIntegrationTracingTest extends SolrCloudTestCase {

private static final String COLLECTION = "collection1";

@ClassRule public static OpenTelemetryRule otelRule = OpenTelemetryRule.create();
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was about to write an equivalent of OpenTelemetryRule when I found that it already exists :-)

FYI note it has docs that it doesn't work in a ClassRule yet I see it works here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is pretty cool, I did not know this existed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this thing was strange; didn't need to exist, at least not as it existed

This way, if we're running a test with tracing, thus Span::isRecording is true, it *will* record.  Previously it usually wouldn't unless the test called resetRecordingFlag.  No need now.
Copy link
Contributor

@janhoy janhoy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't have the bandwidth to do a full review and test now. But I agree on the goals and according to PR description, this is a nice evolution of our API to allow more ways to configure tracing and get more spans from various libraries free.

I'd perhaps like to see some screenshot of a Jaeger trace with the existing otel module before and after the change, to verify that we don't get unintentional changes? I suppose, if you use an agent, our manual instrumentation will still kick in, in addition?

@dsmiley
Copy link
Contributor Author

dsmiley commented Sep 15, 2024

When I run the agent, I provide an Otel properties file with: (an excerpt)

otel.metrics.exporter=none
otel.logs.exporter=none
otel.instrumentation.common.default-enabled=false
otel.instrumentation.opentelemetry-api.enabled=true
otel.instrumentation.opentelemetry-instrumentation-annotations.enabled=true
otel.instrumentation.aws-sdk.enabled=true
otel.instrumentation.java-http-client.enabled=true

Notice I disable auto-instrumentation by default and selectively enable the specific instrumentations useful to me. This works with Solr's manual instrumentation of itself as well because of enabling "opentelemetry-api". All this is documented quite well in Otel.

I don't see how this PR would have unintentional changes on existing traces/spans since they are tested and the PR doesn't change them. Any way, I'm happy to share a screenshot later.

This PR misses documentation to let users know they don't have to enable the opentelemetry module just to use opentelemetry :-). (an aside -- nor do we tell users they don't have to have an HDFS cluster to make use of the HDFS Directory). Also I'd like to somewhere document a tip on how to use tracing during development / debugging of a unit test.

@dsmiley
Copy link
Contributor Author

dsmiley commented Sep 15, 2024

I'm a little unhappy with keeping TracerConfigurator named this while having it return an OpenTelemetry. Really; we're embracing OpenTelemetry here. I would love to use OpenTelemetry for metrics eventually, to be honest.

@janhoy
Copy link
Contributor

janhoy commented Sep 17, 2024

I just saw SOLR-17455 coming in. While mentioning OpenTracing is correct for 9.x, for 10.0 it is gone, so perhaps when you add refGuide docs for this PR, you can also s/OpenTracing/OpenTelemetry/ in the guide?

Copy link

This PR has had no activity for 60 days and is now labeled as stale. Any new activity will remove the stale label. To attract more reviewers, please tag people who might be familiar with the code area and/or notify the [email protected] mailing list. To exempt this PR from being marked as stale, make it a draft PR or add the label "exempt-stale". If left unattended, this PR will be closed after another 60 days of inactivity. Thank you for your contribution!

@github-actions github-actions bot added the stale PR not updated in 60 days label Nov 17, 2024
@dsmiley dsmiley added the exempt-stale Prevent a PR from going stale label Jan 13, 2025
@dsmiley
Copy link
Contributor Author

dsmiley commented Apr 4, 2025

I very much want to get back to this for Solr 10... need to maybe reformulate what's happening into multiple things.
CC @mlbiscoc

@stillalex
Copy link
Member

@dsmiley is this something you are still looking at? I will try to find time to review over the next weeks.

@@ -54,19 +46,6 @@ public class SimplePropagator implements TextMapPropagator {

private static final AtomicLong traceCounter = new AtomicLong(0);

private static volatile boolean loaded = false;

public static synchronized Tracer load() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 moving this to TracerConfigurator is a good change.

@stillalex
Copy link
Member

+1 this looks good to me. when you have some time, could you rebase? curious to see what errors are left.
I think the open question is the issue of multiple init calls causing that OTEL exception during tests.

@dsmiley
Copy link
Contributor Author

dsmiley commented May 1, 2025

I'll come back to this today/tomorrow. I really appreciate the close look @stillalex :-)
My biggest question is really scoping. This PR is a bit all over the place; it will take some work to break it into more meaningful chunks. I'll think about this further when I look closer again.

# Conflicts:
#	solr/test-framework/src/java/org/apache/solr/SolrTestCase.java
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants