Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ObservationValidator introduced in TestObservationRegistry with version 1.14 sometimes throws InvalidObservationException #5802

Open
mathiaskinzler opened this issue Jan 7, 2025 · 6 comments
Labels
feedback-provided waiting for feedback We need additional information before we can continue

Comments

@mathiaskinzler
Copy link

mathiaskinzler commented Jan 7, 2025

Describe the bug
We are using micrometer in a spring boot application and have tests in place for our micrometer trace instrumentation. After (implicitly) updating micrometer from 1.13.8 to 1.14.2, a few of our tests fail due to InvalidObservationException being thrown. Tracing this down, the reason is the newly introduced ObservationValidator.
Note that the issue is in the micrometer test jar, not in the micrometer productive code.

The error message is "Invalid stop: Observation 'xxx' has already been stopped"

The underlying reason why "stop" is tried to be executed "twice" is probably that MicrometerObservationListener (which is not in the test jar) does an observation.stop in it's doOnCancel method (the same as in it's doOnComplete). It seems that the spring boot reactive framework does not guarantee that doOnStop and doOnCancel are always called mutually exclusive. Note that the situation only occurs in a few test cases where the request is terminated by an Exception or using skipWhile on a Flux which seems to indicate that the onCancel is issued by the spring framework when disposing the request.

Environment

Spring boot 3.4.1 with managed micrometer version 1.14.2

  • Micrometer version 1.14.2
  • Micrometer registry dynatrace
  • OS: windows/linux
  • Java version: 17

To Reproduce
How to reproduce the bug:

Hard to provide a test case short of our application.

Expected behavior
Adding micrometer test must not break existing productive code. A quick solution might be to allow to create a TestObservationRegistry that does not include the ObservationValidator by default. Alternatively, allowing to pass an alternative Consumer&ltValidationResult&gt instead of throwInvalidObservationException might be an option so that validation results could be checked by the consuming tests.

Additional context
Add any other context about the problem here, e.g. related issues.

@jonatan-ivanov
Copy link
Member

jonatan-ivanov commented Jan 7, 2025

Thank you for the issue!

TestObservationRegistry's ObservationValidator tries to validate instrumentation with the Observation API and throws an exception if it finds that something is incorrect. Only for tests, it will not disrupt your app in production. See more in the docs: https://docs.micrometer.io/micrometer/reference/observation/testing.html#micrometer-observation-validator

In your case: "Invalid stop: Observation 'xxx' has already been stopped" means that something calls stop twice in your app. Micrometer has zero control over it. Also, the output should include both stop calls with class names, method names, and line numbers (even clickable in your IDE). Are those lines coming from your code or from Reactor?

The underlying reason why "stop" is tried to be executed "twice" is probably that MicrometerObservationListener (which is not in the test jar) does an observation.stop in it's doOnCancel method (the same as in it's doOnComplete). [...]

There is no MicrometerObservationListener class in Micrometer. Is it from Reactor? Is this happening because you try to instrument code using Reactor and Micrometer or is this Reactor's built-in instrumentation? Some code examples would be great.

Adding micrometer test must not break existing productive code.

As far as I understand, it does not but it breaks your tests, right? (The test registry should not be used anywhere else other than tests.)

A quick solution might be to allow to create a TestObservationRegistry that does not include the ObservationValidator by default. Alternatively, allowing to pass an alternative Consumer&ltValidationResult&gt instead of throwInvalidObservationException might be an option so that validation results could be checked by the consuming tests.

We considered doing these but decided not to, see part of the discussion here: #5300
(This does not mean that we won't change this based on user feedback but double stop is an issue that is probably in the instrumentation not in Micrometer itself.)

@jonatan-ivanov jonatan-ivanov added waiting for feedback We need additional information before we can continue and removed waiting-for-triage labels Jan 7, 2025
@mathiaskinzler
Copy link
Author

mathiaskinzler commented Jan 8, 2025

@jonatan-ivanov Thanks a whole lot for the quick and exhaustive answer. I'm sorry for not having noticed that MicrometerObservationListener is in fact from Reactor, not Micrometer itself.

We are instrumenting our code using a little utility

@Component
public class TraceHelper {

	@Autowired
	ObservationRegistry registry;

	/**
	 * Adds a distributed tracing span using {@link Mono#tap(Function)}
	 *
	 * @param name                    will be converted to lower case to avoid ugly underscores being generated
	 * @param lowCardinalityKeyValues provide additional key/value pairs for the span; the values must be "low
	 *                                cardinality", see {@link Observation#lowCardinalityKeyValue(KeyValue)}
	 */
	public <T> SignalListenerFactory<T, ?> addTraceSpan(String name, KeyValue... lowCardinalityKeyValues) {
		String lowerName = name.toLowerCase(Locale.ENGLISH);
		return Micrometer.observation(registry, makeObservation(lowerName, lowCardinalityKeyValues));
	}

	private Function<ObservationRegistry, Observation> makeObservation(String name,
			KeyValue... lowCardinalityKeyValues) {
		return r -> {
			Observation observation = Observation.createNotStarted(name, r).contextualName(name);
			for (KeyValue kv : lowCardinalityKeyValues) {
				observation.lowCardinalityKeyValue(kv);
			}
			return observation;
		};
	}

}

which we use like

Mono.tap(traceHelper.addTraceSpan("somename",
		KeyValue.of("someKey", "someValue")))

As far as I understand, it does not but it breaks your tests, right? (The test registry should not be used anywhere else other than tests.)

Maybe my wording was a bit imprecise: the issue is not happening in the productive environment, but as soon as we use TestObservationRegistry as ObservationRegistry (by adding a @TestConfiguration), the issue appears and results in the observed Exceptions. Since the MicrometerObservationListener calls Observation.stop() twice (once in ´doOnNext´ and once on ´doOnCancel´), this is how the validator works, but the behavior of the validator is more strict than the behavior of SimpleObservation (which just accepts multiple stop() calls as far as I understand the code). Instead of testing for the correct instrumentation, our tests fail.

Not sure where to go from here. We could of course ask the Reactor colleagues to make their reactor-core-micrometer library smarter and issue only one stop(), but this may not exactly be trivial and pretty much impossible to test as long as the behavior of the validator differs from the behavior of SimpleObservation. What do you think?

@jonatan-ivanov
Copy link
Member

I think .tap would be the right thing to do and what you do seems ok to me but I have never used .tap like this before, let me shoot the bat signal: @chemicL. :)
In the meantime, could you please show your reactive pipeline (what is around .tap)?

@chemicL could you please help us to figure out what is happening above? It seems stop is called twice.

@mathiaskinzler
Copy link
Author

mathiaskinzler commented Jan 8, 2025

@jonatan-ivanov

could you please show your reactive pipeline

This is a bit non-trivial... The reactive pipeline (Flux) is running for several "tenants". We skip some items depending on the outcome of the first request, (first request being an OIDC authorization exchange and second request being a backend call with the retrieved token). This is done by

.flatMap(tenant -> serverAuthorization.backendAccessTokenFor(tenant)
	.flatMap(token -> callService(backendRequest, token)), 1) // 1 is the concurrency of the outer flatMap
.skipWhile(serverAuthorization::backendRespondedForbidden).take(1, true).next() //

This comment has been minimized.

@shakuzen
Copy link
Member

Don't mind the bot. We'll wait to hear from @chemicL once he has a chance to respond.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feedback-provided waiting for feedback We need additional information before we can continue
Projects
None yet
Development

No branches or pull requests

3 participants