Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

The metric "cache.load.duration" in CaffeineCacheMetrics binder has the wrong Meter type. #5803

Open
wikthewiz opened this issue Jan 9, 2025 · 2 comments
Labels
bug A general bug
Milestone

Comments

@wikthewiz
Copy link

Describe the bug
In the binder CaffeineCacheMetrics.java (the metric cache.load.duration metric) uses a TimeGauge to store the value reported from method in totalLoadTime() in caffeine, which is an ever growing value: (https://www.javadoc.io/doc/com.github.ben-manes.caffeine/caffeine/2.2.2/com/github/benmanes/caffeine/cache/stats/CacheStats.html#totalLoadTime--).

Environment
I found this in our production environment

  • micrometer version: 1.12.4 but it is present in the latest version also
  • Prometheus
  • Linux
  • openjdk 21.0.4 2024-07-16 LTS

To Reproduce

TimeGauge.builder("cache.load.duration", cache, TimeUnit.NANOSECONDS, c -> c.stats().totalLoadTime())

Expected behavior
The expected behaviour is to have the cache.load.duration as a meter of Counter Type.

Additional context
I asked in the slack channel about this and @jonatan-ivanov kindly answered and asked me to report this a as bug. He also had some good suggestions on how what type to use. Here is a link to that answer: https://micrometer-metrics.slack.com/archives/C98P1USPR/p1736375533224759?thread_ts=1736323360.611049&cid=C98P1USPR

@jonatan-ivanov
Copy link
Member

Copying my thoughts here from Slack:

The javadoc of totalLoadTime says thatit’s an ever increasing value so you are right, maybe this should be a counter so that it signals the nature and you can easily calculate the rate on the backend. I guess the reason why this isn’t a counter is because different backends use different time units and our counters don’t have the notion of time but TimeGauge can convert the recorded value to the unit of the backend (seconds for Prometheus, millis for StatsD, etc.).

But we are not 100% consistent on that either check ProcessorMetrics.java, the cpu time is a counter there and the unit is always ns even if everything time related in your registry is in seconds.

So in the case of TimeGauge (caffeine), you will get the right unit but not the counter nature but in the case of FunctionCounter (cpu time) you will get the counter nature but the same unit for every registry.

Maybe we should use a FunctionTimer and for its count property use successCount + failureCount. Or come up with a new Meter that is like a FunctionTimer without a count property (only tracks total time), basically a FunctionCounter that can convert time units.

@jonatan-ivanov jonatan-ivanov added bug A general bug and removed waiting-for-triage labels Jan 10, 2025
@jonatan-ivanov jonatan-ivanov added this to the 1.13.x milestone Jan 10, 2025
@jonatan-ivanov
Copy link
Member

jonatan-ivanov commented Jan 10, 2025

I looked into this a little, effectively, CaffeineCacheMetrics does this:

TimeGauge.builder("cache.load.duration", cache, TimeUnit.MILLISECONDS, c -> 300).register(registry);
FunctionCounter.builder("cache.load", cache, c -> 1).tags("result", "success").register(registry);
FunctionCounter.builder("cache.load", cache, c -> 2).tags("result", "failure").register(registry);

So the output is something like this:

# HELP cache_load_total  
# TYPE cache_load_total counter
cache_load_total{result="failure"} 2.0
cache_load_total{result="success"} 1.0
# HELP cache_load_duration_seconds  
# TYPE cache_load_duration_seconds gauge
cache_load_duration_seconds 0.3

Maybe we should do this:

FunctionTimer.builder("cache.load.duration", cache, c -> 3, c -> 300, TimeUnit.MILLISECONDS).register(registry);
FunctionCounter.builder("cache.load", cache, c -> 1).tags("result", "success").register(registry);
FunctionCounter.builder("cache.load", cache, c -> 2).tags("result", "failure").register(registry);

So the output would look like like:

# HELP cache_load_total  
# TYPE cache_load_total counter
cache_load_total{result="failure"} 2.0
cache_load_total{result="success"} 1.0
# HELP cache_load_duration_seconds  
# TYPE cache_load_duration_seconds summary
cache_load_duration_seconds_count 3
cache_load_duration_seconds_sum 0.3

Other than cache_load_duration_seconds_count is redundant (equals to cache_load_total), the above (as-is) is a breaking change so we need to figure out how to keep the old time series (cache_load_duration_seconds) too.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug A general bug
Projects
None yet
Development

No branches or pull requests

2 participants