Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Exponential histogram throws ArrayIndexOutOfBoundsException #5740

Closed
sfc-gh-dguy opened this issue Dec 11, 2024 · 11 comments · Fixed by #5749
Closed

Exponential histogram throws ArrayIndexOutOfBoundsException #5740

sfc-gh-dguy opened this issue Dec 11, 2024 · 11 comments · Fixed by #5749
Labels
bug A general bug registry: otlp OpenTelemetry Protocol (OTLP) registry-related
Milestone

Comments

@sfc-gh-dguy
Copy link

sfc-gh-dguy commented Dec 11, 2024

Describe the bug
We get this exception when using a Timer to emit exponential histograms. It seems like it's trying to put a value in bucket index 254614 while there are only 40 buckets (40 is what we set for max number of buckets)

We are currently still trying to log this error when it happens to understand what value produces this exception. We'll update here once we'll have more information.

This is the stack trace:

caused by ArrayIndexOutOfBoundsException: Index 254614 out of bounds for length 40
...
java.lang.invoke.VarHandle$1.apply(VarHandle.java:2187)
java.lang.invoke.VarHandle$1.apply(VarHandle.java:2184)
jdk.internal.util.Preconditions$1.apply(Preconditions.java:177)
jdk.internal.util.Preconditions$1.apply(Preconditions.java:174)
jdk.internal.util.Preconditions.outOfBounds(Preconditions.java:62)
jdk.internal.util.Preconditions.outOfBoundsCheckIndex(Preconditions.java:70)
jdk.internal.util.Preconditions.checkIndex(Preconditions.java:266)
java.lang.invoke.VarHandleLongs$Array.getAndAdd(VarHandleLongs.java:926)
java.util.concurrent.atomic.AtomicLongArray.addAndGet(AtomicLongArray.java:259)
io.micrometer.registry.otlp.internal.CircularCountHolder.increment(CircularCountHolder.java:84)
io.micrometer.registry.otlp.internal.Base2ExponentialHistogram.recordDouble(Base2ExponentialHistogram.java:172)
io.micrometer.registry.otlp.internal.Base2ExponentialHistogram.recordLong(Base2ExponentialHistogram.java:151)
io.micrometer.core.instrument.AbstractTimer.record(AbstractTimer.java:261)
io.micrometer.core.instrument.composite.CompositeTimer.record(CompositeTimer.java:54)
  • Micrometer version: 1.14.0
  • Micrometer registry: OTLP
  • OS: linux
  • Java version: 11
@jonatan-ivanov
Copy link
Member

jonatan-ivanov commented Dec 11, 2024

Thank you for the issue!

I've edited your comment to improve the formatting. You might want to check out this Mastering Markdown guide for future reference.

For the record, this discussion started on Slack: https://micrometer-metrics.slack.com/archives/C662HUJC9/p1733775869763389

I understand that you are still working on a reproducer but could you please provide a bit more details?

  • All the details that the issue template asked from you (e.g.: it seems you are using Micrometer 1.14.0 but it is not in the issue, could you please answer all of the questions there?)
  • Can you provide the rest of the stacktrace?
    • Where is this coming from? Is this a library/framework instrumentation or your own?
  • How is the Timer created and used?
  • What else do you configure? (you said you have max bucket count of 40, what else?)

@jonatan-ivanov jonatan-ivanov added waiting for feedback We need additional information before we can continue and removed waiting-for-triage labels Dec 11, 2024
@shakuzen shakuzen added the registry: otlp OpenTelemetry Protocol (OTLP) registry-related label Dec 12, 2024
@sfc-gh-dguy
Copy link
Author

sfc-gh-dguy commented Dec 12, 2024

  • added the details to the bug description
  • "Where is this coming from? Is this a library/framework instrumentation or your own?" - our own code.
  • "How is the Timer created and used?":
Timer timer

        timer =
            timerPool.computeIfAbsent(
                timerKey,
                (key) ->
                    Timer.builder(
                            String.format(TIMER_NAME_FORMAT, "StorageClient", storageType.name()))
                        .tag(API_TAG, api)
                        .publishPercentileHistogram()
                        .register(Metrics.globalRegistry));
      timer.record(currentMillis - callStartTime, TimeUnit.MILLISECONDS);
  • "What else do you configure?"
 OtlpConfig otlpConfig(
      MetricsProperties metricsProperties, List<MeterExportFilter> meterExportFilterList) {
    return new OtlpConfig() {

      @Override
      public @NotNull Duration step() {
        return Duration.ofSeconds(metricsProperties.getOtlpPeriodSec());// 60
      }

      @Override
      public @NotNull HistogramFlavor histogramFlavor() {
        return HistogramFlavor.BASE2_EXPONENTIAL_BUCKET_HISTOGRAM;
      }

      @Override
      public @NotNull int maxBucketCount() {
        return metricsProperties.getMaxBucketCount();// 40
      }

      @Override
      public String get(String s) {
        return null;
      }
    };
  }

@sfc-gh-dguy
Copy link
Author

Full stack trace:

...
caused by ArrayIndexOutOfBoundsException: Index 24965 out of bounds for length 40
...
java.lang.invoke.VarHandle$1.apply(VarHandle.java:2187)
java.lang.invoke.VarHandle$1.apply(VarHandle.java:2184)
jdk.internal.util.Preconditions$1.apply(Preconditions.java:177)
jdk.internal.util.Preconditions$1.apply(Preconditions.java:174)
jdk.internal.util.Preconditions.outOfBounds(Preconditions.java:62)
jdk.internal.util.Preconditions.outOfBoundsCheckIndex(Preconditions.java:70)
jdk.internal.util.Preconditions.checkIndex(Preconditions.java:266)
java.lang.invoke.VarHandleLongs$Array.getAndAdd(VarHandleLongs.java:926)
java.util.concurrent.atomic.AtomicLongArray.addAndGet(AtomicLongArray.java:259)
io.micrometer.registry.otlp.internal.CircularCountHolder.increment(CircularCountHolder.java:84)
io.micrometer.registry.otlp.internal.Base2ExponentialHistogram.recordDouble(Base2ExponentialHistogram.java:172)
io.micrometer.registry.otlp.internal.Base2ExponentialHistogram.recordLong(Base2ExponentialHistogram.java:151)
io.micrometer.core.instrument.AbstractTimer.record(AbstractTimer.java:261)
io.micrometer.core.instrument.composite.CompositeTimer.record(CompositeTimer.java:54)
com.snowflake.common.cloud.storage.StorageClient.logClientLibraryCall(StorageClient.java:1062)
com.snowflake.common.cloud.storage.StorageClient.callCloudPlatformStorageApi(StorageClient.java:1098)
com.snowflake.common.cloud.storage.GcsStorageClient.callGcsApi(GcsStorageClient.java:323)
com.snowflake.common.cloud.storage.GcsStorageClient.getBucketLocation(GcsStorageClient.java:2125)
com.snowflake.common.cloud.storage.GcsStoragePlatform.lookupRegion(GcsStoragePlatform.java:598)
com.snowflake.common.cloud.storage.GcsStoragePlatform.lambda$findRegionForLocation$0(GcsStoragePlatform.java:568)
com.google.common.cache.LocalCache$LocalManualCache$1.load(LocalCache.java:4903)
com.google.common.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3574)
com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2316)
com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2189)
com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2079)
com.google.common.cache.LocalCache.get(LocalCache.java:4017)
com.google.common.cache.LocalCache$LocalManualCache.get(LocalCache.java:4898)
com.snowflake.common.cloud.storage.StoragePlatform.lookupRegion(StoragePlatform.java:556)
com.snowflake.common.cloud.storage.GcsStoragePlatform.findRegionForLocation(GcsStoragePlatform.java:568)
com.snowflake.common.cloud.storage.StorageUtil.findRegionForLocation(StorageUtil.java:4244)
com.snowflake.metadata.dictionary.Volume.getRegionInternalName(Volume.java:2488)
com.snowflake.metadata.dictionary.Volume.setEndPointForFips(Volume.java:4012)
com.snowflake.storage.VolumeProvider.lambda$adjustSecureTransportForVolumes$0(VolumeProvider.java:245)
java.lang.Iterable.forEach(Iterable.java:75)
com.snowflake.storage.VolumeProvider.adjustSecureTransportForVolumes(VolumeProvider.java:238)
com.snowflake.core.querycoordinator.XpStepAttemptExecution.finalizeExecutionPlan(XpStepAttemptExecution.java:1812)
com.snowflake.core.querycoordinator.XpStepAttemptExecution.finalizeAndScheduleOnWarehouse(XpStepAttemptExecution.java:954)
com.snowflake.core.querycoordinator.XpStepAttemptExecution.startExecutionStepAttemptOnPlatform(XpStepAttemptExecution.java:1153)
com.snowflake.core.querycoordinator.ExecutionQueryCoordinator.tryExecuteOnPlatform(ExecutionQueryCoordinator.java:1544)
com.snowflake.core.querycoordinator.ExecutionQueryCoordinator.executeOnPlatform(ExecutionQueryCoordinator.java:1353)
com.snowflake.core.querycoordinator.ExecutionQueryCoordinator.executeOnPlatform(ExecutionQueryCoordinator.java:1327)
com.snowflake.sql.execution.ExecStep.executeOnPlatform(ExecStep.java:728)
com.snowflake.sql.execution.ExecStep.execute(ExecStep.java:478)
com.snowflake.core.V2QueryCoordinator.executeProgram(V2QueryCoordinator.java:726)
com.snowflake.core.V2QueryCoordinator.execute(V2QueryCoordinator.java:3919)
com.snowflake.core.V2QueryCoordinator.callInternal(V2QueryCoordinator.java:1850)
com.snowflake.util.StackFrames.call(StackFrames.java:180)
com.snowflake.core.V2QueryCoordinator.call(V2QueryCoordinator.java:1641)
com.snowflake.core.V2QueryCoordinator.call(V2QueryCoordinator.java:302)
com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:131)
com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:76)
com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:82)
io.micrometer.context.ContextSnapshot.lambda$wrap$0(ContextSnapshot.java:91)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
java.lang.Thread.run(Thread.java:840)]

@jonatan-ivanov
Copy link
Member

Thank you for the details.
I don't think any of this will resolve the issue but a few notes about the code above:

  • I assume timerPool is some sort of a Map. What you do here is also done by the registry (except the registry handles more use-cases like transformed meter names). .register(...) is actually a getOrCreateNew(...) with a cache backed by a Map, like what you do above. Also, you might want to look into the MeterProvider concept. Please don't use a second Map for this and use the Builder or MeterProvider directly.
  • From currentMillis, I assume you call System.currentTimeMillis(). It's a very common mistake, please don't use it to measure elapsed time, here's why. Instead you can use Timer.Sample or record(Runnable/Callable).
  • I personally don't recommend the usage of the global registry (Metrics.globalRegistry). It's hard to test and it can cause other issues due to its static nature. Use a registry instance instead and inject in into your components.

@jonatan-ivanov
Copy link
Member

jonatan-ivanov commented Dec 12, 2024

Yesterday I tried reproduce this, basically had the same config as yours but I was not able to so a reproducer would be really appreciated.
Also, now that I posted my previous comment and said it probably does not solve the issue, could you please try out what happens if you solve the second point? Please read the article I linked and use Timer.Sample or record(Runnable/Callable) instead of trying to measure elapsed time on your own. It can happen that you get some weird results (i.e.: record negative numbers). I tried negative values as well yesterday when I tried to repro your issue but it was ok for me, though just for sure, could you please fix that issue and check if you still get an exception?

Also, how frequent is this issue?

@lenin-jaganathan
Copy link
Contributor

I guess this might be related to concurrent around re-scaling. That is the only thing I could think of when the index returned seems to be very high. One thing I was initially tempted to do was full synchronization of the histogram but thought that could have negative performance issues. Let me see if I can reproduce this with some concurrency tests.

@sfc-gh-dguy Is this issue re-producible or is it very hard to re-produce (later would mean this might be related to concurrency but former might hint at something else)?

@sfc-gh-tyuan
Copy link

@lenin-jaganathan @jonatan-ivanov We reverted the change since it was impacting our production environment. It definitely seems to be a concurrency issue, since it only created an incident in our workload tests and production environment. @sfc-gh-dguy and I are adding a try/catch around the record call to safely reproduce the issue. We will update you when we have more information.

@sfc-gh-tyuan
Copy link

sfc-gh-tyuan commented Dec 19, 2024

@jonatan-ivanov @lenin-jaganathan Here are some additional logs:

Error in recording the amount and unit. amount: 3856360 unit: NANOSECONDS

Index 4971046 out of bounds for length 40
Error in recording the amount and unit. amount: 10708793 unit: NANOSECONDS

Index 40 out of bounds for length 40

Stacktrace for both:

[{"className":"java.lang.invoke.VarHandle$1","filename":"VarHandle.java","lineNumber":2187,"methodName":"apply"},
{"className":"java.lang.invoke.VarHandle$1","filename":"VarHandle.java","lineNumber":2184,"methodName":"apply"},
{"className":"jdk.internal.util.Preconditions$1","filename":"Preconditions.java","lineNumber":177,"methodName":"apply"},
{"className":"jdk.internal.util.Preconditions$1","filename":"Preconditions.java","lineNumber":174,"methodName":"apply"},
{"className":"jdk.internal.util.Preconditions","filename":"Preconditions.java","lineNumber":62,"methodName":"outOfBounds"},
{"className":"jdk.internal.util.Preconditions","filename":"Preconditions.java","lineNumber":70,"methodName":"outOfBoundsCheckIndex"},
{"className":"jdk.internal.util.Preconditions","filename":"Preconditions.java","lineNumber":266,"methodName":"checkIndex"},
{"className":"java.lang.invoke.VarHandleLongs$Array","filename":"VarHandleLongs.java","lineNumber":768,"methodName":"getVolatile"},
{"className":"java.util.concurrent.atomic.AtomicLongArray","filename":"AtomicLongArray.java","lineNumber":95,"methodName":"get"},
{"className":"io.micrometer.registry.otlp.internal.CircularCountHolder","filename":"CircularCountHolder.java","lineNumber":55,"methodName":"getValueAtIndex"},
{"className":"io.micrometer.registry.otlp.internal.Base2ExponentialHistogram","filename":"Base2ExponentialHistogram.java","lineNumber":196,"methodName":"downScale"},
{"className":"io.micrometer.registry.otlp.internal.Base2ExponentialHistogram","filename":"Base2ExponentialHistogram.java","lineNumber":175,"methodName":"recordDouble"},
{"className":"io.micrometer.registry.otlp.internal.Base2ExponentialHistogram","filename":"Base2ExponentialHistogram.java","lineNumber":151,"methodName":"recordLong"},
{"className":"io.micrometer.core.instrument.AbstractTimer","filename":"AbstractTimer.java","lineNumber":261,"methodName":"record"},
{"className":"io.micrometer.core.instrument.composite.CompositeTimer","filename":"CompositeTimer.java","lineNumber":54,"methodName":"record"},
{"className":"com.snowflake.metrics.sdk.meters.SfTimer","filename":"SfTimer.java","lineNumber":47,"methodName":"lambda$record$0"},
{"className":"com.snowflake.metrics.sdk.meters.SfMeterBuilder","filename":"SfMeterBuilder.java","lineNumber":31,"methodName":"handleExceptionVoid"},
{"className":"com.snowflake.metrics.sdk.meters.SfTimer","filename":"SfTimer.java","lineNumber":47,"methodName":"record"},
{"className":"com.snowflake.metadata.entitycache.EntityDpoCacheStats","filename":"EntityDpoCacheStats.java","lineNumber":998,"methodName":"updateGetLatencyTimer"},
{"className":"com.snowflake.metadata.entitycache.EntityDpoCache","filename":"EntityDpoCache.java","lineNumber":917,"methodName":"get"},
{"className":"com.snowflake.metadata.entitycache.EntityDpoCache","filename":"EntityDpoCache.java","lineNumber":764,"methodName":"get"},
...]

The difference is this stacktrace happens during downscaling while the previous does not.

@lenin-jaganathan
Copy link
Contributor

I have added this PR which would fix these concurrency issues. If you have a reproducer that would catch, please share it. Otherwise, I will try to work on concurrency tests sometime (possibly before next patch)

@lenin-jaganathan
Copy link
Contributor

I was able to reproduce this in the concurrency stress tests here. Updated the PR with results from before and after changes.

@shakuzen shakuzen added bug A general bug and removed waiting for feedback We need additional information before we can continue labels Dec 25, 2024
@shakuzen shakuzen added this to the 1.14.3 milestone Dec 25, 2024
@shakuzen
Copy link
Member

Thanks to @lenin-jaganathan, a fix is in the latest snapshots 1.14.3-SNAPSHOT. Let us know if you're seeing any issues with that, and thank you @sfc-gh-dguy and @sfc-gh-tyuan for all the details in reporting this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug A general bug registry: otlp OpenTelemetry Protocol (OTLP) registry-related
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants