gRPC thread pool saturated by blocked Identity calls #18697

npepinpe · 2024-05-22T06:43:59Z

Description

We use a custom instance of the ForkJoinPool in the gateway to handle our gRPC service logic. Meaning, a gRPC request comes in, it's passed to the Netty event executor loop, and then when it's ready to execute the business logic (e.g. EndpointManager), it's passed to our ForkJoinPool.

This pool is configurable by users, with a minimum/maximum number of threads. By default, the parallelism level (core pool size) is the number of CPU cores, and the max number of threads is twice the number of CPU cores.

This works fine with our normal code, since all of it is asynchronous. However, we do use the Identity SDK to retrieve credentials and tenant IDs, which performs some HTTP calls. These calls are blocking, and will end up blocking the ForkJoinPool thread using ForkJoinPool#managedBlocker. This causes the current thread to be marked as blocked, and the pool to try and compensate this by spawning another thread (to maintain the level of parallelism).

However, if you've reached the maximum number of threads (e.g. 2 * cpuCores), then an exception is thrown, something like this:

io.camunda.zeebe.gateway.interceptors.impl.IdentityInterceptor - Denying call gateway_protocol.Gateway/ActivateJobs as the authorized tenants could not be retrieved from Identity. Error message: Thread limit exceeded replacing blocked worker

Since we're clearly blocked on I/O, I think we could likely compensate and spawn more threads (of course, eventually too many threads is an issue of its own). But this leads us to two options:

Can Identity be configured to doing things async and not block? This would avoid blocking threads, so we can keep the parallelism up without spawning tons of threads. But this leads to more "complex" code, especially for other places where things don't need to be async.
We can have a higher maximum number of threads. For example, the default common ForkJoinPool allows 256 + Math.max(1, Runtime.getRuntime().availableProcessors() - 1). The downside here is, I honestly can't tell why 256. The actual bound for the max number of threads is 0x7fff, any other value given will be clamped down to this. From their docs:

The default value for common pool maxSpares. Overridable using the "java.util.concurrent.ForkJoinPool.common.maximumSpares" system property. The default value is far in excess of normal requirements, but also far short of MAX_CAP and typical OS thread limits, so allows JVMs to catch misuse/abuse before running out of resources needed to do so.

I would lean towards option 2, but there's still the downside that if users configure a small number of threads, and then will run into this error. But we could start with that - change the default of the max number of gRPC threads to be 256 + Math.max(1, Runtime.getRuntime().availableProcessors() - 1), and still let users completely override it if they want.

The text was updated successfully, but these errors were encountered:

npepinpe · 2024-05-23T12:14:34Z

ZDP:

Increase default to 256.
Switch the executor to ThreadPerTaskExecutor (something like that, maybe not the exact name) with virtual threads.
Ratio of effort to impact is high (low effort), so I would say let's do it soon.
Timebox 1-2h to try the virtual threads approach with gRPC.

I will ping the Identity team about having asynchronous alternatives for their SDK.

npepinpe · 2024-05-31T13:49:41Z

Update on Identity, they're open to it but don't see a big need, and we may want to wait for the mono repo merge before looking into it. Though if we move to virtual thread, there is no real pressure to do any async things here.

## Description ## Related issues closes #18697

npepinpe added kind/toil Categorizes an issue or PR as general maintenance, i.e. cleanup, refactoring, etc. component/zeebe Related to the Zeebe component/team labels May 22, 2024

jwulf mentioned this issue May 22, 2024

Intermittent error on multi-tenancy: "Error: 16 UNAUTHENTICATED: Expected Identity to provide authorized tenants, see cause for details" camunda/camunda-8-js-sdk#53

Closed

npepinpe assigned EuroLew May 30, 2024

EuroLew mentioned this issue May 31, 2024

change grpc executor to virtual threads #19005

Merged

github-merge-queue bot pushed a commit that referenced this issue Jun 3, 2024

change grpc executor to virtual threads (#19005)

c4078d8

## Description ## Related issues closes #18697

npepinpe closed this as completed in #19005 Jun 3, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

gRPC thread pool saturated by blocked Identity calls #18697

gRPC thread pool saturated by blocked Identity calls #18697

npepinpe commented May 22, 2024

npepinpe commented May 23, 2024

npepinpe commented May 31, 2024

gRPC thread pool saturated by blocked Identity calls #18697

gRPC thread pool saturated by blocked Identity calls #18697

Comments

npepinpe commented May 22, 2024

npepinpe commented May 23, 2024

npepinpe commented May 31, 2024