Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

gRPC thread pool saturated by blocked Identity calls #18697

Closed
npepinpe opened this issue May 22, 2024 · 2 comments · Fixed by #19005
Closed

gRPC thread pool saturated by blocked Identity calls #18697

npepinpe opened this issue May 22, 2024 · 2 comments · Fixed by #19005
Assignees
Labels
component/zeebe Related to the Zeebe component/team kind/toil Categorizes an issue or PR as general maintenance, i.e. cleanup, refactoring, etc.

Comments

@npepinpe
Copy link
Member

Description

We use a custom instance of the ForkJoinPool in the gateway to handle our gRPC service logic. Meaning, a gRPC request comes in, it's passed to the Netty event executor loop, and then when it's ready to execute the business logic (e.g. EndpointManager), it's passed to our ForkJoinPool.

This pool is configurable by users, with a minimum/maximum number of threads. By default, the parallelism level (core pool size) is the number of CPU cores, and the max number of threads is twice the number of CPU cores.

This works fine with our normal code, since all of it is asynchronous. However, we do use the Identity SDK to retrieve credentials and tenant IDs, which performs some HTTP calls. These calls are blocking, and will end up blocking the ForkJoinPool thread using ForkJoinPool#managedBlocker. This causes the current thread to be marked as blocked, and the pool to try and compensate this by spawning another thread (to maintain the level of parallelism).

However, if you've reached the maximum number of threads (e.g. 2 * cpuCores), then an exception is thrown, something like this:

io.camunda.zeebe.gateway.interceptors.impl.IdentityInterceptor - Denying call gateway_protocol.Gateway/ActivateJobs as the authorized tenants could not be retrieved from Identity. Error message: Thread limit exceeded replacing blocked worker

Since we're clearly blocked on I/O, I think we could likely compensate and spawn more threads (of course, eventually too many threads is an issue of its own). But this leads us to two options:

  1. Can Identity be configured to doing things async and not block? This would avoid blocking threads, so we can keep the parallelism up without spawning tons of threads. But this leads to more "complex" code, especially for other places where things don't need to be async.
  2. We can have a higher maximum number of threads. For example, the default common ForkJoinPool allows 256 + Math.max(1, Runtime.getRuntime().availableProcessors() - 1). The downside here is, I honestly can't tell why 256. The actual bound for the max number of threads is 0x7fff, any other value given will be clamped down to this. From their docs:

The default value for common pool maxSpares. Overridable using the "java.util.concurrent.ForkJoinPool.common.maximumSpares" system property. The default value is far in excess of normal requirements, but also far short of MAX_CAP and typical OS thread limits, so allows JVMs to catch misuse/abuse before running out of resources needed to do so.

I would lean towards option 2, but there's still the downside that if users configure a small number of threads, and then will run into this error. But we could start with that - change the default of the max number of gRPC threads to be 256 + Math.max(1, Runtime.getRuntime().availableProcessors() - 1), and still let users completely override it if they want.

@npepinpe npepinpe added kind/toil Categorizes an issue or PR as general maintenance, i.e. cleanup, refactoring, etc. component/zeebe Related to the Zeebe component/team labels May 22, 2024
@npepinpe
Copy link
Member Author

ZDP:

  • Increase default to 256.
  • Switch the executor to ThreadPerTaskExecutor (something like that, maybe not the exact name) with virtual threads.
  • Ratio of effort to impact is high (low effort), so I would say let's do it soon.
  • Timebox 1-2h to try the virtual threads approach with gRPC.

I will ping the Identity team about having asynchronous alternatives for their SDK.

@npepinpe
Copy link
Member Author

Update on Identity, they're open to it but don't see a big need, and we may want to wait for the mono repo merge before looking into it. Though if we move to virtual thread, there is no real pressure to do any async things here.

github-merge-queue bot pushed a commit that referenced this issue Jun 3, 2024
## Description

## Related issues

closes #18697
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
component/zeebe Related to the Zeebe component/team kind/toil Categorizes an issue or PR as general maintenance, i.e. cleanup, refactoring, etc.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants