Feature Proposal: 'preload' support for Bulkhead, 'check-only' support for CircuitBreaker #2139
Comments
Hi, thank you for your proposal. You could implement a You could try to implement a new
Aggregated State Management
We would need to design a mechanism to aggregate the states of individual circuit breakers and bulkheads within their respective groups. This means we would need to define rules for determining the aggregated state based on the states of the composed instances (e.g., open if any circuit breaker is open, full if any bulkhead is full).
We would also have to implement metrics exporters to expose the aggregated state information for consumption by external monitoring tools or dashboards.
Hello @RobWin, thank you for your response. I was thinking it would be nice to keep all circuitbreaker things in the
From my (limited) point of view, I don't think defining explicit "groups" of circuit breakers and bulkheads is the right way, because that won't scale. In the BFF scenario, an application communicating with N external systems will potentially have up to 2^N such logical groups, one for each combination of external systems needed by particular entry points.
Also note that, for an entry point, no atomicity is required. It should simply check the relevant circuit breakers one by one in no particular order, then allocate the relevant bulkhead entries one by one in no particular order, and if something fails along the way, release any allocated resource and fail without invoking the decorated method. All that is needed is that we don't start executing code before all checks and allocations have been successfully performed.
No matter what the API is, what is new is the need for a way to (1) query the circuit breakers without making them change state (whether the call succeeds or not) and (2) reserve a number of tickets off a bulkhead queue. A bulkhead registry therefore needs an additional method to do that allocation. It could default to throwing an
Regarding implementation I see two general directions: one is to use a
Or we could maybe avoid the cumbersome try-with-resources with a concept of "permit" that are automatically revoked when leaving the function:
Things are more transparent and easy-to-use if using a
All that could probably be done with our own out-of-tree interceptor and annotations, supported by custom bulkhead and circuit-breaker registries exposing the required extra API. But I think having a well-integrated "anticipated" functionality would benefit many people with the same requirement as us. What do you think of all this? Thanks again!
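The check-then-reserve sequence described in this comment can be sketched roughly as follows. This is only an illustration under stated assumptions, not Resilience4j code: `Preflight`, `RejectedException` and `call` are hypothetical names, each circuit breaker is modelled as a read-only `BooleanSupplier` that reports "open", and each bulkhead is modelled as a plain `java.util.concurrent.Semaphore`. With the real library, the read-only check would presumably be built on `CircuitBreaker.getState()`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Semaphore;
import java.util.function.BooleanSupplier;
import java.util.function.Supplier;

/** Hypothetical helper (not part of Resilience4j): run a body only after all checks and reservations pass. */
final class Preflight {

    /** Thrown when a check or a reservation fails before the body is started. */
    static final class RejectedException extends RuntimeException {
        RejectedException(String message) {
            super(message);
        }
    }

    static <T> T call(List<BooleanSupplier> breakerIsOpen,
                      List<Semaphore> bulkheads,
                      Supplier<T> body) {
        // Step 1: read-only circuit breaker checks, one by one, no atomicity needed.
        for (BooleanSupplier isOpen : breakerIsOpen) {
            if (isOpen.getAsBoolean()) {
                throw new RejectedException("circuit breaker open");
            }
        }
        // Step 2: reserve one slot per bulkhead, remembering what was acquired.
        List<Semaphore> acquired = new ArrayList<>();
        try {
            for (Semaphore bulkhead : bulkheads) {
                if (!bulkhead.tryAcquire()) {
                    throw new RejectedException("bulkhead full");
                }
                acquired.add(bulkhead);
            }
            // Step 3: only now is the decorated code executed.
            return body.get();
        } finally {
            // Release everything acquired so far, on success and on failure alike.
            for (Semaphore bulkhead : acquired) {
                bulkhead.release();
            }
        }
    }
}
```

The body's outcome is deliberately not reported back to any circuit breaker here, which is the "check-only" part; the downstream calls inside the body would still record their own results as they do today.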
Hi, would you like to propose a detailed solution and an implementation example as part of a PR?
Indeed, both bulkhead preloading and check-only circuitbreaker should be doable by decorating existing code. I will look into a PR, please give me some time.
Context
We are using Resilience4J (in particular the CircuitBreaker and Bulkhead components) in an application designed as a service mesh, with a BFF (back-end-for-front-end) component. One call to the BFF will typically make a number of calls to other services in parallel. Many of those calls are considered critical, i.e. if any of them fails, the BFF call itself will fail as well. Each of those downstream services is protected by Resilience4J annotations.
The problem is that if the CircuitBreaker is open for one of those services, or if some of the bulkhead pools are nearly saturated, the BFF method will still send calls to all the services that are up, may fully saturate those bulkhead pools, and then fail, causing unnecessary work in downstream services. This is particularly problematic with Bulkhead when the BFF method makes several calls with the same Bulkhead pool: when the system is under stress, all or nearly all calls fail with a bulkhead-full exception, but the bulkhead is constantly kept full (each incoming call fills up the pool to saturation and then fails).
Our proposal
If we know a BFF method makes five parallel calls to a downstream service, we decorate the BFF service with @Bulkhead(preload=5). That will reserve five slots on the bulkhead and then (if the allocation was successful) execute the service. When the service calls downstream services (which are also @Bulkhead-annotated, but without preload), they will use the previously reserved slots. (If the service ends up consuming more slots than were preloaded, the extra ones will be taken directly from the pool as today; if a service does not use all reserved slots, they will be released when the service returns.)
If we know a service calls two downstream services, each with its own circuit breaker (A and B), we'll annotate the service with @CircuitBreaker(checkOnly={A, B}) (example syntax, to be discussed). It will block the call if either A or B is open, and let it go through if they are both closed or half-open. "Check-only" means that the decoration/annotation has no further impact once the call is allowed to go through (if the call fails it will not cause the circuit breaker to open, and conversely if it succeeds it will not contribute to success statistics, transition half-open to closed or open, etc.).
With that mechanism we basically stop the BFF method from calling any downstream service if we know in advance that the call is going to fail anyway, thus giving downstream services a chance to recover and reducing the amount of unnecessary work in times of stress. (Then at least some of the incoming queries are going to succeed, instead of 0% in such situations…) In the case of circuit breakers, if a service needs A and B to function and A is down, we avoid flooding B with unnecessary requests.
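To make the proposed preload semantics concrete, here is a minimal sketch built on a plain `Semaphore`. Everything in it is an assumption made for illustration: `PreloadBulkhead`, `Reservation`, `preload` and `run` are hypothetical names and not existing Resilience4j API, and the reservation is passed around explicitly, whereas the proposal would have annotated downstream calls pick it up implicitly.

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

/** Hypothetical sketch of a bulkhead with slot preloading (not Resilience4j API). */
final class PreloadBulkhead {
    private final Semaphore pool;

    PreloadBulkhead(int maxConcurrentCalls) {
        this.pool = new Semaphore(maxConcurrentCalls);
    }

    int available() {
        return pool.availablePermits();
    }

    /** Reserves n slots in one step, or fails without taking any. */
    Reservation preload(int n) {
        if (!pool.tryAcquire(n)) {
            throw new IllegalStateException("bulkhead full");
        }
        return new Reservation(n);
    }

    /** Slots held on behalf of one BFF call; closing it frees the unused ones. */
    final class Reservation implements AutoCloseable {
        private final AtomicInteger unused;

        private Reservation(int n) {
            this.unused = new AtomicInteger(n);
        }

        /** Runs one downstream call on a reserved slot, or on a pool slot if none is left. */
        <T> T run(Supplier<T> call) {
            if (!takeReserved() && !pool.tryAcquire()) {
                throw new IllegalStateException("bulkhead full");
            }
            try {
                return call.get();
            } finally {
                pool.release(); // a used slot goes back to the pool when the call ends
            }
        }

        /** Atomically decrements the unused count if it is still positive. */
        private boolean takeReserved() {
            while (true) {
                int current = unused.get();
                if (current == 0) {
                    return false;
                }
                if (unused.compareAndSet(current, current - 1)) {
                    return true;
                }
            }
        }

        @Override
        public void close() {
            pool.release(unused.getAndSet(0)); // free slots that were never used
        }
    }
}
```

A BFF method would wrap its work in `try (PreloadBulkhead.Reservation r = bulkhead.preload(5)) { ... }` and route each downstream call through `r.run(...)`. If the initial `preload` fails, no downstream call is started and the pool is left untouched.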
What do you think? If you are OK with the general idea I can propose an API, precise semantics/behaviour, and then do a PR.