Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature Proposal: 'preload' support for Bulkhead, 'check-only' support for CircuitBreaker #2139

Open
elca-maxime-gamboni opened this issue Mar 14, 2024 · 4 comments

Comments

@elca-maxime-gamboni
Copy link

elca-maxime-gamboni commented Mar 14, 2024

Context

We are using Resilience4J (in particular the CircuitBreaker and Bulkhead components) in an application designed as a service-mesh, with a BFF (back-end-for-front-end) component. One call to the BFF will typically do a number of calls to other services in parallel. Many of those calls are considered critical, i.e. if any of them fails, the BFF call itself will fail as well. Each of those downstream services are protected by Resilience4J annotations.

The problem is that if the CircuitBreaker is open for one of those services, or if some of the bulkhead pools are nearly saturated, the BFF method is going to send all calls to services which are up, may fully saturate those bulkhead pools and then fail, causing unnecessary work in downstream services. This is particularly problematic with Bulkhead when the BFF method makes several calls with the same Bulkhead pool: when the system is under stress, all or nearly all calls fail with bulkhead-full exception, but the bulkhead is constantly kept full (each incoming call fills up the pool to saturation and then fails).

Our proposal

  1. add a "preload" mechanism for bulkhead:
    if we know a BFF method makes five parallel calls to a downstream service, we decorate the BFF service with a @Bulkhead(preload=5). That will reserve five slots on the bulkhead and then (if the allocation was successful) execute the service. When the service calls downstream services (which are also @Bulkhead-annotated, but without preload), they will use the previously reserved slots. (If the service ends up consuming more than preloaded those will be taken directly from the pool as today; if a service does not use all reserved slots, they will be released when the service returns).
  2. add a "check-only" variant of circuitBreaker:
    if we know a service calls two downstream services each with their own circuit breakers A and B, we'll annotate the service with @CircuitBreaker(checkOnly={A, B}) (example syntax, to be discussed). It will block the call if either A or B is open, and let it go through if they are both closed or half-open. "Check-only" means that the decoration/annotation has no further impact once the call is allowed to go through (if the call fails it will not cause the circuit breaker to open, and inversely if it succeeds it will not contribute to success statistics, transition half-open to closed or open, etc.).

With that mechanism we basically stop the BFF method from calling any downstream service if we know in advance that the call is going to fail anyway, thus letting downstream services a chance to recover, and reduce the amount of unnecessary work in times of stress. (Then at least some of the incoming queries are going to succeed instead of 0% in such situations…). In the case of circuit breakers, if a service needs A and B to function and A is down, we avoid flooding B with unnecessary requests.

What do you think? If you are OK with the general idea I can propose an API, precise semantics/behaviour, and then do a PR.

@RobWin
Copy link
Member

RobWin commented Mar 17, 2024

Hi,

thank you for your proposal.
I would like to avoid to touch the existing Annotations and Aspects.

You could implement a @CircuitBreakerGroup annotation to allow grouping multiple circuit breakers under one logical unit. You could develop an aspect that intercepts method invocations annotated with @CircuitBreakerGroup and synchronizes the state checks of all associated circuit breakers. You would need to ensure that the aspect efficiently manages concurrent access to circuit breaker states and synchronizes circuit breaker transitions if necessary.

You could try to implement a new @BulkheadGroup Annotation which combines multiple Bulkheads.
The problem is that there is currently no way to reserve a "ticket" for the queue nor ask the Bulkhead if it's full in a synchronized way.

Aggregated State Management

We would need to design a mechanism to aggregate the states of individual circuit breakers and bulkheads within their respective groups. Means we would need to define rules for determining the aggregated state based on the states of the composed instances (e.g., open if any circuit breaker is open, full if any bulkhead is full). We would also have to implement metrics exporters to expose the aggregated state information for consumption by external monitoring tools or dashboards.

@elca-maxime-gamboni
Copy link
Author

Hello @RobWin , thank you for your response.

I was thinking it would be nice to keep all circuitbreaker things in the @CircuitBreaker annotation, and all bulkhead things in the @Bulkhead annotation but it's also fine with me to keep this functionality in distinct annotations. Maybe emphasise the "ahead-of-time" aspect with an annotation like @Anticipate(checkCircuitBreakers = { "A", "B" }, reserveBulkheads = { @ReserveBulkhead(name = "A", amount = 4), @ReserveBulkhead(name = "B", amount = 2)}) or something like that?

From my (limited) point of view I don't think it is the right way to define explicit "groups" of circuit breakers and bulkhead because that won't scale. In the BFF scenario, an application communicating with N external systems will potentially have up to 2^N such logical groups, for each combination of external systems needed by particular entry points. Also note that, for an entry point, no atomicity is required. It should simply check the relevant circuit breakers one by one in no particular order, then allocate the relevant bulkhead entries one by one in no particular order, and if something fails along the way, release any allocated resource and fail without invoking the decorated method. What is needed is only that we don't start executing code before all checks and allocations have been successfully performed.

No matter what the API is, what is new is the need for a way to (1) query the circuit breakers without making them change state (whether the call succeeds or not) and (2) reserve a number of tickets off a bulkhead queue. A bulkhead registry therefore needs an additional method to do that allocation. It could default to throwing an UnsupportedOperationException so existing custom implementations are unaffected by the appearance of the new API method.

Regarding implementation I see two general directions: one is to use a ThreadLocal<Integer> that keeps track of how many bulkhead slots have been allocated but not yet consumed by the current thread. Another is to allow constructing a new single-use decorated proxy:

void myEntryPoint() {
  try (
    Function<Integer, String> preloadedFunction = Decorators.ofFunction(fun)
      .withBulkhead(Bulkhead.ofDefaults("A").preload(4))
      .get()) { // pre-allocation occurs here (or fails with bulkead full)

    // then, later:
    preloadedFunction.apply(1);
    preloadedFunction.apply(2);
    preloadedFunction.apply(3);
    preloadedFunction.apply(4);
  } // unused slots are freed when we leave the try-with-resources
}

Or we could maybe avoid the cumbersome try-with-resources with a concept of "permit" that are automatically revoked when leaving the function:

// allocation occurs in interceptor before calling the method
void myEntryPoint(@ReserveBulkhead(name = "A", amount = 4) Bulkhead bh, ...) {
  Function<Integer, String> preloadedFunction = Decorators.ofFunction(fun)
      .withBulkhead(bh) // the private preloadedFunction instance holds the permits
      .get();

  // then, later, possibly in other threads, but before myFunction returns:
  preloadedFunction.apply(1);
  preloadedFunction.apply(2);
  preloadedFunction.apply(3);
  preloadedFunction.apply(4);
} // Permits held by bh are revoked by the interceptor when the method terminates

Things are more transparent and easy-to-use if using a ThreadLocal but may fail to use allocated entries if the calls are done from another thread. On the other hand, using an AutoCloseable creates the risk that we forget to release a permit in some case.

All that could probably be done with our own out-of-tree interceptor and annotations, supported by custom bulkhead and circuit-breaker registries exposing required extra API. But I think having a well-integrated "anticipated" functionality would benefit many people with the same requirement as us. What do you think of all this?

Thanks again!

@RobWin
Copy link
Member

RobWin commented Mar 20, 2024

Hi,
I would prefer to use a decorator approach where you can stack functionality on top of each other. Introducing new decorators to enhance existing components such as CircuitBreakers or Bulkheads aligns well with our design philosophy. It should also make it easier to test the functionality in isolation.

Would you like to propose a detailed solution and an implementation example as part of a PR?

@elca-maxime-gamboni
Copy link
Author

Indeed, both bulkhead preloading and check-only circuitbreaker should be doable by decorating existing code. I will look into a PR, please give me some time.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants