
Connection backoff with limited retries hangs indefinitely #1586

Open
kdrag0n opened this issue Apr 11, 2023 · 3 comments

Comments

@kdrag0n

kdrag0n commented Apr 11, 2023

Describe the bug

For local use, I'm trying to configure the Swift gRPC client's connection backoff so that it:

  • Retries quickly with no jitter or exponential backoff
  • Gives up and fails the RPC after a short timeout (e.g. 10 sec)

The issue is that once ConnectionBackoffIterator reaches the retry limit, the pool never schedules another connection attempt, so all future RPCs fail after maxWaitTime. Effectively, once there has been a connection failure, the GRPCChannelPool (or ClientConnection) becomes unusable. The connection error also isn't propagated until maxWaitTime is reached, even if retries are exhausted before that.

RPCs fail with this error:

deadlineExceeded(connectionError: Optional(unavailable (14): Unexpected connection drop))
deadline exceeded (4): Timed out waiting for an HTTP/2 stream from the connection pool, cause: unavailable (14): Unexpected connection drop

Allowing unlimited retries fixes the issue of the pool becoming unusable, but that's not a viable solution because the pool doesn't seem to stop retrying even after all RPCs have timed out. The server could be stopped and may not be started again for hours or days, and my quick-retry requirement means that the app will be wasting a lot of CPU trying to connect in the background.

Some RPCs could take a very long time when there are no connection issues, so I can't set a short timeLimit on CallOptions.
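
For reference, a per-call limit like the sketch below is what I mean; as far as I can tell it bounds the entire RPC (via CallOptions(timeLimit:)), not just connection establishment, which is why it doesn't fit here.

import GRPC
import NIOCore

// Sketch only: a 10-second limit on the whole call. This would cap the wait
// for a connection, but it would also cancel legitimately long-running RPCs.
let shortLimit = CallOptions(timeLimit: .timeout(.seconds(10)))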

I wouldn't mind doing the retry logic myself (even if only temporarily) if the gRPC behavior won't work for my needs, but the same problem occurs with ClientConnection and connectionBackoff = nil, and GRPCChannelPool doesn't seem to support bypassing the backoff logic entirely.
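
To illustrate, this is roughly the wrapper I'd write myself (a rough sketch under the assumption that failed calls surface as GRPCStatus; the function name and shape are mine, not an existing API):

import Foundation
import GRPC

// Rough sketch (not an existing API): retry an RPC quickly on "unavailable"
// until an overall deadline passes, then give up and surface the error.
// Assumes failed calls are thrown as GRPCStatus; other errors propagate as-is.
func withQuickRetries<Response>(
    deadline: TimeInterval = 10,
    retryDelay: TimeInterval = 0.1,
    _ rpc: () async throws -> Response
) async throws -> Response {
    let start = Date()
    while true {
        do {
            return try await rpc()
        } catch let status as GRPCStatus where status.code == .unavailable {
            guard Date().timeIntervalSince(start) < deadline else { throw status }
            try await Task.sleep(nanoseconds: UInt64(retryDelay * 1_000_000_000))
        }
    }
}

I'd then wrap each call site, e.g. try await withQuickRetries { try await client.someCall(request) } (client and method names hypothetical).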

grpc-swift version: 1.14.2
Debug build
Platform: macOS 13

To reproduce

GRPCChannelPool.with(...) {
    $0.connectionPool = GRPCChannelPool.Configuration.ConnectionPool.with { pool in
        pool.maxWaitTime = .seconds(5)
    }
    $0.connectionBackoff = ConnectionBackoff(
        initialBackoff: 0.1,
        maximumBackoff: 0.25,
        multiplier: 1,
        jitter: 0,
        //minimumConnectionTimeout: 0.1,
        retries: .upTo(50)
    )
}

Or with the old ClientConnection API:

var config = ClientConnection.Configuration.default(...)
config.connectionBackoff = ConnectionBackoff(
    initialBackoff: 0.1,
    maximumBackoff: 0.25,
    multiplier: 1,
    jitter: 0,
    //minimumConnectionTimeout: 0.1,
    retries: .upTo(50)
)
config.callStartBehavior = .waitsForConnectivity
return ClientConnection(configuration: config)

I've tried various combinations of retries, minimumConnectionTimeout, maxWaitTime, and callStartBehavior to no avail.
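
For what it's worth, here's my mental model of how these parameters map to a delay sequence, following the gRPC connection-backoff algorithm (a rough sketch, not the library's implementation). With multiplier: 1 and jitter: 0 as configured above, every retry waits a flat 0.1 s, so 50 retries are exhausted after roughly 5 s:

import Foundation

// Rough sketch of the gRPC connection-backoff algorithm (not grpc-swift's
// code): attempt n waits min(initialBackoff * multiplier^n, maximumBackoff),
// randomized by ±jitter.
func backoffDelays(initialBackoff: Double, maximumBackoff: Double,
                   multiplier: Double, jitter: Double, retries: Int) -> [Double] {
    (0..<retries).map { attempt in
        let base = min(initialBackoff * pow(multiplier, Double(attempt)), maximumBackoff)
        return base + base * Double.random(in: -jitter...jitter)
    }
}

// With the values from the repro config:
// backoffDelays(initialBackoff: 0.1, maximumBackoff: 0.25, multiplier: 1,
//               jitter: 0, retries: 50)  // 50 delays of 0.1 s ≈ 5 s total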

Expected behaviour

Once the retry limit is exhausted, I'd expect waiting RPCs to fail promptly with the underlying connection error (rather than only after maxWaitTime), and I'd expect subsequent RPCs to trigger a fresh connection attempt instead of the pool remaining unusable indefinitely.

Additional information

Logs with retry limit = 1:

2023-04-11T12:34:07-0700 debug grpc : poolmanager.id=ObjectIdentifier(0x0000600002dc8480) poolmanager.pools.conns_per_pool=1 poolmanager.pools.count=1 poolmanager.pools.waiters_per_pool=100 [GRPC] initializing connection pool manager
2023-04-11 12:34:07.806 OrbStack[2862:19592931] start rpc
2023-04-11T12:34:07-0700 debug grpc call opt : grpc_request_id=35EA3191-EF0A-4010-841B-91B173B6692A pool.waiter.id=ObjectIdentifier(0x0000600001bb1fe0) pool.waiters.count=0 [GRPC] waiting for a connection to become available
2023-04-11T12:34:07-0700 trace grpc : pool.waiters.count=1 poolmanager.id=ObjectIdentifier(0x0000600002dc8480) [GRPC] enqueued connection waiter
2023-04-11T12:34:07-0700 debug grpc : pool.reservations.capacity=0 pool.reservations.count=1 pool.reservations.load=inf pool.reservations.loadThreshold=0.9 poolmanager.id=ObjectIdentifier(0x0000600002dc8480) [GRPC] stream reservation load factor greater than or equal to threshold, bringing up additional connection if available
2023-04-11T12:34:07-0700 trace grpc call opt : call_state=awaitingTransport (0 parts buffered) grpc_request_id=35EA3191-EF0A-4010-841B-91B173B6692A request_part=metadata [GRPC] buffering request part
2023-04-11T12:34:07-0700 trace grpc call opt : call_state=awaitingTransport (1 parts buffered) grpc_request_id=35EA3191-EF0A-4010-841B-91B173B6692A request_part=message [GRPC] buffering request part
2023-04-11T12:34:07-0700 trace grpc call opt : call_state=awaitingTransport (2 parts buffered) grpc_request_id=35EA3191-EF0A-4010-841B-91B173B6692A request_part=end [GRPC] buffering request part
2023-04-11T12:34:07-0700 debug grpc call opt : grpc_request_id=EF7FFE9A-D02D-40E3-82A4-33A50C034E9B pool.waiter.id=ObjectIdentifier(0x0000600001bb1db0) pool.waiters.count=1 [GRPC] waiting for a connection to become available
2023-04-11T12:34:07-0700 trace grpc : pool.waiters.count=2 poolmanager.id=ObjectIdentifier(0x0000600002dc8480) [GRPC] enqueued connection waiter
2023-04-11T12:34:07-0700 trace grpc call opt : call_state=awaitingTransport (0 parts buffered) grpc_request_id=EF7FFE9A-D02D-40E3-82A4-33A50C034E9B request_part=metadata [GRPC] buffering request part
2023-04-11T12:34:07-0700 trace grpc call opt : call_state=awaitingTransport (1 parts buffered) grpc_request_id=EF7FFE9A-D02D-40E3-82A4-33A50C034E9B request_part=message [GRPC] buffering request part
2023-04-11T12:34:07-0700 trace grpc call opt : call_state=awaitingTransport (2 parts buffered) grpc_request_id=EF7FFE9A-D02D-40E3-82A4-33A50C034E9B request_part=end [GRPC] buffering request part
2023-04-11T12:34:07-0700 debug grpc : grpc_connection_id=A72D788A-D053-4D99-B714-58835717CF64/0 poolmanager.id=ObjectIdentifier(0x0000600002dc8480) [GRPC] making client bootstrap with event loop group of type SelectableEventLoop
2023-04-11T12:34:07-0700 debug grpc : grpc_connection_id=A72D788A-D053-4D99-B714-58835717CF64/0 poolmanager.id=ObjectIdentifier(0x0000600002dc8480) [GRPC] Network.framework is available but the EventLoopGroup is not compatible with NIOTS, falling back to ClientBootstrap
2023-04-11T12:34:07-0700 debug grpc : grpc_connection_id=A72D788A-D053-4D99-B714-58835717CF64/0 poolmanager.id=ObjectIdentifier(0x0000600002dc8480) [GRPC] creating a ClientBootstrap
2023-04-11T12:34:07-0700 debug grpc : connectivity_state=connecting grpc_connection_id=A72D788A-D053-4D99-B714-58835717CF64/0 poolmanager.id=ObjectIdentifier(0x0000600002dc8480) [GRPC] activating connection
2023-04-11T12:34:08-0700 debug grpc : connectivity_state=active grpc_connection_id=A72D788A-D053-4D99-B714-58835717CF64/0 poolmanager.id=ObjectIdentifier(0x0000600002dc8480) [GRPC] deactivating connection
2023-04-11T12:34:08-0700 debug grpc : delay_secs=0.1 grpc_connection_id=A72D788A-D053-4D99-B714-58835717CF64/0 poolmanager.id=ObjectIdentifier(0x0000600002dc8480) [GRPC] scheduling connection attempt
2023-04-11T12:34:08-0700 debug grpc : grpc_connection_id=A72D788A-D053-4D99-B714-58835717CF64/1 poolmanager.id=ObjectIdentifier(0x0000600002dc8480) [GRPC] making client bootstrap with event loop group of type SelectableEventLoop
2023-04-11T12:34:08-0700 debug grpc : grpc_connection_id=A72D788A-D053-4D99-B714-58835717CF64/1 poolmanager.id=ObjectIdentifier(0x0000600002dc8480) [GRPC] Network.framework is available but the EventLoopGroup is not compatible with NIOTS, falling back to ClientBootstrap
2023-04-11T12:34:08-0700 debug grpc : grpc_connection_id=A72D788A-D053-4D99-B714-58835717CF64/1 poolmanager.id=ObjectIdentifier(0x0000600002dc8480) [GRPC] creating a ClientBootstrap
2023-04-11T12:34:08-0700 debug grpc : connectivity_state=connecting grpc_connection_id=A72D788A-D053-4D99-B714-58835717CF64/1 poolmanager.id=ObjectIdentifier(0x0000600002dc8480) [GRPC] activating connection
2023-04-11T12:34:08-0700 debug grpc : connectivity_state=active grpc_connection_id=A72D788A-D053-4D99-B714-58835717CF64/1 poolmanager.id=ObjectIdentifier(0x0000600002dc8480) [GRPC] deactivating connection
2023-04-11T12:34:08-0700 debug grpc : grpc_connection_id=A72D788A-D053-4D99-B714-58835717CF64/1 poolmanager.id=ObjectIdentifier(0x0000600002dc8480) [GRPC] shutting down connection




2023-04-11 12:34:12.968 OrbStack[2862:19592931] RPC returned error: deadlineExceeded(connectionError: Optional(unavailable (14): Unexpected connection drop))
2023-04-11 12:34:12.968 OrbStack[2862:19592931] RPC returned error: deadlineExceeded(connectionError: Optional(unavailable (14): Unexpected connection drop))
2023-04-11T12:34:12-0700 trace grpc call opt : call_state=closing grpc_request_id=35EA3191-EF0A-4010-841B-91B173B6692A [GRPC] failing buffered writes
2023-04-11T12:34:12-0700 trace grpc call opt : grpc_request_id=35EA3191-EF0A-4010-841B-91B173B6692A pool.waiter.id=ObjectIdentifier(0x0000600001bb1fe0) pool.waiters.count=1 [GRPC] timed out waiting for a connection
2023-04-11T12:34:12-0700 trace grpc call opt : call_state=closing grpc_request_id=EF7FFE9A-D02D-40E3-82A4-33A50C034E9B [GRPC] failing buffered writes
2023-04-11T12:34:12-0700 trace grpc call opt : grpc_request_id=EF7FFE9A-D02D-40E3-82A4-33A50C034E9B pool.waiter.id=ObjectIdentifier(0x0000600001bb1db0) pool.waiters.count=0 [GRPC] timed out waiting for a connection
2023-04-11T12:34:13-0700 debug grpc call opt : grpc_request_id=A0FD37C0-5B52-4796-B708-A1E0BF62B861 pool.waiter.id=ObjectIdentifier(0x0000600001b994f0) pool.waiters.count=0 [GRPC] waiting for a connection to become available
2023-04-11T12:34:13-0700 trace grpc : pool.waiters.count=1 poolmanager.id=ObjectIdentifier(0x0000600002dc8480) [GRPC] enqueued connection waiter
2023-04-11T12:34:13-0700 debug grpc call opt : grpc_request_id=C98F5343-E420-4DFF-ABB7-7FF4BFDF1EC5 pool.waiter.id=ObjectIdentifier(0x0000600001b99450) pool.waiters.count=1 [GRPC] waiting for a connection to become available
2023-04-11T12:34:13-0700 trace grpc : pool.waiters.count=2 poolmanager.id=ObjectIdentifier(0x0000600002dc8480) [GRPC] enqueued connection waiter
2023-04-11T12:34:13-0700 trace grpc call opt : call_state=awaitingTransport (0 parts buffered) grpc_request_id=C98F5343-E420-4DFF-ABB7-7FF4BFDF1EC5 request_part=metadata [GRPC] buffering request part
2023-04-11T12:34:13-0700 trace grpc call opt : call_state=awaitingTransport (1 parts buffered) grpc_request_id=C98F5343-E420-4DFF-ABB7-7FF4BFDF1EC5 request_part=message [GRPC] buffering request part
2023-04-11T12:34:13-0700 trace grpc call opt : call_state=awaitingTransport (2 parts buffered) grpc_request_id=C98F5343-E420-4DFF-ABB7-7FF4BFDF1EC5 request_part=end [GRPC] buffering request part
2023-04-11T12:34:13-0700 trace grpc call opt : call_state=awaitingTransport (0 parts buffered) grpc_request_id=A0FD37C0-5B52-4796-B708-A1E0BF62B861 request_part=metadata [GRPC] buffering request part
2023-04-11T12:34:13-0700 trace grpc call opt : call_state=awaitingTransport (1 parts buffered) grpc_request_id=A0FD37C0-5B52-4796-B708-A1E0BF62B861 request_part=message [GRPC] buffering request part
2023-04-11T12:34:13-0700 trace grpc call opt : call_state=awaitingTransport (2 parts buffered) grpc_request_id=A0FD37C0-5B52-4796-B708-A1E0BF62B861 request_part=end [GRPC] buffering request part
@kdrag0n kdrag0n added the bug label Apr 11, 2023
@glbrntt
Collaborator

glbrntt commented Apr 13, 2023

It sounds like the right solution here would be a mode on the connection pool to allow it to avoid eagerly reconnecting if there are no outstanding requests. That would allow you to configure a short backoff (and short wait time) without imposing a retry limit. Would that work for your use case?

@glbrntt glbrntt added enhancement and removed bug labels Apr 13, 2023
@kdrag0n
Author

kdrag0n commented Apr 13, 2023

Yes, that should work!

@kdrag0n
Author

kdrag0n commented Apr 23, 2023

Thinking about this more, the current behavior seems potentially problematic even in the common case of iOS apps that use gRPC with changing network conditions. Consider the following scenario:

  • App is being used in the foreground
  • Network connection is lost
  • User leaves the app to idle or puts it in the background
  • Network connection is restored

When the user comes back (minutes to hours later), wouldn't they have to wait up to 2 minutes (default max backoff) for the app to be usable again because of the backoff logic?
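
The only partial mitigation I can see in the meantime is keeping retries unlimited but capping maximumBackoff well below the 120 s default, sketched below; that just trades the long wait for more frequent reconnect attempts while offline, which is the background-CPU cost I mentioned earlier.

import GRPC

// Sketch of a partial workaround: unlimited retries with a much lower backoff
// ceiling, so a restored network is picked up within ~10 s instead of up to
// 2 minutes. The cost is more frequent connection attempts while the server
// stays unreachable.
let cappedBackoff = ConnectionBackoff(
    initialBackoff: 1,
    maximumBackoff: 10,   // instead of the 120 s default
    multiplier: 1.6,
    jitter: 0.2,
    retries: .unlimited
)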
