Gracefully shutting down a TLS server sometimes leads to the client not receiving a response #3792
Have you been able to trace what is happening? One thing this sort of sounds like is that the TLS stream perhaps hasn't flushed the response before closing?
But do you know for sure all the requests have indeed started? Or could the shutdown be triggered just before hyper has been able to see the request bytes?
Seems to be the problem. Most of this comment is how I arrived at that; skip to the end for some potential solutions.

Did a bit of debugging. Hmm, so if I sleep for a millisecond before triggering the shutdown, the test passes.

Inlining the snippet code here for convenience:

```rust
loop {
    tokio::select! {
        _ = &mut shutdown_receiver => {
            // ADDING THIS SLEEP MAKES THE TEST PASS
            tokio::time::sleep(std::time::Duration::from_millis(1)).await;
            drop(shut_down_connections_rx);
            break;
        }
        conn = tcp_listener.accept() => {
            tokio::spawn(
                handle_tcp_conn(
                    conn,
                    wait_for_request_to_complete_rx.clone(),
                    shut_down_connections_tx.clone(),
                    tls_config
                )
            );
        }
    }
}
```

The second snippet is the connection-handling code. Inlining the snippet here for convenience:

```rust
tokio::select! {
    result = conn.as_mut() => {
        if let Err(err) = result {
            dbg!(err);
        }
    }
    _ = should_shut_down_connection => {
        // TEST STILL FAILS IF WE SLEEP RIGHT HERE
        conn.as_mut().graceful_shutdown();
        let result = conn.as_mut().await;
        if let Err(err) = result {
            dbg!(err);
        }
    }
};
```

## Key Points

The test passes if we sleep for a millisecond before sending on the channel that leads to `graceful_shutdown` being called.

If I instead move the sleep to just before the `graceful_shutdown` call, the test still fails.

This suggests that the problem occurs when we call `graceful_shutdown` before hyper has read any of the request bytes from the connection.

It looks like `graceful_shutdown` immediately closes a connection that hyper still considers idle, even if the client's request bytes are about to arrive.
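To make the race concrete, here is a rough sketch of the failing scenario over plain TCP. The `server_addr` and `trigger_graceful_shutdown` helpers are hypothetical stand-ins for the reproduction repository's test harness, not real APIs:

```rust
use tokio::io::{AsyncReadExt, AsyncWriteExt};

// Sketch only: `server_addr()` and `trigger_graceful_shutdown()` are hypothetical
// stand-ins for the reproduction repository's setup.
async fn request_arrives_just_after_shutdown() {
    let mut client = tokio::net::TcpStream::connect(server_addr()).await.unwrap();

    // The server begins its graceful shutdown while this connection is still
    // "idle" from hyper's point of view (no request bytes read yet).
    trigger_graceful_shutdown();

    // The request bytes arrive a moment later...
    let _ = client
        .write_all(b"GET / HTTP/1.1\r\nHost: localhost\r\n\r\n")
        .await;

    // ...but the connection has already been closed, so instead of an HTTP
    // response the client sees EOF (or a reset), which a real HTTP client
    // surfaces as an error such as `IncompleteMessage`.
    let mut response = Vec::new();
    let bytes_read = client.read_to_end(&mut response).await.unwrap_or(0);
    assert_eq!(bytes_read, 0);
}
```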
Yeah seems like this is the problem.

## Problem

Currently, if a user opens a TCP connection to a server and the server calls `graceful_shutdown` before hyper has read any bytes from that connection, hyper immediately closes the connection.

This means that if the client has just begun transmitting packets, but the server has not received them, the client will get an error.

## Potential Solutions

### Potential Solution - change how `disable_keep_alive` works
```rust
#[cfg(feature = "server")]
pub(crate) fn disable_keep_alive(&mut self) {
    if self.state.is_idle() {
        trace!("disable_keep_alive; closing idle connection");
        self.state.close();
    } else {
        trace!("disable_keep_alive; in-progress connection");
        self.state.disable_keep_alive();
    }
}
```
If a user could disable keepalive themselves then they could (see the sketch after this list):

- disable keepalive
- poll the connection for up to N seconds (e.g. by using `tokio::time::timeout`)
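A rough sketch of what that could look like from the server author's side, assuming a hypothetical public `disable_keep_alive` method on hyper's HTTP/1 `Connection` (today it is `pub(crate)`), and reusing the `conn` / `should_shut_down_connection` names from the snippet above:

```rust
tokio::select! {
    result = conn.as_mut() => {
        if let Err(err) = result {
            dbg!(err);
        }
    }
    _ = should_shut_down_connection => {
        // Hypothetical public API: stop accepting new requests on this
        // connection without closing it yet.
        conn.as_mut().disable_keep_alive();

        // Poll the connection for a few more seconds so a request that is
        // already on the wire can still be read and handled.
        match tokio::time::timeout(std::time::Duration::from_secs(2), conn.as_mut()).await {
            // The connection finished (keep-alive is disabled, so it ends
            // after at most one more request).
            Ok(result) => {
                if let Err(err) = result {
                    dbg!(err);
                }
            }
            // No request showed up in time: now shut down for real.
            Err(_timed_out) => {
                conn.as_mut().graceful_shutdown();
                let _ = conn.as_mut().await;
            }
        }
    }
};
```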
### Potential Solution - ...

I'll add more if I think of any...
I'm coming back to this having caught up on some sleep and after some further discussion with @chinedufn. My main concerns were that the benefit was negligible and the only cases where it mattered were artificial. I disagree with these two conclusions now. Here's my current take on why I think this should be fixed:
Therefore, I think this issue is worth solving. Sorry about all the noise I've added to the issue; I've hidden the previous comments as "resolved", for posterity.
I do agree with some of the points brought up:
It seems likely that in this case, the time it takes to write/encrypt/read/decrypt is just enough that the tiny timeout used can beat it.
Definitely agree that a timeout wouldn't solve 100% of cases. My thinking is that if a client opens a connection they're either:

1. an attacker (opening connections without any intention of sending a request)
2. a well-behaved client that was about to send its request bytes
3. a client that won't send its bytes for a long time (if ever)
I'm not concerned about making things easier for the attacker (1). I'd love to make things (slightly) easier for the non-malicious user who was about to give me bytes (2). I can save them a retry by handling the request. But, yes, for case (3), if it takes too long to see your bytes we'll need to close the connection. Can't wait forever.

So, case (2) is what I'm interested in handling. I am operating under the assumption (no data on this currently) that with a timeout of, say, 1-2 seconds, I could handle the overwhelming majority of "I was about to give you bytes but you closed the connection on me!" scenarios. This means that my users see fewer errors, at nearly no expense to my server (I'm already comfortable taking up to 30 seconds to handle any in-progress requests before closing all connections; this 1-2 seconds would just come from that 30 second budget.)

The only trade-off I can imagine is a slightly longer shutdown. I'd say that saving the extra retry request for the client is worth it if the cost to the server is negligible.

The use case that I imagine is a server that sees heavy, consistent traffic and is gracefully shut down often (say, every hour when the maintainers redeploy a new version of it).

Are you open to solving this "problem" (I recognize that one might argue that it isn't a problem and the client should just be expected to retry) in hyper? Do any of the potential solutions that I've left in my previous comment seem reasonable?

I'm happy to do the work required to land a solution to the "decrease the number of connections that get closed when we could have reasonably waited for a bit to get the data" problem (to repeat, yup, I know we can't wait forever. I just want to reduce errors that users experience and recognize that I cannot fully eliminate them).
Alright, I took a look at the graceful shutdown implementation.

To recap, the main problem is that calling `graceful_shutdown` on a connection that hyper still considers idle closes it immediately, even if the client's request bytes are about to arrive. This means that changing the graceful shutdown implementation will lead to a better experience for end clients, meaning that developers using hyper to write web servers will be able to build (slightly) more reliable systems.

Some background / related issues: on Jan 1 2022 an issue was opened about this behavior. At the time, calling it on an inactive connection behaved differently than it does today. In July 2023 the behavior was changed. Currently, calling `graceful_shutdown` on an inactive connection closes it immediately.

hyper should give server authors control over the grace period for graceful shutdowns so that they can decide what the best experience is for their end clients. Similar to how hyper currently allows the server author to control the header read timeout using `header_read_timeout`, the same level of control should exist for graceful shutdowns.

## Proposed Changes to Hyper

Here is how we can solve this graceful shutdown problem (clients unnecessarily getting errors, even though waiting a second or two before terminating an unwritten connection would have given nearly every connection a chance of being handled gracefully).

Currently, graceful shutdown closes idle connections immediately. Solution -> (see Lines 238 to 272 in c86a6bc)
So, something like a `GracefulShutdownConfig` passed to a new `graceful_shutdown_with_config` method. By default the behavior would be identical to today's `graceful_shutdown`. Hyper users can use `GracefulShutdownConfig::first_byte_read_timeout` to give an inactive connection a short grace period during which its first request can still be read and handled.

This allows us to make this change now, since we can fully preserve the current behavior. Then we can take as much time as we like to decide whether the default should be changed from the current immediate close.

To summarize the benefits of this solution:

- it is backwards compatible: the current behavior is fully preserved by default
- it gives server authors control over the grace period, so they can decide what the best experience is for their end clients
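For illustration only, here is roughly how the proposed API might be used. All of these names come from this proposal, not from hyper's current API, and `conn` is the HTTP/1 connection from the snippets earlier in this thread:

```rust
// Hypothetical usage of the proposed API: neither `GracefulShutdownConfig` nor
// `graceful_shutdown_with_config` exists in hyper today.
let mut config = GracefulShutdownConfig::default();

// Give an inactive connection up to 1 second for its first request bytes to
// arrive before the connection is closed.
config.first_byte_read_timeout(std::time::Duration::from_secs(1));

conn.as_mut().graceful_shutdown_with_config(config);

// Drive the connection to completion as usual.
if let Err(err) = conn.as_mut().await {
    dbg!(err);
}
```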
## Next Steps

I'm planning to work on implementing this this week. Planning to use some of the tests in Lines 1542 to 1561 in c86a6bc.
Please let me know if you see any issues with this proposed solution.
Looks like the HTTP/2 spec has support for preventing the problem that this issue is about.
via: https://datatracker.ietf.org/doc/html/rfc7540#section-6.8
So, it looks like only HTTP/1 connections experience this unnecessarily-dropping-requests-from-well-behaved-clients issue.
Note that since HTTP/1's disabling of keep-alive means that only one more request will be processed, we can't mimic this HTTP/2 behavior of allowing many not-yet-received requests to be handled. As in, if the client sent 10 requests that had not yet reached the server when the graceful shutdown began, at most one of them will be processed.

Current progress: I wrote some tests and got them passing. Still need to clean up the code. Planning to submit a PR this week.

The change to the public API is a new `Connection::graceful_shutdown_with_config` method along with a `GracefulShutdownConfig` type. The existing `graceful_shutdown` method's behavior is unchanged.
I've confirmed that HTTP/2 hyper servers do not suffer from this issue. I confirmed this by enabling HTTP/2 in the reproduction's test setup. When I use HTTP/2, the graceful shutdown test passes.
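For reference, a minimal sketch of the HTTP/2 server path, assuming the `hyper-util`, `http-body-util`, and `bytes` crates and a plain `TcpStream` standing in for the TLS stream used in the reproduction. hyper's HTTP/2 `Connection::graceful_shutdown` sends a `GOAWAY` frame rather than closing the socket outright, which is why already-sent requests still complete:

```rust
use bytes::Bytes;
use http_body_util::Full;
use hyper::body::Incoming;
use hyper::server::conn::http2;
use hyper::service::service_fn;
use hyper::{Request, Response};
use hyper_util::rt::{TokioExecutor, TokioIo};
use std::pin::pin;

// Serve a single HTTP/2 connection and shut it down gracefully when `shutdown` fires.
async fn serve_h2_conn(stream: tokio::net::TcpStream, shutdown: tokio::sync::oneshot::Receiver<()>) {
    let service = service_fn(|_req: Request<Incoming>| async {
        Ok::<_, std::convert::Infallible>(Response::new(Full::new(Bytes::from("hello"))))
    });

    let mut conn = pin!(http2::Builder::new(TokioExecutor::new())
        .serve_connection(TokioIo::new(stream), service));

    tokio::select! {
        result = conn.as_mut() => {
            if let Err(err) = result {
                dbg!(err);
            }
        }
        _ = shutdown => {
            // Sends GOAWAY: requests the client has already sent can still be
            // processed before the connection is closed.
            conn.as_mut().graceful_shutdown();
            let _ = conn.as_mut().await;
        }
    }
}
```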
This commit introduces a new `Connection::graceful_shutdown_with_config` method that gives users control over the HTTP/1 graceful shutdown process.

Before this commit, if a graceful shutdown was initiated on an inactive connection hyper would immediately close it. As of this commit the `GracefulShutdownConfig::first_byte_read_timeout` method can be used to give inactive connections a grace period where, should the server begin receiving bytes from them before the deadline, the request will be processed.

## HTTP/2 Graceful Shutdowns

This commit does not modify hyper's HTTP/2 graceful shutdown process. `hyper` already uses the HTTP/2 `GOAWAY` frame, meaning that `hyper` already gives inactive connections a brief period during which they can transmit their final requests.

Note that while this commit enables slightly more graceful shutdowns for HTTP/1 servers, HTTP/2 graceful shutdowns are still superior. HTTP/2's `GOAWAY` frame allows the server to finish processing a last batch of multiple incoming requests from the client, whereas the new graceful shutdown configuration in this commit only allows the server to wait for one final incoming request to be received.

This limitation stems from a limitation in HTTP/1, where there is nothing like the `GOAWAY` frame that can be used to coordinate the graceful shutdown process with the client in the face of multiple concurrent incoming requests. Instead, for HTTP/1 connections `hyper` gracefully shuts down by disabling Keep-Alive, at which point the server will only receive at most one new request, even if the client has multiple requests that are moments from reaching the server.

## Motivating Use Case

I'm working on a server that is being designed to handle a large amount of traffic from a large number of clients. It is expected that many clients will open many TCP connections with the server every second.

As a server receives more traffic it becomes increasingly likely that at the point that it begins gracefully shutting down there are connections that were just opened, but the client's bytes have not yet been seen by the server. Before this commit, calling `Connection::graceful_shutdown` on such a freshly opened HTTP/1 connection would immediately close it. This means that the client will get an error, despite the server having been perfectly capable of handling a final request before closing the connection.

This commit solves this problem for HTTP/1 clients that tend to send one request at a time. By setting a `GracefulShutdownConfig::first_byte_read_timeout` of, say, 1 second, the server will wait a moment to see if any of the client's bytes have been received. During this period, the client will have been informed that Keep-Alive is now disabled, meaning that at most one more request will be processed. Clients that have multiple in-flight requests that have not yet reached the server will have at most one of those requests handled, even if all of them reach the server before the `first_byte_read_timeout`. This is a limitation of HTTP/1.

## Work to do in other Crates

#### hyper-util

To expose this to users that use `hyper-util`, a method should be added to `hyper-util`'s `Connection` type. This new `hyper-util Connection::graceful_shutdown_with_config` method would expose a `http1_first_byte_read_timeout` method that would lead `hyper-util` to set `hyper GracefulShutdownConfig::first_byte_read_timeout`.

---

Closes hyperium#3792
I'm observing errors while testing graceful shutdown of a hyper server.
When gracefully shutting down a TLS connection the client will sometimes get an `IncompleteMessage` error. This only happens for TLS connections. The graceful shutdown process is always successful for non-TLS connections.
Given the following testing steps:
(Since the request handler has a random sleep between 5-10ms, we can be reasonably confident that when we receive a response there are still some other requests in progress.)
When the hyper server is not using TLS, the test passes.
When the hyper server is using TLS, the test fails with an `IncompleteMessage` error.

I've created a repository that reproduces the issue: https://github.com/chinedufn/hyper-tls-graceful-shutdown-issue
Here's a quick snippet of the graceful shutdown code:
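(Reproduced from the linked repository; this is the same accept/shutdown loop that was inlined in the comments above.)

```rust
loop {
    tokio::select! {
        _ = &mut shutdown_receiver => {
            drop(shut_down_connections_rx);
            break;
        }
        conn = tcp_listener.accept() => {
            tokio::spawn(
                handle_tcp_conn(
                    conn,
                    wait_for_request_to_complete_rx.clone(),
                    shut_down_connections_tx.clone(),
                    tls_config
                )
            );
        }
    }
}
```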
Here is the full source code for convenience (also available in the linked repository):

- Cargo.toml
- Rust code