Slow request and not reading request body causes ConnectionResetError on large responses if keep-alive is disabled, even if worker timeout is disabled and in sync, async and threaded workers #3334
Comments
Correct, the default. N.B. you tested the socket.sendfile() case, which does make surprise-free timeout enforcement more complex, but that did not seem necessary here, because your reproducer results in the same behavior from the client's perspective either way. Previously reported as #1736
Ah @pajod, thanks for your reply. My suspicion is that something other than the default 30-second worker timeout is at play, however. I can force it to hit the timeout, say by making it really short:
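Presumably something along these lines, using gunicorn's `--timeout` setting (this exact invocation is a guess, not taken from the report):

```sh
# hypothetical: make the worker timeout very short so it is sure to trigger
gunicorn --timeout 1 test:app
```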
and then I see errors on the server side reporting a worker timeout. But then if I set the worker timeout to be really long
or disabled:
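Again a guessed invocation; gunicorn treats `--timeout 0` as disabling the worker timeout:

```sh
# hypothetical: disable the worker timeout entirely
gunicorn --timeout 0 test:app
```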
then within around 20 to 32 seconds of starting the request, I get the ConnectionResetError in the client, and just silence in the logs on the server; specifically, no worker timeout error.
(I've now amended the bug description to include having the worker timeout disabled)
I see. That was merely also happening. I can see the same thing, even if the timeout does not kick in. The last ~50 chunks of queued writes are not seen by the slow-reading client, unless I make the server …
So this is rolling around in my head. From some googling, it looks like if a socket is closed while there is data in its read buffer, it will then send an RST to the other end. Could this be what the sync worker is doing? If so, I could probably come up with an even more minimal example, without using a large file at all, just small amounts of data and some deliberately positioned sleeps…
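Here's a minimal sketch of that behaviour with plain sockets (not from the thread): close a socket while unread data sits in its receive buffer, and the peer sees a ConnectionResetError.

```python
import socket
import time

# Listener standing in for the server side
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("127.0.0.1", 0))
server.listen(1)
port = server.getsockname()[1]

client = socket.create_connection(("127.0.0.1", port))
conn, _ = server.accept()

client.sendall(b"unread request body")  # data the server never reads
time.sleep(0.2)                         # let it land in conn's receive buffer

conn.close()                            # unread data pending -> RST, not FIN
time.sleep(0.2)

try:
    client.recv(1)                      # the RST surfaces here
except ConnectionResetError as exc:
    print("client saw:", exc)
```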
Have now got a slightly more minimal example. A 3-line server (no Flask and no files) in test.py:

```python
def app(environ, start_response):
    start_response('200 OK', [])
    return [b'-' * 6553600]
```

that's run with
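presumably a command like this (a guess; the important parts are the default sync worker and the disabled timeout):

```sh
# hypothetical: default sync worker, worker timeout disabled
gunicorn --timeout 0 test:app
```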
And then code to emulate a slow client:

```python
import urllib3
import time

def data():
    time.sleep(1)
    yield b'-'

resp = urllib3.request("POST", "http://localhost:8000/",
    body=data(),
    preload_content=False,
)
for chunk in resp.stream(65536):
    time.sleep(0.001)
```

And within around 3 seconds, consistently, the client code raises a ConnectionResetError. This is fairly similar to the issues in production I've seen - errors happening within a low number of seconds.
Aha! From poking around the code, I also found that the error occurs on gevent and gthread workers if keep-alive is disabled, for example:
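with invocations along these lines (guesses, and they assume `--keep-alive 0` disables keep-alive for these worker classes):

```sh
# hypothetical: async / threaded workers with keep-alive assumed disabled
gunicorn --worker-class gevent --keep-alive 0 --timeout 0 test:app
gunicorn --worker-class gthread --keep-alive 0 --timeout 0 test:app
```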
I suspect what causes the issue to not occur is the keepalive loop, and specifically the fact that it calls the http parser in the loop, which has code to "Discard any unread body of the previous message". This means that when the socket's close function is called, the chance of there being unread data in the socket's read buffer is much smaller (although not impossible, if the client just snuck in some data "last minute"?).

If keep-alive is not enabled, either by specific configuration or by using the sync worker, then there is no keep-alive loop, which means the parser won't be called again to read and discard unread body data, and so for a slow client there is a much higher chance that data is in the socket's read buffer by the time its close function is called. Am tempted to amend the issue title and description.
I wonder if the fix to this would be to call something just before closing the socket, in both sync and async workers? From some googling, this might more correctly ignore any unread data that was sent from the client?
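As an illustration of the general idea only (a drain-before-close sketch of my own, not gunicorn's code and not necessarily the call meant above): read and discard whatever the client sent before closing, so the receive buffer is empty at close time and the kernel sends a FIN rather than an RST.

```python
import socket

def close_gracefully(sock, timeout=1.0):
    """Best-effort close that avoids an RST caused by unread request data."""
    try:
        sock.shutdown(socket.SHUT_WR)   # response fully written; stop sending
        sock.settimeout(timeout)
        while sock.recv(65536):         # read and discard any unread body
            pass
    except OSError:
        pass
    finally:
        sock.close()
```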
Actually have found a variation of request code that shows the error as well. Don't need a sleep to delay sending the request body, as long as there is a fair amount of it:

```python
import urllib3
import time

resp = urllib3.request("POST", "http://localhost:8000/",
    body=b'-' * 65536,
    preload_content=False,
)
time.sleep(1)
for chunk in resp.stream(65536):
    time.sleep(0.005)
```
Note: earlier versions of this bug report were limited to GET requests, to cases where the request body was empty, and to `transfer-encoding: chunked` requests, but these were inaccurate. The issue happens on both GET and POST requests, when the request body is non-empty, and for both chunked and non-chunked request bodies. Earlier versions also did not make clear that the error happens even if the worker timeout is disabled, and covered just sync workers, whereas the issue happens in all worker classes except tornado, whenever keep-alive is not possible or disabled.

If I make a large empty file:
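for example (the exact size and file name are guesses):

```sh
# hypothetical: create a 100 MiB file of zeros named large.bin
dd if=/dev/zero of=large.bin bs=1024 count=102400
```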
A basic gunicorn+Flask server to serve it (without reading the http request body) in test.py:
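A sketch of what such a server might look like (the original test.py isn't reproduced here; the `large.bin` file name follows the guessed command above):

```python
from flask import Flask, send_file

app = Flask(__name__)

# Serve the large file without ever touching the request body
@app.route("/", methods=["GET", "POST"])
def index():
    return send_file("large.bin")
```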
Run with disabled worker timeout (to exclude that as the issue), and on any of the `sync`, `gevent`, `eventlet` or `gthread` worker classes, as long as keep-alive is not possible or disabled.

And then some code to slowly make a GET or POST request to retrieve the large file through the webserver, but in a way that forces the request to be slow, specifically being slow about both ending the request body and fetching the response:
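A sketch of such a client, in the style of the urllib3 snippets above (the exact code and the port are assumptions): slow to finish sending the request body, and slow to read the large response.

```python
import time
import urllib3

def slow_body():
    # Dribble out the request body slowly
    for _ in range(5):
        time.sleep(1)
        yield b'-'

resp = urllib3.request(
    "POST",
    "http://localhost:8000/",
    body=slow_body(),
    preload_content=False,
)

# ...and read the large response slowly too
for chunk in resp.stream(65536):
    time.sleep(0.01)
```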
Then almost always I get the ConnectionResetError on the client, and no error on the server at all.
Interestingly:

- If I run the app with `flask --app test run --port 8080` instead of gunicorn, the error doesn't occur
- If I run it with `uwsgi --http :8000 --wsgi-file test.py --callable app`, the error doesn't occur
- If keep-alive is left enabled on the `gevent`, `eventlet` or `gthread` worker classes, the error doesn't occur
- If the server reads the request body, for example with `print(request.data)`, then the error doesn't occur
- With the `tornado` worker class, the error does not occur

And:
- If I pass `headers={'content-length': '8'}` in the above case, the issue still occurs
- I'm on gunicorn 23.0.0 on macOS (but I suspect this also happens on Linux in Docker, since I'm chasing down a production issue)
- Running Wireshark on the connection, I do see an RST from server to client just before the error, and nothing obviously unusual before (although I'm not sure I would recognise much)