Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Slow request and not reading request body causes ConnectionResetError on large responses if keep-alive is disabled, even if worker timeout is disabled and in sync, async and threaded workers #3334

Open
michalc opened this issue Dec 30, 2024 · 9 comments

Comments

@michalc
Copy link

michalc commented Dec 30, 2024

Note

Earlier versions of this bug limited to GET requests, and limited to when the body of the request was empty and for transfer-encoding: chunked requests, but these were inaccurate. The issue happens on both GET and POST requests, and when the request body is non-empty, and for both chunked and non-chunked request bodies. Earlier versions also did not make clear the error happened even if the worker timeout is disabled, and were for just sync workers whereas the issue happens in all workers except tornado where keep-alive is not possible or disabled

If I make a large empty file:

mkfile -n 1g temp_1GB_file

A basic gunicorn+Flask server to serve it (without reading the http request body) in test.py:

from flask import Flask

app = Flask(__name__, static_folder='.')

@app.route('/<path:path>', methods=['GET', 'POST'])
def static_proxy(path):
    return app.send_static_file(path)

Run with disabled worker timeout (to exclude that as the issue), and on any of sync, gevent , eventlet or gthread worker classes, as long as keep-alive is not possible or disabled:

gunicorn --worker-class sync --workers=1 --timeout=0 --keep-alive=0 test:app

And then some code to slowly make a GET or POST request to retrieve the large file through the webserver, but in a way that forces the request to be slow, specifically being slow about both ending the request body and fetching the response:

import urllib3
import time

def data():
    time.sleep(4)
    yield b'Anything'

resp = urllib3.request("POST", "http://localhost:8000/temp_1GB_file", 
    body=data(),
    preload_content=False,
)
for chunk in resp.stream(65536):
    time.sleep(0.001)

Then almost always I get the error on the client:

urllib3.exceptions.ProtocolError: ("Connection broken: ConnectionResetError(54, 'Connection reset by peer')", ConnectionResetError(54, 'Connection reset by peer'))

And no error on the server at all.

Interestingly:

  • if I remove the sleeps in the request code, the error doesn't occur
  • or if I use Flask's built-in webserver via flask --app test run --port 8080 instead of gunicorn, the error doesn't occur
  • or if I use uwsgi via uwsgi --http :8000 --wsgi-file test.py --callable app, the error doesn't occur
  • of if I re-enable keep-alive on the gevent, eventlet or gthread, the error doesn't occur
  • or if I access the bytes of the request body on the server before returning from the handler, say from print(request.data), then the error doesn't occur
  • or if I reduce the amount of data returned from the server, the error does not occur
  • or if I use the the tornado worker class, the error does not occur

And:

  • if I send the request body without chunked encoding, i.e. by passing headers={'content-length': '8'} in the above case, the issue still occurs

I'm on gunicorn 23.0.0 on macOS (but I suspect this also happens on Linux in Docker, since I'm chasing down a production issue)

Running Wireshark on the connection I do see an RST from server to client just before the error, and nothing obviously unusual before (although not sure if I would recognise much)

@michalc michalc changed the title GET request with empty transfer-encoding: chunked body causes Connection reset by peer on large responses Request with empty transfer-encoding: chunked body causes Connection reset by peer on large responses Dec 31, 2024
@michalc michalc changed the title Request with empty transfer-encoding: chunked body causes Connection reset by peer on large responses Request with slow transfer-encoding: chunked body causes Connection reset by peer on large responses Dec 31, 2024
@michalc michalc changed the title Request with slow transfer-encoding: chunked body causes Connection reset by peer on large responses Slow request with transfer-encoding: chunked body causes Connection reset by peer on large responses Dec 31, 2024
@michalc michalc changed the title Slow request with transfer-encoding: chunked body causes Connection reset by peer on large responses Slow request with transfer-encoding: chunked body causes ConnectionResetError on large responses Dec 31, 2024
@michalc michalc changed the title Slow request with transfer-encoding: chunked body causes ConnectionResetError on large responses Slow request causes ConnectionResetError on large responses Dec 31, 2024
@michalc michalc changed the title Slow request causes ConnectionResetError on large responses Slow request causes ConnectionResetError on large responses in sync worker Dec 31, 2024
@michalc michalc changed the title Slow request causes ConnectionResetError on large responses in sync worker Slow request and not reading request body causes ConnectionResetError on large responses in sync worker Dec 31, 2024
@pajod
Copy link
Contributor

pajod commented Dec 31, 2024

there is something specific to the gunicorn sync worker

Correct. The default --timeout=30 puts a hard limit on continuous sends in the sync worker. The worker is treated as misbehaving and aborted if unresponsive for that time. Whereas the async workers stay alive even while busy.

N.B. you tested the socket.sendfile() case - which does make surprise-free timeout enforcement more complex.. but that did not seem necessary here, because your reproducer results in the same behavior from the clients perspective, even if --no-sendfile is used.


Previously reported as #1736

@michalc
Copy link
Author

michalc commented Dec 31, 2024

Ah @pajod thanks for your reply.

My suspicion is that something other than the default 30 second worker timeout is at play however. I can force it to hit the timeout, say by making it really short:

gunicorn --workers=1 --timeout 10 test:app

and then I see errors on the server side that say [CRITICAL] WORKER TIMEOUT for example.

But then if I set the worker timeout to be really long

gunicorn --workers=1 --timeout 3600 test:app

or disabled:

gunicorn --workers=1 --timeout 0 test:app

then within around 20 to 32 seconds of starting the request, I get the ConnectionResetError in the client, and just silence in the logs on the server, and specifically no error saying [CRITICAL] WORKER TIMEOUT

@michalc
Copy link
Author

michalc commented Dec 31, 2024

(I've now amended the bug description to include having the worker timeout disabled)

@michalc michalc changed the title Slow request and not reading request body causes ConnectionResetError on large responses in sync worker Slow request and not reading request body causes ConnectionResetError on large responses in sync worker, even if worker timeout is disabled Dec 31, 2024
@pajod
Copy link
Contributor

pajod commented Dec 31, 2024

something other than the default 30 second worker timeout is at play

I see. That was merely also happening. I can see the same thing, even if the timeout does not kick in. The last ~50 chunks of queued writes are not seen by the slow-reading client, unless I make the server sleep(1) before closing the socket.

@michalc
Copy link
Author

michalc commented Dec 31, 2024

Running Wireshark on the connection I do see an RST from server to client just before the error

So this is rolling around my head. From some googling, it looks like if a socket is closed while there is data in its read buffer, it will then send a RST to the other end. Could this be what the sync worker is doing?

If so, could probably come up with an even more minimal example, without using a large file at all and just small amounts of data and some deliberately positioned sleeps…

@michalc
Copy link
Author

michalc commented Dec 31, 2024

Have now got a slightly more minimal example. A 3-line server (no Flask and no files) in test.py

def app(environ, start_response):
    start_response('200 OK', [])
    return [b'-' * 6553600]

that's run with

gunicorn test:app

And then code to emulate a slow client:

import urllib3
import time

def data():
    time.sleep(1)
    yield b'-'

resp = urllib3.request("POST", "http://localhost:8000/", 
    body=data(),
    preload_content=False,
)
for chunk in resp.stream(65536):
    time.sleep(0.001)

And within around 3 seconds, consistently, the client code raises a ConnectionResetError. This is fairly similar to the issues in production I've seen - errors happening within a low number of seconds.

@michalc
Copy link
Author

michalc commented Jan 1, 2025

Aha! From poking around around code, also found that the error occurs on gevent and gthread workers if keepalive is disabled, so

gunicorn --workers=1 --worker-class=gevent --keep-alive=0 test:app

or

gunicorn --workers=1 --worker-class=gthread --keep-alive=0 test:app

I suspect what causes the issue to not occur is the keepalive loop, and specifically the fact that it calls the http parser in the loop, which has code to "Discard any unread body of the previous message", which means that when the socket's close function is called, the chance of there being unread data in the socket's read buffer is much smaller (although... not impossible if the client just snuck in some data "last minute"?)

If keep-alive is not enabled - either by specific configuration or by using the sync worker - then there is no keep alive loop, which means the parser won't be called again to read and discard unread body data, and so for a slow client there is a much higher chance that data is in the socket's read buffer by the time its close function is called.

Am tempted to amend the issue title and description

@michalc michalc changed the title Slow request and not reading request body causes ConnectionResetError on large responses in sync worker, even if worker timeout is disabled Slow request and not reading request body causes ConnectionResetError on large responses if keep-alive is disabled, even if worker timeout is disabled and in both sync and async workers Jan 1, 2025
@michalc
Copy link
Author

michalc commented Jan 1, 2025

I wonder if the fix to this would be to call

client.shutdown(socket.SHUT_RDWR)

Just before closing the socket, in both sync and async workers? From some googling, this might more correctly ignore any unread data that was sent from the client?

@michalc michalc changed the title Slow request and not reading request body causes ConnectionResetError on large responses if keep-alive is disabled, even if worker timeout is disabled and in both sync and async workers Slow request and not reading request body causes ConnectionResetError on large responses if keep-alive is disabled, even if worker timeout is disabled and in sync, async and threaded workers Jan 1, 2025
@michalc
Copy link
Author

michalc commented Jan 2, 2025

Actually have found a variation of request code that shows the error as well. Don't need a sleep to delay sending the request body, as long as there is a fair amount of it:

import urllib3
import time

resp = urllib3.request("POST", "http://localhost:8000/", 
    body=b'-' * 65536,
    preload_content=False,
)
time.sleep(1)
for chunk in resp.stream(65536):
    time.sleep(0.005)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants