Slow request and not reading request body causes ConnectionResetError on large responses if keep-alive is disabled, even if worker timeout is disabled and in sync, async and threaded workers #3334

michalc · 2024-12-30T20:44:37Z

Note

Earlier versions of this bug limited to GET requests, and limited to when the body of the request was empty and for transfer-encoding: chunked requests, but these were inaccurate. The issue happens on both GET and POST requests, and when the request body is non-empty, and for both chunked and non-chunked request bodies. Earlier versions also did not make clear the error happened even if the worker timeout is disabled, and were for just sync workers whereas the issue happens in all workers except tornado where keep-alive is not possible or disabled

If I make a large empty file:

mkfile -n 1g temp_1GB_file

A basic gunicorn+Flask server to serve it (without reading the http request body) in test.py:

from flask import Flask

app = Flask(__name__, static_folder='.')

@app.route('/<path:path>', methods=['GET', 'POST'])
def static_proxy(path):
    return app.send_static_file(path)

Run with disabled worker timeout (to exclude that as the issue), and on any of sync, gevent , eventlet or gthread worker classes, as long as keep-alive is not possible or disabled:

gunicorn --worker-class sync --workers=1 --timeout=0 --keep-alive=0 test:app

And then some code to slowly make a GET or POST request to retrieve the large file through the webserver, but in a way that forces the request to be slow, specifically being slow about both ending the request body and fetching the response:

import urllib3
import time

def data():
    time.sleep(4)
    yield b'Anything'

resp = urllib3.request("POST", "http://localhost:8000/temp_1GB_file", 
    body=data(),
    preload_content=False,
)
for chunk in resp.stream(65536):
    time.sleep(0.001)

Then almost always I get the error on the client:

urllib3.exceptions.ProtocolError: ("Connection broken: ConnectionResetError(54, 'Connection reset by peer')", ConnectionResetError(54, 'Connection reset by peer'))

And no error on the server at all.

Interestingly:

if I remove the sleeps in the request code, the error doesn't occur
or if I use Flask's built-in webserver via flask --app test run --port 8080 instead of gunicorn, the error doesn't occur
or if I use uwsgi via uwsgi --http :8000 --wsgi-file test.py --callable app, the error doesn't occur
of if I re-enable keep-alive on the gevent, eventlet or gthread, the error doesn't occur
or if I access the bytes of the request body on the server before returning from the handler, say from print(request.data), then the error doesn't occur
or if I reduce the amount of data returned from the server, the error does not occur
or if I use the the tornado worker class, the error does not occur

And:

if I send the request body without chunked encoding, i.e. by passing headers={'content-length': '8'} in the above case, the issue still occurs

I'm on gunicorn 23.0.0 on macOS (but I suspect this also happens on Linux in Docker, since I'm chasing down a production issue)

Running Wireshark on the connection I do see an RST from server to client just before the error, and nothing obviously unusual before (although not sure if I would recognise much)

The text was updated successfully, but these errors were encountered:

pajod · 2024-12-31T14:03:18Z

there is something specific to the gunicorn sync worker

Correct. The default --timeout=30 puts a hard limit on continuous sends in the sync worker. The worker is treated as misbehaving and aborted if unresponsive for that time. Whereas the async workers stay alive even while busy.

N.B. you tested the socket.sendfile() case - which does make surprise-free timeout enforcement more complex.. but that did not seem necessary here, because your reproducer results in the same behavior from the clients perspective, even if --no-sendfile is used.

Previously reported as #1736

michalc · 2024-12-31T14:28:53Z

Ah @pajod thanks for your reply.

My suspicion is that something other than the default 30 second worker timeout is at play however. I can force it to hit the timeout, say by making it really short:

gunicorn --workers=1 --timeout 10 test:app

and then I see errors on the server side that say [CRITICAL] WORKER TIMEOUT for example.

But then if I set the worker timeout to be really long

gunicorn --workers=1 --timeout 3600 test:app

or disabled:

gunicorn --workers=1 --timeout 0 test:app

then within around 20 to 32 seconds of starting the request, I get the ConnectionResetError in the client, and just silence in the logs on the server, and specifically no error saying [CRITICAL] WORKER TIMEOUT

michalc · 2024-12-31T14:34:59Z

(I've now amended the bug description to include having the worker timeout disabled)

pajod · 2024-12-31T16:58:17Z

something other than the default 30 second worker timeout is at play

I see. That was merely also happening. I can see the same thing, even if the timeout does not kick in. The last ~50 chunks of queued writes are not seen by the slow-reading client, unless I make the server sleep(1) before closing the socket.

michalc · 2024-12-31T18:54:12Z

Running Wireshark on the connection I do see an RST from server to client just before the error

So this is rolling around my head. From some googling, it looks like if a socket is closed while there is data in its read buffer, it will then send a RST to the other end. Could this be what the sync worker is doing?

If so, could probably come up with an even more minimal example, without using a large file at all and just small amounts of data and some deliberately positioned sleeps…

michalc · 2024-12-31T22:18:16Z

Have now got a slightly more minimal example. A 3-line server (no Flask and no files) in test.py

def app(environ, start_response):
    start_response('200 OK', [])
    return [b'-' * 6553600]

that's run with

gunicorn test:app

And then code to emulate a slow client:

import urllib3
import time

def data():
    time.sleep(1)
    yield b'-'

resp = urllib3.request("POST", "http://localhost:8000/", 
    body=data(),
    preload_content=False,
)
for chunk in resp.stream(65536):
    time.sleep(0.001)

And within around 3 seconds, consistently, the client code raises a ConnectionResetError. This is fairly similar to the issues in production I've seen - errors happening within a low number of seconds.

michalc · 2025-01-01T09:07:47Z

Aha! From poking around around code, also found that the error occurs on gevent and gthread workers if keepalive is disabled, so

gunicorn --workers=1 --worker-class=gevent --keep-alive=0 test:app

or

gunicorn --workers=1 --worker-class=gthread --keep-alive=0 test:app

I suspect what causes the issue to not occur is the keepalive loop, and specifically the fact that it calls the http parser in the loop, which has code to "Discard any unread body of the previous message", which means that when the socket's close function is called, the chance of there being unread data in the socket's read buffer is much smaller (although... not impossible if the client just snuck in some data "last minute"?)

If keep-alive is not enabled - either by specific configuration or by using the sync worker - then there is no keep alive loop, which means the parser won't be called again to read and discard unread body data, and so for a slow client there is a much higher chance that data is in the socket's read buffer by the time its close function is called.

Am tempted to amend the issue title and description

michalc · 2025-01-01T09:23:01Z

I wonder if the fix to this would be to call

client.shutdown(socket.SHUT_RDWR)

Just before closing the socket, in both sync and async workers? From some googling, this might more correctly ignore any unread data that was sent from the client?

michalc · 2025-01-02T06:05:35Z

Actually have found a variation of request code that shows the error as well. Don't need a sleep to delay sending the request body, as long as there is a fair amount of it:

import urllib3
import time

resp = urllib3.request("POST", "http://localhost:8000/", 
    body=b'-' * 65536,
    preload_content=False,
)
time.sleep(1)
for chunk in resp.stream(65536):
    time.sleep(0.005)

michalc changed the title ~~GET request with empty transfer-encoding: chunked body causes Connection reset by peer on large responses~~ Request with empty transfer-encoding: chunked body causes Connection reset by peer on large responses Dec 31, 2024

michalc changed the title ~~Request with empty transfer-encoding: chunked body causes Connection reset by peer on large responses~~ Request with slow transfer-encoding: chunked body causes Connection reset by peer on large responses Dec 31, 2024

michalc changed the title ~~Request with slow transfer-encoding: chunked body causes Connection reset by peer on large responses~~ Slow request with transfer-encoding: chunked body causes Connection reset by peer on large responses Dec 31, 2024

michalc changed the title ~~Slow request with transfer-encoding: chunked body causes Connection reset by peer on large responses~~ Slow request with transfer-encoding: chunked body causes ConnectionResetError on large responses Dec 31, 2024

michalc changed the title ~~Slow request with transfer-encoding: chunked body causes ConnectionResetError on large responses~~ Slow request causes ConnectionResetError on large responses Dec 31, 2024

michalc changed the title ~~Slow request causes ConnectionResetError on large responses~~ Slow request causes ConnectionResetError on large responses in sync worker Dec 31, 2024

michalc changed the title ~~Slow request causes ConnectionResetError on large responses in sync worker~~ Slow request and not reading request body causes ConnectionResetError on large responses in sync worker Dec 31, 2024

michalc mentioned this issue Dec 31, 2024

fix: avoid unnecessarily chunking upstream requests uktrade/ip-filter#56

Merged

michalc changed the title ~~Slow request and not reading request body causes ConnectionResetError on large responses in sync worker~~ Slow request and not reading request body causes ConnectionResetError on large responses in sync worker, even if worker timeout is disabled Dec 31, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Slow request and not reading request body causes ConnectionResetError on large responses if keep-alive is disabled, even if worker timeout is disabled and in sync, async and threaded workers #3334

Slow request and not reading request body causes ConnectionResetError on large responses if keep-alive is disabled, even if worker timeout is disabled and in sync, async and threaded workers #3334

michalc commented Dec 30, 2024 •

edited

Loading

pajod commented Dec 31, 2024 •

edited

Loading

michalc commented Dec 31, 2024 •

edited

Loading

michalc commented Dec 31, 2024

pajod commented Dec 31, 2024

michalc commented Dec 31, 2024

michalc commented Dec 31, 2024 •

edited

Loading

michalc commented Jan 1, 2025

michalc commented Jan 1, 2025

michalc commented Jan 2, 2025 •

edited

Loading

Slow request and not reading request body causes ConnectionResetError on large responses if keep-alive is disabled, even if worker timeout is disabled and in sync, async and threaded workers #3334

Slow request and not reading request body causes ConnectionResetError on large responses if keep-alive is disabled, even if worker timeout is disabled and in sync, async and threaded workers #3334

Comments

michalc commented Dec 30, 2024 • edited Loading

pajod commented Dec 31, 2024 • edited Loading

michalc commented Dec 31, 2024 • edited Loading

michalc commented Dec 31, 2024

pajod commented Dec 31, 2024

michalc commented Dec 31, 2024

michalc commented Dec 31, 2024 • edited Loading

michalc commented Jan 1, 2025

michalc commented Jan 1, 2025

michalc commented Jan 2, 2025 • edited Loading

michalc commented Dec 30, 2024 •

edited

Loading

pajod commented Dec 31, 2024 •

edited

Loading

michalc commented Dec 31, 2024 •

edited

Loading

michalc commented Dec 31, 2024 •

edited

Loading

michalc commented Jan 2, 2025 •

edited

Loading