Encode str bodies to Latin-1 instead of UTF-8 #3063
Conversation
```diff
@@ -65,7 +65,7 @@ def _test_body(self, data: bytes | str | None) -> None:
         assert b"Transfer-Encoding: chunked" in header.split(b"\r\n")
         if data:
-            bdata = data if isinstance(data, bytes) else data.encode("utf-8")
+            bdata = data if isinstance(data, bytes) else data.encode("latin-1")
```
This is actually suspicious because 1.26.x used utf-8 here:
urllib3/test/with_dummyserver/test_chunked_transfer.py, lines 53 to 75 at 3c01480:
```python
def _test_body(self, data):
    self.start_chunked_handler()
    with HTTPConnectionPool(self.host, self.port, retries=False) as pool:
        pool.urlopen("GET", "/", data, chunked=True)
        header, body = self.buffer.split(b"\r\n\r\n", 1)

        assert b"Transfer-Encoding: chunked" in header.split(b"\r\n")
        if data:
            bdata = data if isinstance(data, bytes) else data.encode("utf-8")
            assert b"\r\n" + bdata + b"\r\n" in body
            assert body.endswith(b"\r\n0\r\n\r\n")

            len_str = body.split(b"\r\n", 1)[0]
            stated_len = int(len_str, 16)
            assert stated_len == len(bdata)
        else:
            assert body == b"0\r\n\r\n"

def test_bytestring_body(self):
    self._test_body(b"thisshouldbeonechunk\r\nasdf")

def test_unicode_body(self):
    self._test_body(u"thisshouldbeonechunk\r\näöüß")
```
Weird, but there was no test case for non-ASCII characters?
There are a few non-ASCII characters on line 75
And changing `data.encode("utf-8")` to `data.encode("latin-1")` breaks the test in 1.26.x:
```
test_unicode_body failed; it passed 0 out of the required 1 times.
<class 'AssertionError'>
assert ((b'\r\n' + b'thisshouldbeonechunk\r\n\xe4\xf6\xfc\xdf') + b'\r\n') in b'1e\r\nthisshouldbeonechunk\r\n\xc3\xa4\xc3\xb6\xc3\xbc\xc3\x9f\r\n0\r\n\r\n'
```
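The mismatch comes from the two codecs producing different bytes for the non-ASCII suffix. A quick illustration (plain Python, not code from the PR):

```python
s = "äöüß"

# Latin-1 maps each code point below U+0100 to a single byte.
assert s.encode("latin-1") == b"\xe4\xf6\xfc\xdf"

# UTF-8 needs two bytes for each of these characters.
assert s.encode("utf-8") == b"\xc3\xa4\xc3\xb6\xc3\xbc\xc3\x9f"

# So both the chunk payload and its stated length differ between codecs.
assert len(s.encode("latin-1")) == 4
assert len(s.encode("utf-8")) == 8
```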
Related encoding happens on this line in 1.26.x (urllib3/src/urllib3/connection.py, line 274 at 57181d6):

```python
chunk = chunk.encode("utf8")
```
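For context, here is how that encoded chunk ends up framed on the wire. The `frame_chunk` helper below is my own sketch of the 1.26.x behavior quoted above, not urllib3's actual code; it reproduces the wire bytes seen in the failing assertion:

```python
def frame_chunk(chunk):
    # Hypothetical helper mirroring 1.26.x: str chunks are UTF-8 encoded.
    if isinstance(chunk, str):
        chunk = chunk.encode("utf8")
    # Chunked transfer coding: hex length, CRLF, payload, CRLF.
    return b"%x\r\n%b\r\n" % (len(chunk), chunk)

body = frame_chunk("thisshouldbeonechunk\r\näöüß") + b"0\r\n\r\n"

# 22 ASCII bytes plus 8 bytes for the four two-byte UTF-8 characters
# give a chunk length of 30, i.e. 0x1e -- the "1e" in the test failure.
assert body == (
    b"1e\r\nthisshouldbeonechunk\r\n"
    b"\xc3\xa4\xc3\xb6\xc3\xbc\xc3\x9f\r\n0\r\n\r\n"
)
```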
Thanks, good find! So we were encoding chunks with UTF-8 and other bodies with Latin-1, and 2.0 reversed that unintentionally. I'll fix that too.
Adding my two cents: if it's encoding as UTF-8 and the world isn't broken, we may want to call this a "feature" and encode every body as UTF-8. That's what the world is using nowadays anyway, so we're likely to cause more issues by using Latin-1 instead?
I like the idea, but note that chunks are encoded as UTF-8 in 1.26.x yet as Latin-1 in 2.x. So we should only fix that part: encoding chunks as UTF-8?
@pquentin Oh gotcha, yeah let's UTF-8 all the things!
Closing this PR (#3053 (comment)) |
Relates #3053 by fixing the immediate regression. We may introduce warnings in the future to help users not rely on this behavior. The changes are split in 3 commits (but squashing is fine if we merge this): `latin-1` appears in enough places that I preferred to be consistent and updated the remaining occurrences of `iso-8859-1` to `latin-1`.
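As a side note on that final cleanup: in Python, `latin-1` and `iso-8859-1` name the same codec, so the rename is purely cosmetic. A quick check (my own sketch, not part of the PR):

```python
import codecs

# Both names resolve to the same codec; the rename is consistency-only.
assert codecs.lookup("latin-1").name == codecs.lookup("iso-8859-1").name

# Latin-1 round-trips every byte value 0-255 losslessly...
assert bytes(range(256)).decode("latin-1").encode("latin-1") == bytes(range(256))

# ...but str bodies containing code points above U+00FF cannot use it.
try:
    "€".encode("latin-1")
    raised = False
except UnicodeEncodeError:
    raised = True
assert raised
```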