[tigta] Parsing issue with lxml=4.2.1 #308
Comments
More info:
I can reproduce the issue when using
#!/usr/bin/env python
import sys

from lxml import etree

# Feed the already-decoded sample page through lxml's HTML parser.
with open("tigta_utf8.txt", "r", encoding="utf-8") as f:
    parser = etree.HTMLParser()
    parser.feed(f.read())
    doc = parser.close()

# Serialize the tree again; a corrupted round trip is much larger than the input.
s = etree.tostring(doc)
print("round-tripped length:", len(s))
print("found spaces", b"b o d y" in s)

# Non-zero exit marks this library version as bad (useful when bisecting).
if len(s) > 100000:
    print("bad, returning error exit code")
    sys.exit(1)
else:
    print("good, returning success exit code")
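The script above flags a bad round trip by output size and by one fixed byte pattern. As a rough, more general sketch, the hypothetical helper below (`find_spaced_runs` is not part of the scraper or of lxml) looks for runs of single characters separated by the same number of spaces, which is the symptom described in this issue. The default gap of three spaces and the minimum run length are assumptions.

```python
import re


def find_spaced_runs(text, gap=3, min_chars=4):
    """Return runs of single characters separated by exactly `gap` spaces.

    Hypothetical helper for spotting the corruption described in this issue;
    `gap` and `min_chars` are assumed defaults, not values taken from lxml.
    """
    # One lone non-space character, followed by at least (min_chars - 1)
    # repetitions of "<gap spaces><one non-space character>".
    pattern = re.compile(
        r"(?<!\S)\S(?: {%d}\S){%d,}(?!\S)" % (gap, min_chars - 1)
    )
    return [m.group(0) for m in pattern.finditer(text)]


if __name__ == "__main__":
    sample = "ok " + "   ".join("corrupted") + " ok"
    print(find_spaced_runs(sample))
```

Calling it on the decoded output of `etree.tostring(doc)` and treating any non-empty result as a failure would give a check that does not depend on the size of the sample file.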
By building with
This issue looks a lot like this Chromium bug, https://bugs.chromium.org/p/chromium/issues/detail?id=820163#c39, which has been addressed in the development branch of libxml2. For now, I'll use lxml 4.1.1 as a workaround, and we can update once the various projects cut new releases.
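A small guard can make the temporary pin visible at runtime. This is only a sketch: `PINNED_LXML`, `matches_pin`, and `installed_lxml_version` are hypothetical names, and the pinned version is simply the workaround version mentioned above. `etree.LXML_VERSION` is the version tuple that lxml itself exposes.

```python
# Workaround pin taken from this thread; relax it once fixed releases are out.
PINNED_LXML = (4, 1, 1)


def matches_pin(version, pin=PINNED_LXML):
    """Return True if the leading components of `version` equal `pin`."""
    return tuple(version[:len(pin)]) == tuple(pin)


def installed_lxml_version():
    """Return lxml's version tuple, or None if lxml is not installed."""
    try:
        from lxml import etree
    except ImportError:
        return None
    return etree.LXML_VERSION


if __name__ == "__main__":
    version = installed_lxml_version()
    if version is None:
        print("lxml is not installed")
    elif matches_pin(version):
        print("lxml matches the pinned workaround version")
    else:
        print("warning: lxml", version, "differs from pin", PINNED_LXML)
```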
Great detective work on this. That sounds like the right approach to me.
I ran into a weird error with the TIGTA scraper, and I was only able to reproduce it locally after upgrading lxml from 4.0.0 to 4.2.1. The page https://www.treasury.gov/tigta/publications_congress.shtml contains ISO-8859-1/windows-1252 text, and we're already decoding it correctly. When the decoded text goes through BeautifulSoup/lxml, though, it comes back out corrupted. If I run the snippet below and then print out `text`, it looks okay, but if I print out `doc`, the beginning and end look reasonable while a large section of the middle has three spaces inserted between each character. I suspect an internal UTF-32 representation is getting mishandled somewhere along the line. I've saved a sample locally in case we need to reproduce this later. I'll poke around the lxml and libxml2 issue trackers, or maybe bisect library versions, and see if I can get to the bottom of this.