
[tigta] Parsing issue with lxml=4.2.1 #308

Open
divergentdave opened this issue May 25, 2018 · 5 comments

@divergentdave
Contributor

I ran into a weird error with the TIGTA scraper; I was only able to reproduce it locally after upgrading lxml from 4.0.0 to 4.2.1. The page https://www.treasury.gov/tigta/publications_congress.shtml contains ISO-8859-1/windows-1252 text, and we're already decoding it correctly. When the decoded text goes through BeautifulSoup/lxml, though, it comes back out corrupted. If I run the snippet below and then print out text, it looks fine, but if I print out doc, the beginning and end look reasonable while a large section of the middle has three spaces inserted between each character. I suspect an internal UTF-32 representation is getting mishandled somewhere along the line. I've saved a sample locally in case we need to reproduce this later. I'll poke around the lxml and libxml2 issue trackers, or maybe bisect library versions, and see if I can get to the bottom of this.

>>> from bs4 import BeautifulSoup
>>> URL = "https://www.treasury.gov/tigta/publications_congress.shtml"
>>> import requests
>>> resp = requests.get(URL)
>>> resp.encoding = "iso-8859-1"
>>> text = resp.text
>>> doc = BeautifulSoup(text, "lxml")
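The "three spaces between each character" symptom can be checked for mechanically rather than by eyeballing the output. A minimal sketch (the regex and helper name are my own, not from the original report):

```python
import re

# Match runs of at least four non-space characters separated by exactly
# three spaces, e.g. "b   o   d   y" -- the corruption pattern described
# above. This is a heuristic; ordinary prose rarely contains such runs.
CORRUPTION = re.compile(r"(?:\S {3}){3,}\S")

def looks_corrupted(html_text):
    """Return True if the text shows the three-spaces-per-character pattern."""
    return CORRUPTION.search(html_text) is not None
```

Calling looks_corrupted(str(doc)) after parsing would flag the bad output automatically, which is handy when bisecting library versions.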
@divergentdave
Contributor Author

More info:

  • the three spaces between characters are literal space characters, not some other whitespace
  • lxml 4.1.1 is good, 4.2.0 is bad

@divergentdave
Contributor Author

divergentdave commented May 25, 2018

I can reproduce the issue when using lxml directly, without BeautifulSoup (script below). Thus far I haven't been able to reproduce the issue when building from source, only when using the released wheels.

#!/usr/bin/env python
import sys

from lxml import etree

with open("tigta_utf8.txt", "r", encoding="utf-8") as f:
    parser = etree.HTMLParser()
    parser.feed(f.read())
    doc = parser.close()
s = etree.tostring(doc)
print("round-tripped length:", len(s))
print("found spaces", b"b   o   d   y" in s)
if len(s) > 100000:
    print("bad, returning error exit code")
    sys.exit(1)
else:
    print("good, returning success exit code")
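As a cross-check, the same file run through Python's pure-stdlib html.parser, which never touches libxml2, should come back intact. A minimal sketch under that assumption (the class and function names are my own):

```python
from html.parser import HTMLParser

class TextCollector(HTMLParser):
    """Collect character data using the stdlib parser, bypassing libxml2."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

def stdlib_text(html_text):
    """Return the concatenated text content of html_text."""
    p = TextCollector()
    p.feed(html_text)
    p.close()
    return "".join(p.chunks)
```

Comparing the stdlib output against etree.tostring(doc) for the same input would isolate the corruption to the libxml2 code path.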

@divergentdave
Contributor Author

By building with make wheel_manylinux64, I was able to get the above test case to succeed and fail on different versions. I bisected it, and the first bad commit is lxml/lxml@9366980, "Use latest libxml2 version in binary wheels of next release." This bumps the variable MANYLINUX_LIBXML2_VERSION from 2.9.7 to 2.9.8. It seems the issue is a layer deeper, in libxml2 itself, and will require another round of bisecting.

@divergentdave
Copy link
Contributor Author

This issue looks a lot like this Chromium bug https://bugs.chromium.org/p/chromium/issues/detail?id=820163#c39, which has been addressed in the development branch of libxml2. For now, I'll use lxml 4.1.1 as a workaround, and we can update once the various projects cut new releases.
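For reference, the workaround amounts to a version pin; assuming the project tracks dependencies in a pip requirements file, the line would be:

```
lxml==4.1.1
```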

divergentdave added a commit that referenced this issue Jun 1, 2018
@konklone
Member

konklone commented Jun 2, 2018

Great detective work on this. That sounds like the right approach to me.
