
[tigta] Parsing issue with lxml=4.2.1 #308

Open
divergentdave opened this issue May 25, 2018 · 5 comments

@divergentdave
Contributor

I ran into a weird error with the TIGTA scraper; I was only able to reproduce it locally after upgrading lxml from 4.0.0 to 4.2.1. The page https://www.treasury.gov/tigta/publications_congress.shtml contains ISO-8859-1/windows-1252 text, and we're already decoding it correctly. When the decoded text goes through BeautifulSoup/lxml, though, it comes back out corrupted. If I run the snippet below and then print out text, it looks fine, but if I print out doc, the beginning and end look reasonable while a large section of the middle has three spaces inserted between each character. I suspect an internal UTF-32 representation is getting mishandled somewhere along the line. I've saved a sample locally in case we need to reproduce this later. I'll poke around the lxml and libxml2 issue trackers, or maybe bisect library versions, and see if I can get to the bottom of this.

>>> from bs4 import BeautifulSoup
>>> URL = "https://www.treasury.gov/tigta/publications_congress.shtml"
>>> import requests
>>> resp = requests.get(URL)
>>> resp.encoding = "iso-8859-1"
>>> text = resp.text
>>> doc = BeautifulSoup(text, "lxml")
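The "three spaces between each character" symptom can be checked for mechanically rather than by eyeballing the output. A minimal sketch (the regex and helper name are my own, not from the original report):

```python
import re

# Match runs of at least four non-space characters separated by exactly
# three spaces, e.g. "b   o   d   y" -- the corruption pattern described
# above. This is a heuristic; ordinary prose rarely contains such runs.
CORRUPTION = re.compile(r"(?:\S {3}){3,}\S")

def looks_corrupted(html_text):
    """Return True if the text shows the three-spaces-per-character pattern."""
    return CORRUPTION.search(html_text) is not None
```

Calling looks_corrupted(str(doc)) after parsing would flag the bad output automatically, which is handy when bisecting library versions.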
@divergentdave
Contributor Author

More info:

  • the three spaces between characters are literal space characters, not some other whitespace
  • lxml 4.1.1 is good, 4.2.0 is bad

@divergentdave
Contributor Author

divergentdave commented May 25, 2018

I can reproduce the issue when using lxml directly, without BeautifulSoup (script below). Thus far I haven't been able to reproduce the issue when building from source, only when using the released wheels.

#!/usr/bin/env python
import sys

from lxml import etree

with open("tigta_utf8.txt", "r", encoding="utf-8") as f:
    parser = etree.HTMLParser()
    parser.feed(f.read())
    doc = parser.close()
s = etree.tostring(doc)
print("round-tripped length:", len(s))
print("found spaces", b"b   o   d   y" in s)
if len(s) > 100000:
    print("bad, returning error exit code")
    sys.exit(1)
else:
    print("good, returning success exit code")
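As a cross-check, the same file run through Python's pure-stdlib html.parser, which never touches libxml2, should come back intact. A minimal sketch under that assumption (the class and function names are my own):

```python
from html.parser import HTMLParser

class TextCollector(HTMLParser):
    """Collect character data using the stdlib parser, bypassing libxml2."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

def stdlib_text(html_text):
    """Return the concatenated text content of html_text."""
    p = TextCollector()
    p.feed(html_text)
    p.close()
    return "".join(p.chunks)
```

Comparing the stdlib output against etree.tostring(doc) for the same input would isolate the corruption to the libxml2 code path.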

@divergentdave
Contributor Author

By building with make wheel_manylinux64, I was able to get the above test case to succeed and fail on different versions. I bisected it, and the first bad commit is lxml/lxml@9366980, "Use latest libxml2 version in binary wheels of next release." This bumps the variable MANYLINUX_LIBXML2_VERSION from 2.9.7 to 2.9.8. It seems the issue is a layer deeper, in libxml2 itself, and will require another round of bisecting.

@divergentdave
Copy link
Contributor Author

This issue looks a lot like this Chromium bug https://bugs.chromium.org/p/chromium/issues/detail?id=820163#c39, which has been addressed in the development branch of libxml2. For now, I'll use lxml 4.1.1 as a workaround, and we can update once the various projects cut new releases.
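For reference, the workaround amounts to a version pin; assuming the project tracks dependencies in a pip requirements file, the line would be:

```
lxml==4.1.1
```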

divergentdave added a commit that referenced this issue Jun 1, 2018
@konklone
Member

konklone commented Jun 2, 2018

Great detective work on this. That sounds like the right approach to me.
