Here is an example that only includes the word squishing issue, where whitespace between words is sometimes removed:
def test_white_space_issue():
    from trafilatura import extract

    html_string = """<!DOCTYPE html>
<html lang="en-us">
<body>
<main>
<section>
<p>First</p>
This gets Squished
<div>
<h4>There should be a space</h4>
<p>Another sentence</p>
This also gets Squished
</div>
<div>
<h4>Where is the space</h4>
<p>This sentence has to be long enough. If it's long enough the duplication stops, but if it's not long enough then you'll get an extra first. This sentence has to be long enough. If it's long enough the duplication stops, but if it's not long enough then you'll get an extra first </p>
</div>
</section>
</main>
</body>
</html>
"""
    result = extract(html_string)
    # The bare text after </p> must not be glued to the heading that follows it.
    assert "SquishedThere" not in result
    assert "SquishedWhere" not in result
Thanks, this happens because no space is inserted between an element's text and the text that follows directly after it (text that is not nested in an element of its own, which LXML calls a tail).
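To make the text/tail distinction concrete, here is a minimal sketch. It assumes the standard library's xml.etree.ElementTree, which exposes the same `.text` and `.tail` attributes as LXML, so that it runs without extra dependencies. The fragment and the `naive_join` helper are illustrative only and are not trafilatura's actual code path.

```python
import xml.etree.ElementTree as ET

# Simplified, well-formed fragment modelled on the report: bare text follows
# </p>, and the next <h4> starts right after it with no whitespace.
root = ET.fromstring(
    "<section><p>First</p>This gets Squished"
    "<div><h4>There should be a space</h4></div></section>"
)

p = root.find("p")
# p.text is what sits inside <p>...</p>; p.tail is the bare text after </p>.
print(repr(p.text))
print(repr(p.tail))


def naive_join(elem):
    """Hypothetical helper: concatenate all text and tails with no separator."""
    return "".join(elem.itertext())


# The tail of <p> ends up glued to the text of <h4>.
print(naive_join(root))
```

Because the tail belongs to the preceding element rather than to an element of its own, a plain concatenation has no boundary at which to add whitespace unless the code adds one explicitly.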
I'm curious, would the fix be to add a space, when one doesn't already exist, around certain tags or classes that a browser would render with a space or line break?
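For illustration, one way that idea could look is sketched below. This is only an assumption about a possible approach, not how trafilatura implements or will implement the fix. `BLOCK_TAGS`, `pad_block_boundaries` and `flatten` are hypothetical names, and the standard library's xml.etree.ElementTree stands in for LXML since both share the text/tail model.

```python
import xml.etree.ElementTree as ET

# Hypothetical list of tags that a browser renders on their own line.
BLOCK_TAGS = {"p", "div", "section", "h1", "h2", "h3", "h4", "h5", "h6", "li"}


def pad_block_boundaries(root):
    """Put a space at the start of every block element's text and tail."""
    for elem in root.iter():
        if elem.tag in BLOCK_TAGS:
            elem.text = " " + (elem.text or "")
            elem.tail = " " + (elem.tail or "")


def flatten(root):
    """Join all text chunks, then collapse runs of whitespace."""
    return " ".join("".join(root.itertext()).split())


root = ET.fromstring(
    "<section><p>First</p>This gets Squished"
    "<div><h4>There should be a space</h4><p>Another sentence</p>"
    "This also gets Squished</div>"
    "<div><h4>Where is the space</h4></div></section>"
)
pad_block_boundaries(root)
result = flatten(root)
print(result)
```

Padding unconditionally and collapsing whitespace afterwards avoids having to check whether a space already exists at each boundary. The trade-off is that any original whitespace runs are normalized to single spaces.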
Originally posted by @Psynbiotik in #660 (comment)