Investigate spacing in element tails #661

adbar · 2024-07-26T15:09:30Z

Here is an example that isolates the word-squishing issue, where whitespace between words is sometimes removed:

    def test_white_space_issue():
        from trafilatura import extract

        html_string = """<!DOCTYPE html>
        <html lang="en-us">
        <body>
        <main>
            <section>
                <p>First</p>
                This gets Squished
                <div>
                    <h4>There should be a space</h4>
                    <p>Another sentence</p>
                    This also gets Squished
                </div>
                <div>
                    <h4>Where is the space</h4>
                    <p>This sentence has to be long enough. If it's long enough the duplication stops, but if it's not long enough then you'll get an extra first. This sentence has to be long enough. If it's long enough the duplication stops, but if it's not long enough then you'll get an extra first </p>
                </div>
            </section>
        </main>
        </body>
        </html>
        """
        result = extract(html_string)
        assert "SquishedThere" not in result
        assert "SquishedWhere" not in result

Originally posted by @Psynbiotik in #660 (comment)

adbar · 2024-07-26T15:09:39Z

Thanks. This happens because no space is inserted between an element's text and the text that follows its closing tag directly, without being nested in another element (a "tail" in lxml).
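To make the text/tail distinction concrete, here is a minimal sketch. It uses the standard library's `xml.etree.ElementTree`, which exposes the same `.text` and `.tail` attributes as lxml, so it runs without extra dependencies. The fragment is a compacted version of the test case above with inter-tag whitespace already removed; that simplification is an assumption for illustration, and this is not trafilatura's actual code path.

```python
import xml.etree.ElementTree as ET

# Compacted fragment modelled on the test case; whitespace between tags is
# deliberately omitted (an assumption made to keep the illustration short).
root = ET.fromstring(
    "<section>"
    "<p>First</p>This gets Squished"
    "<div><h4>There should be a space</h4></div>"
    "</section>"
)

p = root.find("p")
print(repr(p.text))  # text inside <p>
print(repr(p.tail))  # text after </p>, stored on the <p> element as its tail

# Joining all text and tail strings with no separator glues the tail
# to the text of the element that follows it.
joined = "".join(root.itertext())
print(joined)
```

The tail belongs to the preceding element rather than to the parent, so any serializer that concatenates pieces without a separator ends up with "SquishedThere".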

Psynbiotik · 2024-08-02T10:59:28Z

I'm curious: would the fix for this be to add a space, if one doesn't already exist, for certain tags or classes that would be rendered in HTML with a space or line break?

adbar · 2024-08-02T14:01:38Z

Yes, I also think it's about defining tags (like <p>) for which it is beneficial to insert a space.
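As a rough sketch of that idea, the function below inserts a space around elements whose tag is in a predefined set and then collapses repeated whitespace. The set `SPACED_TAGS` and the function `text_with_spacing` are hypothetical names chosen for illustration, and the code uses the standard library's `xml.etree.ElementTree` in place of lxml; it is not how trafilatura implements or would implement the fix.

```python
import xml.etree.ElementTree as ET

# Hypothetical set of tags that browsers render with a break around them.
SPACED_TAGS = {"p", "div", "section", "h1", "h2", "h3", "h4", "h5", "h6", "li"}

def text_with_spacing(elem):
    """Collect text and tails, padding the elements listed in SPACED_TAGS."""
    parts = []

    def walk(node):
        if node.text:
            parts.append(node.text)
        for child in node:
            spaced = child.tag in SPACED_TAGS
            if spaced:
                parts.append(" ")  # separate from whatever came before
            walk(child)
            if spaced:
                parts.append(" ")  # separate from the tail that follows
            if child.tail:
                parts.append(child.tail)

    walk(elem)
    # Collapse runs of whitespace so the padding never doubles up.
    return " ".join("".join(parts).split())

root = ET.fromstring(
    "<section>"
    "<p>First</p>This gets Squished"
    "<div><h4>There should be a space</h4></div>"
    "</section>"
)
result = text_with_spacing(root)
print(result)
```

Inline elements such as `<b>` are left out of the set on purpose, so a word split across inline markup (for example `Hello <b>wor</b>ld`) is not broken apart.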

adbar added the enhancement and question labels and removed the enhancement label · Jul 26, 2024
