Investigate spacing in element tails #661

Open
adbar opened this issue Jul 26, 2024 · 3 comments
Labels
question Further information is requested

Comments

@adbar
Owner

adbar commented Jul 26, 2024

Here is an example that only includes the word squishing issue, where whitespace between words is sometimes removed:

    def test_white_space_issue():
        from trafilatura import extract

        html_string = """<!DOCTYPE html>
        <html lang="en-us">
        <body>
        <main>
            <section>
                <p>First</p>
                This gets Squished
                <div>
                    <h4>There should be a space</h4>
                    <p>Another sentence</p>
                    This also gets Squished
                </div>
                <div>
                    <h4>Where is the space</h4>
                    <p>This sentence has to be long enough. If it's long enough the duplication stops, but if it's not long enough then you'll get an extra first. This sentence has to be long enough. If it's long enough the duplication stops, but if it's not long enough then you'll get an extra first </p>
                </div>
            </section>
        </main>
        </body>
        </html>
        """
        result = extract(html_string)
        # extract() returns None when nothing could be extracted
        assert result is not None
        assert "SquishedThere" not in result
        assert "SquishedWhere" not in result

Originally posted by @Psynbiotik in #660 (comment)

@adbar
Owner Author

adbar commented Jul 26, 2024

Thanks, this happens because no space is inserted between an element's text and the text that follows directly after its closing tag without being nested in an element of its own (a tail in LXML).
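To make the text/tail distinction concrete, here is a minimal sketch. It is not trafilatura's code: the snippet is a compacted version of the test above with the whitespace between tags removed, and the naive join is only an assumption about how the boundary can get lost. The fallback import relies on the standard library exposing the same `.text`/`.tail` model as LXML.

```python
try:
    from lxml import etree  # the library mentioned above
except ImportError:
    # Standard-library fallback with the same .text/.tail attributes
    import xml.etree.ElementTree as etree

# Compacted fragment of the test document (assumption: no whitespace
# is left between the tags and the loose text)
SNIPPET = (
    "<section>"
    "<p>First</p>This gets Squished"
    "<div><h4>There should be a space</h4></div>"
    "</section>"
)

root = etree.fromstring(SNIPPET)
p = root.find("p")

# Text inside <p> lives in .text; the loose text after </p> is its .tail
print(repr(p.text))
print(repr(p.tail))

# Joining all text nodes without a separator glues the tail of <p>
# to the text of the following <h4>
naive = "".join(root.itertext())
print(naive)
```

The tail belongs to the preceding element rather than to the parent, so any code that concatenates text nodes has to decide on its own whether a separator is needed at that boundary.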

@adbar added the enhancement and question labels and removed the enhancement label Jul 26, 2024
@Psynbiotik

I'm curious: would the fix be to add a space, where one doesn't already exist, for certain tags or classes that HTML would render with a space or line break?

@adbar
Owner Author

adbar commented Aug 2, 2024

Yes, I also think it's about defining tags (like <p>) for which it is beneficial to insert a space.
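A rough sketch of that idea, under stated assumptions: `SPACED_TAGS`, `collect` and `text_with_spacing` are hypothetical names, the tag set is only illustrative, and none of this reflects how trafilatura is actually structured. Elements in the set get a space on both sides of their content, inline elements do not, and whitespace runs are collapsed at the end.

```python
import re

try:
    from lxml import etree
except ImportError:
    # Standard-library fallback with the same .text/.tail attributes
    import xml.etree.ElementTree as etree

# Hypothetical set of tags that should be separated from their neighbours
SPACED_TAGS = {"p", "div", "h4", "section"}


def collect(elem, chunks):
    """Append the text of elem and its descendants to chunks, padding SPACED_TAGS."""
    spaced = elem.tag in SPACED_TAGS
    if spaced:
        chunks.append(" ")
    if elem.text:
        chunks.append(elem.text)
    for child in elem:
        collect(child, chunks)
    if spaced:
        chunks.append(" ")  # separates the element from its own tail
    if elem.tail:
        chunks.append(elem.tail)


def text_with_spacing(root):
    """Return the tree's text with padded boundaries and collapsed whitespace."""
    chunks = []
    collect(root, chunks)
    return re.sub(r"\s+", " ", "".join(chunks)).strip()


doc = etree.fromstring(
    "<section>"
    "<p>First</p>This gets Squished"
    "<div><h4>There should be a space</h4>"
    "<p>Another <b>bold</b>ly put sentence</p>This also gets Squished</div>"
    "</section>"
)
result = text_with_spacing(doc)
print(result)
```

Keeping the set explicit means inline elements such as `<b>` stay glued to their surroundings, so words split by inline markup are not broken apart.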
