New port of readability.js? #604

Open
zirkelc opened this issue May 23, 2024 · 4 comments
Labels
question Further information is requested

Comments

@zirkelc
Contributor

zirkelc commented May 23, 2024

The current implementation of readability_lxml and the original project readability.js are out of sync, so it would make sense to update readability_lxml to benefit from newer features and fixes.

readability.js has a large set of test pages, each with a source.html and an expected.html file. As a first step, we could test how readability_lxml compares against this test set.

Based on the results, we could decide whether it's better to fix the differences (if there aren't too many) or whether a new port based on the current JavaScript implementation would make more sense. What do you think?
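
A minimal sketch of such a comparison, assuming a local checkout of the readability.js test pages where each case is a directory containing source.html and expected.html. The `extract` callable is a hypothetical placeholder for whatever wraps readability_lxml (HTML in, extracted HTML out); `visible_text`, `similarity`, and `compare_test_pages` are illustrative names, not part of either project. The score is a rough word-level similarity of the visible text, not the exact DOM comparison the readability.js tests perform.

```python
from difflib import SequenceMatcher
from html.parser import HTMLParser
from pathlib import Path
from typing import Callable, Dict, List


class _TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of script and style tags."""

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self.chunks: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def visible_text(html: str) -> str:
    """Return the whitespace-normalized visible text of an HTML string."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.chunks).split())


def similarity(a: str, b: str) -> float:
    """Word-level similarity between 0.0 and 1.0 (1.0 means identical)."""
    return SequenceMatcher(None, a.split(), b.split(), autojunk=False).ratio()


def compare_test_pages(root: Path, extract: Callable[[str], str]) -> Dict[str, float]:
    """Score the extractor's output against expected.html for every test case."""
    scores: Dict[str, float] = {}
    for case in sorted(p for p in root.iterdir() if p.is_dir()):
        source = case / "source.html"
        expected = case / "expected.html"
        if not (source.exists() and expected.exists()):
            continue  # skip incomplete test cases
        output_html = extract(source.read_text(encoding="utf-8", errors="replace"))
        expected_html = expected.read_text(encoding="utf-8", errors="replace")
        scores[case.name] = similarity(
            visible_text(output_html), visible_text(expected_html)
        )
    return scores
```

Sorting the resulting scores in ascending order would surface the pages where the two implementations diverge most, which should help decide between fixing differences and starting a new port. Word-level diffing can be slow on very long pages, so this is only meant as a first pass.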

By the way, readability_lxml looks like it was first ported to Ruby and then to Python. Besides using lxml, were any functions added specifically for trafilatura?

@adbar adbar added the question Further information is requested label May 23, 2024
@adbar
Owner

adbar commented May 23, 2024

The approach looks good!

The Readability port was already using lxml before; I just simplified it and integrated it directly since it wasn't maintained anymore.

@prnake

prnake commented May 26, 2024

As far as I know, there is no Python implementation of readability.js. It's possible to use go-readability through C bindings instead of readability_lxml; it's faster and more precise.

@adbar
Owner

adbar commented May 27, 2024

Generating bindings means additional maintenance work, but if someone wants to publish a Python package for go-readability, I'd be happy to test it.

@zirkelc
Contributor Author

zirkelc commented Jun 5, 2024

@adbar just wanted to let you know I haven't forgotten about this. I'm a bit busy at the moment, but this topic is on my list for future quiet weekends 😄


3 participants