The current implementation of readability_lxml and the original project readability.js are out of sync, so it would make sense to update the readability_lxml implementation to benefit from potentially newer features and fixes.
readability.js has a big set of test pages with source.html and expected.html files. As a first step, we could test how readability_lxml compares against this test set?
Based on the results, we could decide whether it's better to fix the differences (if there aren't too many) or whether a new port based on the current JavaScript implementation would make more sense. What do you think?
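To make the idea concrete, here is a rough sketch of what such a comparison could look like. It assumes the test pages are laid out as one directory per case, each containing `source.html` and `expected.html`. The names `html_to_text` and `score_test_pages` are made up for illustration, and the extractor is passed in as a plain callable (HTML string in, HTML string out) rather than tied to any particular readability_lxml API. Comparing whitespace-normalized text with `difflib` is just one possible metric, not necessarily the right one.

```python
import difflib
from html.parser import HTMLParser
from pathlib import Path
from typing import Callable, Dict


class _TextCollector(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    def __init__(self) -> None:
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(data)


def html_to_text(html: str) -> str:
    """Reduce HTML to whitespace-normalized text (hypothetical helper)."""
    parser = _TextCollector()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.parts).split())


def score_test_pages(
    pages_dir: Path, extract: Callable[[str], str]
) -> Dict[str, float]:
    """Return a word-level similarity ratio (0.0 to 1.0) per test case.

    Assumed layout: pages_dir/<case>/source.html and expected.html.
    `extract` is whatever extractor is under test.
    """
    scores = {}
    for case in sorted(p for p in pages_dir.iterdir() if p.is_dir()):
        source = case / "source.html"
        expected = case / "expected.html"
        if not (source.exists() and expected.exists()):
            continue  # skip incomplete cases instead of failing
        got = html_to_text(extract(source.read_text(encoding="utf-8")))
        want = html_to_text(expected.read_text(encoding="utf-8"))
        matcher = difflib.SequenceMatcher(None, got.split(), want.split())
        scores[case.name] = matcher.ratio()
    return scores
```

Sorting the resulting scores would show quickly whether the differences are concentrated in a few pages or spread across the whole set, which is the information needed to choose between patching and re-porting.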
By the way, readability_lxml looks like it was first ported to Ruby and then to Python. Besides using lxml, were there any functions added specifically for trafilatura?
As far as I know, there is no Python implementation of readability.js. It's possible to use go-readability instead of readability_lxml through C bindings; it's faster and more precise.
@adbar just wanted to let you know I haven't forgotten about this.... I'm a bit busy at the moment, but this topic is on my list for future quiet weekends 😄