Merge multiple nodes returned by XPath #487
Conversation
Hi @hugoobauer, thanks for the PR and your feedback. You're welcome to try out the evaluation script and make changes. The same goes for the data: your example is telling, and at least in that case (and any others you can find) the text segment should be replaced by another. If I understand your results correctly, your modification degrades accuracy by adding false positives? Even in …
Codecov Report
Attention: …
Additional details and impacted files

@@            Coverage Diff             @@
##           master     #487      +/-   ##
==========================================
+ Coverage   96.91%   96.93%   +0.02%
==========================================
  Files          22       22
  Lines        3370     3428      +58
==========================================
+ Hits         3266     3323      +57
- Misses        104      105      +1

☔ View full report in Codecov by Sentry.
Yes, the change adds false positives because some web pages have multiple …
Then we probably need to think twice about the change, because it doesn't bring much on the evaluation dataset. Maybe another heuristic is needed, or we could expose your changes to the end user through an additional parameter. It would also be nice if you separated the changes to the evaluation from the ongoing work: if you open another pull request focusing only on the evaluation, I can merge it sooner and we can continue looking for a solution.
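To illustrate the second option, here is a sketch of what an opt-in parameter could look like from the user's side. fetch_url() and extract() are real trafilatura functions, but the merge_xpath_nodes flag is purely hypothetical:

```python
# Hypothetical opt-in flag for the node-merging behaviour discussed here.
# fetch_url() and extract() exist in trafilatura, but the
# merge_xpath_nodes parameter is invented for illustration only.
import trafilatura

downloaded = trafilatura.fetch_url("https://example.org/article")
text = trafilatura.extract(downloaded)  # current default behaviour
# Opt-in variant (hypothetical parameter):
# text = trafilatura.extract(downloaded, merge_xpath_nodes=True)
```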
Sure, I can create two separate branches. As for the original problem, it's tricky because there are more websites that are implemented correctly than websites with a "broken" architecture. I've done a few manual checks: the change fixes a few extractions, but in most cases it picks up article suggestions instead.
I've just tested the first idea by adding a minimum length for each node, to avoid retrieving article suggestions, and the score is better.
The "true positives" count is lower, probably because nodes with a length < 250 (MIN_EXTRACTED_SIZE) are now skipped, so some texts are missed. But false positives and true negatives are now closer to their initial values. I'll be on leave starting tomorrow for a week; I'll pick this up when I get back ;)
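The heuristic described here amounts to something like the following sketch (illustrative code, not the actual patch). MIN_EXTRACTED_SIZE is the setting mentioned in the comment, and the candidates are assumed to be lxml elements:

```python
# Sketch of the minimum-length heuristic: discard candidate nodes whose
# text content is shorter than MIN_EXTRACTED_SIZE, so that short
# "related articles" blocks are not merged into the result.
# Illustrative only; not the code from the patch.
MIN_EXTRACTED_SIZE = 250  # value mentioned in the discussion

def filter_short_nodes(candidates):
    """Keep only lxml nodes whose concatenated text is long enough."""
    kept = []
    for node in candidates:
        text = " ".join(node.itertext()).strip()
        if len(text) >= MIN_EXTRACTED_SIZE:
            kept.append(node)
    return kept
```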
@hugoobauer Are you still working on this, or should I close the PR for now?
Hi @adbar, sorry, I'm too busy with work right now and don't have time to work on it. You can close it for now; hopefully I'll be able to get back to it later.
A pull request following our exchange: #4 (comment). The goal is to extract the content scattered across several nodes.
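For context, a minimal sketch of the merging idea, assuming lxml; the XPath expression and the helper are illustrative, not trafilatura's actual code:

```python
# Minimal sketch: keep every node returned by an XPath query and merge
# them under a single parent, instead of keeping only the first match.
# Assumes lxml; the expression below is illustrative, not trafilatura's.
from lxml import etree, html

EXAMPLE_XPATH = '//div[contains(@class, "post-content")]'  # hypothetical

def merge_matching_nodes(tree):
    """Collect all matching subtrees under one new parent element."""
    candidates = tree.xpath(EXAMPLE_XPATH)
    if not candidates:
        return None
    if len(candidates) == 1:
        return candidates[0]
    merged = etree.Element("div")
    for node in candidates:
        merged.append(node)  # append() moves the node under the new parent
    return merged

doc = html.fromstring(
    "<html><body>"
    "<div class='post-content'><p>Part one.</p></div>"
    "<aside>Related articles</aside>"
    "<div class='post-content'><p>Part two.</p></div>"
    "</body></html>"
)
print(etree.tostring(merge_matching_nodes(doc), pretty_print=True).decode())
```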
Before:
trafilatura recall
{'true positives': 2035, 'false positives': 231, 'true negatives': 2019, 'false negatives': 201, 'time': 10.339057922363281}
After:
trafilatura recall
{'true positives': 2032, 'false positives': 253, 'true negatives': 1997, 'false negatives': 204, 'time': 10.399115324020386}
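For easier comparison, these counters translate into the usual summary scores. A quick sketch using the standard definitions (this helper is not part of the evaluation script):

```python
# Turn the raw counters above into precision, recall, accuracy and
# F-score, using the standard definitions. Helper for comparison only.
def scores(tp, fp, tn, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    fscore = 2 * precision * recall / (precision + recall)
    return precision, recall, accuracy, fscore

print(scores(2035, 231, 2019, 201))  # before: ~(0.898, 0.910, 0.904, 0.904)
print(scores(2032, 253, 1997, 204))  # after:  ~(0.889, 0.909, 0.898, 0.899)
```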
While looking for the source of the differences, I noticed potential problems with the test documents.
For example, in the document bmjv.de.konsum.html, the string "Transparenz bei Preisanpassungen" appears in the "without" checks but occurs in several spots in the HTML, including in the correctly extracted content, so it mistakenly triggers a false positive. I wanted to give this feedback before continuing to check the differences in the tests caused by the committed changes.

I also disabled readabilipy in the tests because I couldn't get it to work, and fixed a deserialization problem in run_resiliparse.
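To make the failure mode concrete, here is a simplified sketch of how such with/without string checks behave; the actual evaluation script is more involved, but the principle is the same:

```python
# Simplified model of the with/without string checks described above:
# "with" strings should appear in the extracted text, "without" strings
# should not. A simplification of the evaluation logic, for illustration.
def evaluate(extracted_text, with_strings, without_strings):
    tp = sum(1 for s in with_strings if s in extracted_text)
    fn = len(with_strings) - tp
    fp = sum(1 for s in without_strings if s in extracted_text)
    tn = len(without_strings) - fp
    return {'tp': tp, 'fp': fp, 'tn': tn, 'fn': fn}

# If a "without" segment also occurs inside the legitimate article text,
# even a correct extraction is counted as a false positive:
extracted = "... article body that mentions Transparenz bei Preisanpassungen ..."
print(evaluate(extracted, [], ["Transparenz bei Preisanpassungen"]))
# {'tp': 0, 'fp': 1, 'tn': 0, 'fn': 0}
```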