
Merge multiple nodes returned by XPath #487

Closed
hugoobauer wants to merge 3 commits

Conversation

hugoobauer

A pull request following our exchange: #4 (comment)
The goal is to extract the content scattered across several <article> nodes.
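
For illustration, a minimal sketch of the idea with lxml (not the actual patch; the XPath expression and helper name are examples only):

```python
# Illustrative sketch only, not the actual patch: gather all <article> nodes
# matched by an XPath expression under a single parent so their content can
# be processed as one unit.
from lxml import html

def merge_article_nodes(doc_string):
    tree = html.fromstring(doc_string)
    nodes = tree.xpath('//article')    # may return several scattered nodes
    if len(nodes) <= 1:
        return nodes[0] if nodes else None
    container = html.Element('div')    # new parent for the merged content
    for node in nodes:
        container.append(node)         # append() moves the node into the container
    return container
```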

Before:
trafilatura recall
{'true positives': 2035, 'false positives': 231, 'true negatives': 2019, 'false negatives': 201, 'time': 10.339057922363281}
After:
trafilatura recall
{'true positives': 2032, 'false positives': 253, 'true negatives': 1997, 'false negatives': 204, 'time': 10.399115324020386}
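
For reference, precision, recall and F1 can be derived from these counts with the standard formulas (this is not output of the evaluation script, just the usual definitions):

```python
# Standard precision/recall/F1 derived from the reported counts.
def scores(tp, fp, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return round(precision, 3), round(recall, 3), round(f1, 3)

print(scores(2035, 231, 201))  # before: (0.898, 0.91, 0.904)
print(scores(2032, 253, 204))  # after:  (0.889, 0.909, 0.899)
```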

While looking into the differences, I noticed potential problems with the test documents.
For example, in bmjv.de.konsum.html, the segment "Transparenz bei Preisanpassungen" is listed in the "without" checks, but it appears in several spots in the HTML, including in the correctly extracted content, so it mistakenly triggers a false positive. I wanted to report this before continuing to check the test differences caused by the committed changes.
[Three screenshots taken 2024-01-24 at 11:24]

I also disabled readabilipy in tests because I couldn't get it to work, and fixed a deserialization problem in run_resiliparse.


adbar commented Jan 24, 2024

Hi @hugoobauer, thanks for the PR and your feedback.

You're welcome to try out the evaluation script and make changes. The same goes for the data: your example is telling, and at least in this case (and any others you find) the text segment should be replaced by another one.

If I understand your results correctly, your modification degrades accuracy by adding false positives? Even in favor_recall mode?


codecov bot commented Jan 24, 2024

Codecov Report

Attention: 5 lines in your changes are missing coverage. Please review.

Comparison is base (85cd3d8) 96.91% compared to head (65eccab) 96.93%.
Report is 7 commits behind head on master.

❗ Current head 65eccab differs from pull request most recent head 3f50d4e. Consider uploading reports for the commit 3f50d4e to get more accurate results

Files                  Patch %   Lines
trafilatura/core.py    44.44%    5 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #487      +/-   ##
==========================================
+ Coverage   96.91%   96.93%   +0.02%     
==========================================
  Files          22       22              
  Lines        3370     3428      +58     
==========================================
+ Hits         3266     3323      +57     
- Misses        104      105       +1     



hugoobauer commented Jan 24, 2024

Yes, the change adds false positives because some web pages have multiple <article> nodes, used for related content like article recommendations.
The changes only have an impact when favor_recall=True due to the added condition.
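
For context, this is the recall-oriented mode already exposed by the library; a minimal usage sketch (the URL is a placeholder):

```python
import trafilatura

downloaded = trafilatura.fetch_url("https://example.org/some-article")  # placeholder URL
text = trafilatura.extract(downloaded, favor_recall=True)  # recall-oriented extraction
```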


adbar commented Jan 25, 2024

Then we probably need to think twice about the change, because it doesn't bring much improvement on the evaluation dataset. Maybe another heuristic is needed, or we could expose your changes to the end user through an additional parameter.

It would be nice if you'd separate the changes to the evaluation from the ongoing work. If you open another pull request focusing only on the evaluation I can merge it sooner and we can continue looking for a solution.

hugoobauer (Author)

Sure, I can create two separate branches.
What about readabilipy, which I disabled?

For the original problem, it's tricky because there are more websites that are implemented correctly than websites with a "broken" architecture. I've done a few manual checks: the change fixes a few extractions, but in most cases it picks up article suggestions.
Some heuristic ideas (a rough sketch of both follows below):

  • a minimum length check for each node
  • an algorithm based on several criteria such as link density and text length, so that short texts leading to other articles would be excluded

What do you think?
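
Rough sketch of both ideas (thresholds and helper names are illustrative, not trafilatura internals; nodes are assumed to be lxml elements):

```python
# Rough sketch only; thresholds and helper names are illustrative,
# not trafilatura internals. Nodes are expected to be lxml elements.
MIN_NODE_LENGTH = 250      # in the spirit of MIN_EXTRACTED_SIZE mentioned below
MAX_LINK_DENSITY = 0.5     # arbitrary cut-off for teaser/recommendation blocks

def is_substantial(node):
    text = ''.join(node.itertext())
    if len(text) < MIN_NODE_LENGTH:
        return False                   # idea 1: too short, likely a teaser
    link_text = ''.join(t for a in node.iter('a') for t in a.itertext())
    link_density = len(link_text) / max(len(text), 1)
    return link_density <= MAX_LINK_DENSITY   # idea 2: mostly links -> exclude

def filter_article_nodes(nodes):
    return [node for node in nodes if is_substantial(node)]
```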

I've just tested the first idea by adding a minimum length for each node to avoid retrieving article suggestions, and the score is better.

Before any changes:
{'true positives': 2035, 'false positives': 231, 'true negatives': 2019, 'false negatives': 201, 'time': 11.863584518432617}

First suggestion:
{'true positives': 2032, 'false positives': 253, 'true negatives': 1997, 'false negatives': 204, 'time': 10.399115324020386}

-> Now:
{'true positives': 2011, 'false positives': 232, 'true negatives': 2018, 'false negatives': 225, 'time': 11.904361009597778}

The "true positives" is lower, probably because now it skips nodes with a length < 250 (MIN_EXTRACTED_SIZE), so it misses some texts. But now, false positives and true negatives are closer to the initial value.

I'll be on leave for a week starting tomorrow; I'll pick this up when I get back ;)

@adbar
Copy link
Owner

adbar commented Jan 26, 2024

  • Yes, just create another branch; you can comment out the readabilipy function. I'll test it and see if I can make it work; if not, we'll remove it.
  • The length check is a good idea. The rationale would be: if there is much more text to be found in further <article> tags than in the first one, take all <article> tags; if not (teasers), only take the first one (see the sketch after this list).
  • The tests are now failing, but that's not a problem at this stage; feel free to work on it later on, no hurry.
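
A sketch of that rationale (the factor is an arbitrary illustration, not an agreed-upon value; nodes are lxml elements):

```python
# Sketch of the rationale above; the factor is an arbitrary illustration.
def select_article_nodes(nodes, factor=2.0):
    if len(nodes) <= 1:
        return nodes
    lengths = [len(''.join(node.itertext())) for node in nodes]
    # much more text in the remaining <article> tags -> take them all,
    # otherwise treat them as teasers and keep only the first one
    if sum(lengths[1:]) > factor * lengths[0]:
        return nodes
    return nodes[:1]
```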

adbar mentioned this pull request Jan 26, 2024

adbar commented Mar 15, 2024

@hugoobauer Are you still working on it or should I close the PR for now?

hugoobauer (Author)

Hi @adbar, sorry, I'm too busy with work right now and don't have time to work on this. You can close it for now; hopefully I'll be able to pick it up later.

adbar closed this Mar 19, 2024