
Merge multiple nodes returned by XPath #487

Closed
hugoobauer wants to merge 3 commits

Conversation

hugoobauer

A pull request following our exchange: #4 (comment)
The goal is to extract the content scattered across several <article> nodes.
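
For illustration, a minimal sketch of the idea with lxml (not the actual patch; the XPath expression and helper name are examples only):

```python
# Illustrative sketch only, not the actual patch: gather all <article> nodes
# matched by an XPath expression under a single parent so their content can
# be processed as one unit.
from lxml import html

def merge_article_nodes(doc_string):
    tree = html.fromstring(doc_string)
    nodes = tree.xpath('//article')    # may return several scattered nodes
    if len(nodes) <= 1:
        return nodes[0] if nodes else None
    container = html.Element('div')    # new parent for the merged content
    for node in nodes:
        container.append(node)         # append() moves the node into the container
    return container
```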

Before:
trafilatura recall
{'true positives': 2035, 'false positives': 231, 'true negatives': 2019, 'false negatives': 201, 'time': 10.339057922363281}
After:
trafilatura recall
{'true positives': 2032, 'false positives': 253, 'true negatives': 1997, 'false negatives': 204, 'time': 10.399115324020386}
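
For reference, precision, recall and F1 can be derived from these counts with the standard formulas (this is not output of the evaluation script, just the usual definitions):

```python
# Standard precision/recall/F1 derived from the reported counts.
def scores(tp, fp, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return round(precision, 3), round(recall, 3), round(f1, 3)

print(scores(2035, 231, 201))  # before: (0.898, 0.91, 0.904)
print(scores(2032, 253, 204))  # after:  (0.889, 0.909, 0.899)
```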

While looking into the differences, I noticed potential problems with the test documents.
For example, in bmjv.de.konsum.html, the segment "Transparenz bei Preisanpassungen" is listed in the "without" checks, but it appears in several spots in the HTML, including in the correctly extracted content, so it mistakenly triggers a false positive. I wanted to report this before continuing to check the test differences caused by the committed changes.
[Three screenshots taken 2024-01-24 at 11:24]

I also disabled readabilipy in tests because I couldn't get it to work, and fixed a deserialization problem in run_resiliparse.


adbar commented Jan 24, 2024

Hi @hugoobauer, thanks for the PR and your feedback.

You're welcome to try out the evaluation script and make changes. The same goes for the data: your example is telling, and at least in this case (and any others you find) the text segment should be replaced by another one.

If I understand your results correctly, your modification degrades accuracy by adding false positives? Even in favor_recall mode?


codecov bot commented Jan 24, 2024

Codecov Report

Attention: 5 lines in your changes are missing coverage. Please review.

Comparison is base (85cd3d8) 96.91% compared to head (65eccab) 96.93%.
Report is 7 commits behind head on master.

❗ Current head 65eccab differs from pull request most recent head 3f50d4e. Consider uploading reports for the commit 3f50d4e to get more accurate results

Files                  Patch %   Lines
trafilatura/core.py    44.44%    5 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #487      +/-   ##
==========================================
+ Coverage   96.91%   96.93%   +0.02%     
==========================================
  Files          22       22              
  Lines        3370     3428      +58     
==========================================
+ Hits         3266     3323      +57     
- Misses        104      105       +1     



hugoobauer commented Jan 24, 2024

Yes, the change adds false positives because some web pages have multiple <article> nodes, used for related content like article recommendations.
The changes only have an impact when favor_recall=True due to the added condition.
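
For context, this is the recall-oriented mode already exposed by the library; a minimal usage sketch (the URL is a placeholder):

```python
import trafilatura

downloaded = trafilatura.fetch_url("https://example.org/some-article")  # placeholder URL
text = trafilatura.extract(downloaded, favor_recall=True)  # recall-oriented extraction
```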


adbar commented Jan 25, 2024

Then we probably need to think twice about the change, because it doesn't bring much improvement on the evaluation dataset. Maybe another heuristic is needed, or we could expose your changes to the end user through an additional parameter.

It would be nice if you'd separate the changes to the evaluation from the ongoing work. If you open another pull request focusing only on the evaluation I can merge it sooner and we can continue looking for a solution.

hugoobauer (Author)

Sure, I can create two separate branches.
What about readabilipy, which I disabled?

For the original problem, it's tricky because there are more websites that are implemented correctly than websites with a "broken" architecture. I've done a few manual checks: the change fixes a few extractions, but in most cases it picks up article suggestions.
Some heuristic ideas (a rough sketch of both follows below):

  • a minimum length check for each node
  • an algorithm based on several criteria such as link density and text length, so that short texts leading to other articles would be excluded

What do you think?
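
Rough sketch of both ideas (thresholds and helper names are illustrative, not trafilatura internals; nodes are assumed to be lxml elements):

```python
# Rough sketch only; thresholds and helper names are illustrative,
# not trafilatura internals. Nodes are expected to be lxml elements.
MIN_NODE_LENGTH = 250      # in the spirit of MIN_EXTRACTED_SIZE mentioned below
MAX_LINK_DENSITY = 0.5     # arbitrary cut-off for teaser/recommendation blocks

def is_substantial(node):
    text = ''.join(node.itertext())
    if len(text) < MIN_NODE_LENGTH:
        return False                   # idea 1: too short, likely a teaser
    link_text = ''.join(t for a in node.iter('a') for t in a.itertext())
    link_density = len(link_text) / max(len(text), 1)
    return link_density <= MAX_LINK_DENSITY   # idea 2: mostly links -> exclude

def filter_article_nodes(nodes):
    return [node for node in nodes if is_substantial(node)]
```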

I've just tested the first idea by adding a minimum length for each node to avoid retrieving article suggestions, and the score is better.

Before any changes:
{'true positives': 2035, 'false positives': 231, 'true negatives': 2019, 'false negatives': 201, 'time': 11.863584518432617}

First suggestion:
{'true positives': 2032, 'false positives': 253, 'true negatives': 1997, 'false negatives': 204, 'time': 10.399115324020386}

-> Now:
{'true positives': 2011, 'false positives': 232, 'true negatives': 2018, 'false negatives': 225, 'time': 11.904361009597778}

The "true positives" is lower, probably because now it skips nodes with a length < 250 (MIN_EXTRACTED_SIZE), so it misses some texts. But now, false positives and true negatives are closer to the initial value.

I'll be on leave for a week starting tomorrow; I'll pick this up when I get back ;)

@adbar
Copy link
Owner

adbar commented Jan 26, 2024

  • Yes, just create another branch; you can comment out the readabilipy function. I'll test it and see if I can make it work; if not, we'll remove it.
  • The length check is a good idea. The rationale would be: if there is much more text to be found in further <article> tags than in the first one, take all <article> tags; if not (teasers), only take the first one (see the sketch after this list).
  • The tests are now failing, but that's not a problem at this stage; feel free to work on it later on, no hurry.
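
A sketch of that rationale (the factor is an arbitrary illustration, not an agreed-upon value; nodes are lxml elements):

```python
# Sketch of the rationale above; the factor is an arbitrary illustration.
def select_article_nodes(nodes, factor=2.0):
    if len(nodes) <= 1:
        return nodes
    lengths = [len(''.join(node.itertext())) for node in nodes]
    # much more text in the remaining <article> tags -> take them all,
    # otherwise treat them as teasers and keep only the first one
    if sum(lengths[1:]) > factor * lengths[0]:
        return nodes
    return nodes[:1]
```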

adbar mentioned this pull request Jan 26, 2024

adbar commented Mar 15, 2024

@hugoobauer Are you still working on it or should I close the PR for now?

hugoobauer (Author)

Hi @adbar, sorry, I'm too busy with work right now and don't have time to work on this. You can close it for now; hopefully I'll be able to pick it up later.

adbar closed this Mar 19, 2024