
Conversation

@addie9800 (Collaborator) commented May 4, 2025

During the debugging of the publisher coverage, several issues arose:

  • We spend an unnecessary amount of time in the publisher coverage crawling publishers where the action bot is blocked.
  • We cannot handle URLs that are percent-encoded, e.g. www.test.de%2Fthis-is-the-site.html.
  • We cannot properly parse JSON that contains line breaks within strings (this occurs when crawling FreiePresse).

This PR addresses these three issues.
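
As a rough illustration of the second point, percent-decoding with urllib.parse.unquote recovers the intended URL. This is a minimal sketch; the actual clean_url helper added in this PR (named in the review below) may differ:

```python
from urllib.parse import unquote

def clean_url(url: str) -> str:
    # Hypothetical sketch: assumes the fix boils down to decoding
    # percent-encoded characters such as %2F (-> "/").
    return unquote(url)

print(clean_url("www.test.de%2Fthis-is-the-site.html"))
# -> www.test.de/this-is-the-site.html
```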

Update:

The final commit adds the following further changes:

  • Calls task_done() on response_queue, solving the main deadlock issue.
  • Fixes a bug where a sitemap containing only a newline character was not detected as an empty sitemap, causing Fundus to crash.
  • Moves the pool out of the context-manager setup, because the later call to pool.join() can cause unexpected behavior when the pool's shutdown is implicitly triggered on exiting the context manager; to maintain control, I decided to handle it manually (see the sketch after this list).
  • Modifies publisher_coverage.yaml to let the shell time out the program before it is killed by GitHub, so that the program's output is still recorded and accessible at a later point.
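
A minimal sketch of the manual pool handling, assuming a standard multiprocessing.pool.ThreadPool; fetch and urls are placeholders, not the actual Fundus code:

```python
from multiprocessing.pool import ThreadPool

def fetch(url: str) -> str:
    return url  # placeholder for the real per-publisher crawl work

urls = ["https://example.com/a", "https://example.com/b"]

# Created outside a `with` block, so nothing shuts the pool down
# implicitly and the close()/join() order stays under our control.
pool = ThreadPool(processes=2)
try:
    results = pool.map(fetch, urls)
finally:
    pool.close()  # no further tasks will be submitted
    pool.join()   # wait for the workers to finish, exactly once

print(results)
```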

@addie9800 addie9800 requested a review from Copilot May 4, 2025 16:52
Copilot AI (Contributor) left a comment


Pull Request Overview

This PR addresses three issues: reducing unnecessary crawling of blocked publishers, decoding percent-encoded URLs, and sanitizing JSON strings with unexpected line breaks.

  • Added a helper function (clean_url) in the URL scraper to decode URLs.
  • Updated the JSON sanitizer to remove newline and tab characters.
  • Introduced a blocked publishers list check in the publisher coverage script.
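
A simplified sketch of the sanitization idea from the second bullet; the real sanitizer in src/fundus/parser/utility.py may be more targeted:

```python
import json

def sanitize_json(raw: str) -> str:
    # Literal newlines and tabs are invalid inside JSON string values,
    # so removing them lets otherwise malformed documents parse.
    return raw.replace("\n", "").replace("\t", "")

malformed = '{"publisher": "Freie\nPresse"}'
print(json.loads(sanitize_json(malformed)))
# -> {'publisher': 'FreiePresse'}
```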

Reviewed Changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

  • src/fundus/scraping/url.py: Adds clean_url to decode percent-encoded URLs before yielding them in crawlers.
  • src/fundus/parser/utility.py: Adjusts the JSON sanitizer to remove newlines and tabs to better handle malformed JSON.
  • scripts/publisher_coverage.py: Implements a check to skip crawling for certain blocked publishers.
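
Illustratively, the skip check in scripts/publisher_coverage.py could look like the following; the list contents and names here are assumptions, not the actual code:

```python
from typing import NamedTuple

class Publisher(NamedTuple):
    name: str

# Publishers known to block the action bot; entries are placeholders.
BLOCKED_PUBLISHERS = {"BlockedExample"}

def crawl(publisher: Publisher) -> None:
    print(f"crawling {publisher.name}")

for publisher in [Publisher("BlockedExample"), Publisher("FreiePresse")]:
    if publisher.name in BLOCKED_PUBLISHERS:
        continue  # don't spend crawl time where the bot is blocked
    crawl(publisher)
```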

@addie9800 addie9800 requested a review from MaxDall May 4, 2025 16:55
@addie9800 addie9800 requested a review from Copilot May 22, 2025 13:18
Copilot AI (Contributor) left a comment


Pull Request Overview

This PR addresses several bugs affecting publisher coverage by improving URL handling, adjusting thread pool usage to prevent deadlocks, and refining JSON sanitization. Key changes include:

  • Introducing a new helper (clean_url) to decode URLs using unquote.
  • Moving ThreadPool initialization outside the context manager and manually handling its shutdown.
  • Refining JSON sanitization by stripping out newlines and tabs, and adding blocked publishers handling and a workflow timeout.
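
To see why the missing task_done() deadlocks, consider this self-contained illustration of queue.Queue's join/task_done contract (not the Fundus code itself):

```python
import queue

response_queue: queue.Queue = queue.Queue()
response_queue.put("response")

item = response_queue.get()
# ... process item ...
# Without this call, response_queue.join() would block forever:
# join() waits until every get() is matched by a task_done().
response_queue.task_done()

response_queue.join()  # returns immediately now
```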

Reviewed Changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.

  • src/fundus/scraping/url.py: Added clean_url and applied URL decoding in feed parsing and sitemap recursion.
  • src/fundus/scraping/session.py: Added a task_done() call before exiting to mitigate deadlock issues.
  • src/fundus/scraping/crawler.py: Moved ThreadPool initialization outside the context manager to control shutdown manually.
  • src/fundus/parser/utility.py: Updated JSON sanitization to remove newline and tab characters from the input.
  • scripts/publisher_coverage.py: Introduced a blocked publishers list to skip crawling for certain publishers.
  • .github/workflows/publisher_coverage.yaml: Modified the script run command to use a timeout, ensuring program output is retained.
Comments suppressed due to low confidence (1)

src/fundus/scraping/url.py:188

  • Ensure that trimming the response content here does not inadvertently remove necessary whitespace, especially for non-string or compressed content that may depend on specific formatting.
content = response.content.strip()
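
Related to that suppressed comment: the trim exists because of the empty-sitemap fix. Roughly, the guard behaves as in this illustrative sketch (not the exact function):

```python
def is_empty_sitemap(raw: bytes) -> bool:
    # Stripping first means a sitemap body that contains only
    # whitespace (e.g. a single newline) counts as empty instead of
    # being handed to the XML parser and crashing Fundus.
    return not raw.strip()

assert is_empty_sitemap(b"\n")
assert not is_empty_sitemap(b"<urlset></urlset>")
```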

@addie9800 addie9800 requested review from dobbersc and removed request for MaxDall May 22, 2025 13:22
@addie9800 addie9800 requested a review from dobbersc June 3, 2025 23:10
@dobbersc (Collaborator) left a comment


Thanks again :).

@addie9800 addie9800 merged commit 5f1b884 into master Jun 5, 2025
4 checks passed
@MaxDall MaxDall deleted the publisher-coverage-bug-fixes branch December 10, 2025 16:07