
Conversation

@addie9800 (Collaborator) commented May 4, 2025

During the debugging of the publisher coverage, several issues arose:

  • We spend an unnecessary amount of time in the publisher coverage crawling publishers where the action bot is blocked.
  • We cannot handle URLs that are percent-encoded, e.g. www.test.de%2Fthis-is-the-site.html.
  • We cannot properly parse JSON that contains line breaks within strings (this occurs when crawling FreiePresse).

This PR addresses these three issues.
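
As a rough illustration of the second point, percent-decoding with urllib.parse.unquote recovers the intended URL. This is a minimal sketch; the actual clean_url helper added in this PR (named in the review below) may differ:

```python
from urllib.parse import unquote

def clean_url(url: str) -> str:
    # Hypothetical sketch: assumes the fix boils down to decoding
    # percent-encoded characters such as %2F (-> "/").
    return unquote(url)

print(clean_url("www.test.de%2Fthis-is-the-site.html"))
# -> www.test.de/this-is-the-site.html
```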

Update:

The final commit adds the following further changes:

  • Calls task_done() on response_queue, solving the main deadlock issue.
  • Fixes a bug where a sitemap containing only a newline character was not detected as an empty sitemap, causing Fundus to crash.
  • Moves the pool out of the context-manager setup, because the later call to pool.join() can cause unexpected behavior when the pool's shutdown is implicitly triggered on exiting the context manager; to maintain control, I decided to handle it manually (see the sketch after this list).
  • Modifies publisher_coverage.yaml to let the shell time out the program before it is killed by GitHub, so that the program's output is still recorded and accessible at a later point.
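
A minimal sketch of the manual pool handling, assuming a standard multiprocessing.pool.ThreadPool; fetch and urls are placeholders, not the actual Fundus code:

```python
from multiprocessing.pool import ThreadPool

def fetch(url: str) -> str:
    return url  # placeholder for the real per-publisher crawl work

urls = ["https://example.com/a", "https://example.com/b"]

# Created outside a `with` block, so nothing shuts the pool down
# implicitly and the close()/join() order stays under our control.
pool = ThreadPool(processes=2)
try:
    results = pool.map(fetch, urls)
finally:
    pool.close()  # no further tasks will be submitted
    pool.join()   # wait for the workers to finish, exactly once

print(results)
```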

@addie9800 addie9800 requested a review from Copilot May 4, 2025 16:52
Copilot AI (Contributor) left a comment


Pull Request Overview

This PR addresses three issues: reducing unnecessary crawling of blocked publishers, decoding percent-encoded URLs, and sanitizing JSON strings with unexpected line breaks.

  • Added a helper function (clean_url) in the URL scraper to decode URLs.
  • Updated the JSON sanitizer to remove newline and tab characters.
  • Introduced a blocked publishers list check in the publisher coverage script.
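
A simplified sketch of the sanitization idea from the second bullet; the real sanitizer in src/fundus/parser/utility.py may be more targeted:

```python
import json

def sanitize_json(raw: str) -> str:
    # Literal newlines and tabs are invalid inside JSON string values,
    # so removing them lets otherwise malformed documents parse.
    return raw.replace("\n", "").replace("\t", "")

malformed = '{"publisher": "Freie\nPresse"}'
print(json.loads(sanitize_json(malformed)))
# -> {'publisher': 'FreiePresse'}
```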

Reviewed Changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

  • src/fundus/scraping/url.py: Adds clean_url to decode percent-encoded URLs before yielding them in crawlers.
  • src/fundus/parser/utility.py: Adjusts the JSON sanitizer to remove newlines and tabs to better handle malformed JSON.
  • scripts/publisher_coverage.py: Implements a check to skip crawling for certain blocked publishers.
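
Illustratively, the skip check in scripts/publisher_coverage.py could look like the following; the list contents and names here are assumptions, not the actual code:

```python
from typing import NamedTuple

class Publisher(NamedTuple):
    name: str

# Publishers known to block the action bot; entries are placeholders.
BLOCKED_PUBLISHERS = {"BlockedExample"}

def crawl(publisher: Publisher) -> None:
    print(f"crawling {publisher.name}")

for publisher in [Publisher("BlockedExample"), Publisher("FreiePresse")]:
    if publisher.name in BLOCKED_PUBLISHERS:
        continue  # don't spend crawl time where the bot is blocked
    crawl(publisher)
```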

@addie9800 addie9800 requested a review from MaxDall May 4, 2025 16:55
@addie9800 addie9800 requested a review from Copilot May 22, 2025 13:18
Copilot AI (Contributor) left a comment


Pull Request Overview

This PR addresses several bugs affecting publisher coverage by improving URL handling, adjusting thread pool usage to prevent deadlocks, and refining JSON sanitization. Key changes include:

  • Introducing a new helper (clean_url) to decode URLs using unquote.
  • Moving ThreadPool initialization outside the context manager and manually handling its shutdown.
  • Refining JSON sanitization by stripping out newlines and tabs, and adding blocked publishers handling and a workflow timeout.
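
To see why the missing task_done() deadlocks, consider this self-contained illustration of queue.Queue's join/task_done contract (not the Fundus code itself):

```python
import queue

response_queue: queue.Queue = queue.Queue()
response_queue.put("response")

item = response_queue.get()
# ... process item ...
# Without this call, response_queue.join() would block forever:
# join() waits until every get() is matched by a task_done().
response_queue.task_done()

response_queue.join()  # returns immediately now
```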

Reviewed Changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.

  • src/fundus/scraping/url.py: Added clean_url and applied URL decoding in feed parsing and sitemap recursion.
  • src/fundus/scraping/session.py: Added a task_done() call before exiting to mitigate deadlock issues.
  • src/fundus/scraping/crawler.py: Moved ThreadPool initialization outside the context manager to control shutdown manually.
  • src/fundus/parser/utility.py: Updated JSON sanitization to remove newline and tab characters from the input.
  • scripts/publisher_coverage.py: Introduced a blocked publishers list to skip crawling for certain publishers.
  • .github/workflows/publisher_coverage.yaml: Modified the script run command to use a timeout, ensuring program output is retained.
Comments suppressed due to low confidence (1)

src/fundus/scraping/url.py:188

  • Ensure that trimming the response content here does not inadvertently remove necessary whitespace, especially for non-string or compressed content that may depend on specific formatting.
content = response.content.strip()
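
Related to that suppressed comment: the trim exists because of the empty-sitemap fix. Roughly, the guard behaves as in this illustrative sketch (not the exact function):

```python
def is_empty_sitemap(raw: bytes) -> bool:
    # Stripping first means a sitemap body that contains only
    # whitespace (e.g. a single newline) counts as empty instead of
    # being handed to the XML parser and crashing Fundus.
    return not raw.strip()

assert is_empty_sitemap(b"\n")
assert not is_empty_sitemap(b"<urlset></urlset>")
```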

@addie9800 addie9800 requested review from dobbersc and removed request for MaxDall May 22, 2025 13:22
@addie9800 addie9800 requested a review from dobbersc June 3, 2025 23:10
@dobbersc (Collaborator) left a comment


Thanks again :).

@addie9800 addie9800 merged commit 5f1b884 into master Jun 5, 2025
4 checks passed
@MaxDall MaxDall deleted the publisher-coverage-bug-fixes branch December 10, 2025 16:07