Bug Fixes from Publisher Coverage #753
Conversation
Pull Request Overview
This PR addresses three issues: reducing unnecessary crawling of blocked publishers, decoding percent-encoded URLs, and sanitizing JSON strings with unexpected line breaks.
- Added a helper function (clean_url) in the URL scraper to decode percent-encoded URLs (see the sketch after this list).
- Updated the JSON sanitizer to remove newline and tab characters.
- Introduced a blocked publishers list check in the publisher coverage script.
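A minimal sketch of what such a helper might look like, assuming it simply wraps urllib.parse.unquote as the later review summary suggests; the actual implementation in src/fundus/scraping/url.py may differ:

```python
from urllib.parse import unquote

def clean_url(url: str) -> str:
    """Decode percent-encoded characters in a URL before yielding it."""
    return unquote(url)

# The malformed URL from this PR's discussion:
print(clean_url("www.test.de%2Fthis-is-the-site.html"))
# -> www.test.de/this-is-the-site.html
```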
Reviewed Changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| src/fundus/scraping/url.py | Adds clean_url to decode percent-encoded URLs before yielding them in crawlers |
| src/fundus/parser/utility.py | Adjusts the JSON sanitizer to remove newlines and tabs to better handle malformed JSON |
| scripts/publisher_coverage.py | Implements a check to skip crawling for certain blocked publishers |
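The blocked-publisher skip might look roughly like the following; the set name, its contents, and the publisher identifiers are assumptions for illustration, not the actual code in scripts/publisher_coverage.py:

```python
from typing import Iterable, List

# Hypothetical: the name and contents of the blocked list are assumptions.
BLOCKED_PUBLISHERS = {"FreiePresse"}

def publishers_to_crawl(publisher_names: Iterable[str]) -> List[str]:
    """Filter out publishers known to block the crawler."""
    return [name for name in publisher_names if name not in BLOCKED_PUBLISHERS]

print(publishers_to_crawl(["FreiePresse", "SomeOtherPublisher"]))
# -> ['SomeOtherPublisher']
```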
Pull Request Overview
This PR addresses several bugs affecting publisher coverage by improving URL handling, adjusting thread pool usage to prevent deadlocks, and refining JSON sanitization. Key changes include:
- Introducing a new helper (clean_url) to decode URLs using unquote.
- Moving ThreadPool initialization outside the context manager and manually handling its shutdown.
- Refining JSON sanitization by stripping out newlines and tabs, and adding blocked publishers handling and a workflow timeout.
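As a rough illustration of the sanitization idea (the function name and exact replacement strategy are assumptions; see src/fundus/parser/utility.py for the real code):

```python
import json

def sanitize_json(raw: str) -> dict:
    # Literal newline and tab characters inside JSON string values are
    # invalid per the JSON spec and make json.loads fail, so strip them
    # out before parsing.
    return json.loads(raw.replace("\n", "").replace("\t", ""))

broken = '{"headline": "First line\nSecond line"}'  # contains a real newline
print(sanitize_json(broken))  # -> {'headline': 'First lineSecond line'}
```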
Reviewed Changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| src/fundus/scraping/url.py | Added clean_url function and applied URL decoding in feed parsing and sitemap recursion. |
| src/fundus/scraping/session.py | Added a task_done() call before exiting to mitigate deadlock issues (see the sketch after this table). |
| src/fundus/scraping/crawler.py | Moved ThreadPool initialization outside the context manager to control shutdown manually. |
| src/fundus/parser/utility.py | Updated JSON sanitization to remove newline and tab characters from the input. |
| scripts/publisher_coverage.py | Introduced a blocked publishers list to skip crawling for certain publishers. |
| .github/workflows/publisher_coverage.yaml | Modified the script run command to use a timeout, ensuring program output is retained. |
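A sketch of the deadlock pattern that a missing task_done() creates, using standard queue.Queue semantics rather than the actual session.py code:

```python
import queue

response_queue: queue.Queue = queue.Queue()
response_queue.put("response")

# queue.Queue.join() only returns once task_done() has been called for
# every item that was put(). A consumer that get()s an item but exits
# without calling task_done() leaves join() blocked forever.
item = response_queue.get()
try:
    ...  # process the response
finally:
    response_queue.task_done()

response_queue.join()  # returns only because the counts are balanced
print("no deadlock")
```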
Comments suppressed due to low confidence (1)
src/fundus/scraping/url.py:188
- Ensure that trimming the response content here does not inadvertently remove necessary whitespace, especially for non-string or compressed content that may depend on specific formatting.
`content = response.content.strip()`
dobbersc left a comment:
Thanks again :).
During the debugging of the publisher coverage, several issues arose:
- Some publishers yield percent-encoded article URLs (e.g. `www.test.de%2Fthis-is-the-site.html`).
- Blocked publishers are crawled unnecessarily (e.g. `FreiePresse`).
- Some publishers serve JSON strings containing unexpected line breaks, which break parsing.

This PR addresses these three issues.
Update:
The final commit adds the following further changes:
- Added a missing `task_done()` call on `response_queue`, solving the main deadlock issue.
- Moved the `pool` out of the context manager setup, because the later call of `pool.join()` can cause unexpected behavior, since it is implicitly called when exiting the context manager. To maintain control, I decided to handle it manually (see the sketch below).
- Adjusted `publisher_coverage.yaml` to let the shell time out the program before it is killed by GitHub, allowing the output of the program to still be recorded and accessible at a later point.
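A sketch of the manual shutdown pattern described above, simplified and not the actual crawler.py logic:

```python
from multiprocessing.pool import ThreadPool

# `with ThreadPool(...) as pool:` calls terminate() in __exit__, which can
# interact badly with a join() issued later. Creating the pool outside a
# context manager keeps the shutdown sequence explicit and predictable.
pool = ThreadPool(processes=4)
try:
    results = pool.map(lambda x: x * x, range(10))
    print(results)
finally:
    pool.close()  # stop accepting new tasks
    pool.join()   # wait for the workers to finish
```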