Missing images in Wikipedia articles #141

Open
WolfgangDpunkt opened this issue Oct 13, 2022 · 6 comments

Comments

@WolfgangDpunkt

Environment

  • Operating System: debian (aarch64)
  • node --version: v17.9.0
  • npm --version: 8.18.0
  • yarn --version, if using Yarn:
  • percollate --version: v2.2.0

Description

When I convert Wikipedia articles to epubs with this otherwise great and very useful tool, some of the images get lost. An adblocker is not used in this environment.

Here is my command line:
percollate epub --individual --output /home/_Perco-Epubs/ https://en.wikipedia.org/wiki/Canada --debug

And here is the resulting epub. I had to zip it, as Github does not accept epub files:
-Canada.epub.zip

Here is a direct comparison: in the "British North America" section, the web version has two images and the epub version has none.
[Screenshot, 2022-10-13 10:14:06]

The epub does contain some images, so percollate does not ignore all of them, but it does ignore most.
What could be the reason? Thanks a lot!

Here comes the debug log:

~# percollate epub --individual --output /home/_Perco-Epubs/ https://en.wikipedia.org/wiki/Canada --debug
{
  command: 'epub',
  operands: [ 'https://en.wikipedia.org/wiki/Canada' ],
  opts: {
    individual: true,
    output: '/home/_Perco-Epubs/',
    debug: true
  }
}
Fetching: https://en.wikipedia.org/wiki/Canada ✓
Enhancing web page: https://en.wikipedia.org/wiki/Canada ✓
Saving EPUB...
Fetching: https://upload.wikimedia.org/wikipedia/commons/thumb/d/d9/Flag_of_Canada_%28Pantone%29.svg/125px-Flag_of_Canada_%28Pantone%29.svg.png
Fetching: https://upload.wikimedia.org/wikipedia/en/thumb/4/4f/Coat_of_arms_of_Canada.svg/85px-Coat_of_arms_of_Canada.svg.png
Fetching: https://upload.wikimedia.org/wikipedia/commons/thumb/6/67/CAN_orthographic.svg/220px-CAN_orthographic.svg.png
Fetching: https://upload.wikimedia.org/wikipedia/commons/thumb/b/b0/Increase2.svg/11px-Increase2.svg.png
Fetching: https://upload.wikimedia.org/wikipedia/commons/thumb/9/92/Decrease_Positive.svg/11px-Decrease_Positive.svg.png
Fetching: https://upload.wikimedia.org/wikipedia/commons/thumb/b/b0/Nouvelle-France_map-en.svg/260px-Nouvelle-France_map-en.svg.png
Fetching: https://upload.wikimedia.org/wikipedia/commons/thumb/3/31/Canada_WWI_l%27Emprunt_de_la_Victoire2.jpg/135px-Canada_WWI_l%27Emprunt_de_la_Victoire2.jpg
Fetching: https://upload.wikimedia.org/wikipedia/commons/thumb/b/bf/Canada_WWI_Victory_Bonds2.jpg/136px-Canada_WWI_Victory_Bonds2.jpg
Fetching: https://upload.wikimedia.org/wikipedia/commons/thumb/d/d7/Canada_topo.jpg/260px-Canada_topo.jpg
Fetching: https://upload.wikimedia.org/wikipedia/commons/thumb/1/10/Canada_K%C3%B6ppen.svg/260px-Canada_K%C3%B6ppen.svg.png
Fetching: https://upload.wikimedia.org/wikipedia/commons/thumb/b/bd/Toronto_from_above_at_night.jpg/240px-Toronto_from_above_at_night.jpg
Fetching: https://upload.wikimedia.org/wikipedia/commons/thumb/4/43/FTAs_with_Canada.svg/260px-FTAs_with_Canada.svg.png
Fetching: https://upload.wikimedia.org/wikipedia/commons/thumb/2/2d/STS-116_-_P5_Truss_hand-off_to_ISS_%28NASA_S116-E-05765%29.jpg/220px-STS-116_-_P5_Truss_hand-off_to_ISS_%28NASA_S116-E-05765%29.jpg
Fetching: https://upload.wikimedia.org/wikipedia/commons/thumb/7/7d/Censusdivisions-ethnic.png/240px-Censusdivisions-ethnic.png
Fetching: https://upload.wikimedia.org/wikipedia/commons/thumb/e/e0/Statue_outside_Union_Station.jpg/170px-Statue_outside_Union_Station.jpg
Fetching: https://upload.wikimedia.org/wikipedia/commons/thumb/e/e4/CBC_Radio_Canada_Chevrolet_Express_02.jpg/220px-CBC_Radio_Canada_Chevrolet_Express_02.jpg
Fetching: https://upload.wikimedia.org/wikipedia/commons/thumb/8/86/O-Canada-1908.pdf/page1-170px-O-Canada-1908.pdf.jpg
Fetching: https://upload.wikimedia.org/wikipedia/commons/thumb/2/26/Canada2010WinterOlympicsOTcelebration.jpg/220px-Canada2010WinterOlympicsOTcelebration.jpg
Fetching: https://upload.wikimedia.org/wikipedia/commons/thumb/4/47/Sound-icon.svg/45px-Sound-icon.svg.png
1141364 total bytes, archive closed
Saved EPUB: /home/_Perco-Epubs/-Canada.epub
@danburzo
Owner

danburzo commented Oct 14, 2022

Thanks @WolfgangDpunkt for the report, the issue should be fixed in version 2.2.1

@WolfgangDpunkt
Author

Thank you very much! I have completed the update and progress is noticeable. Indeed, it now works with the example article "Canada" from the English Wikipedia.
But, I'm afraid, the problem is not yet completely solved.

If you can find the patience to work on this problem further, I would be happy. Since there are hardly any other reliable tools to convert wiki articles to epub books via command line, I think the bug has a high relevance.

In this article, for example, almost all the pictures are missing:
https://de.wikipedia.org/wiki/Wien

However, the problem does not seem to be specific to the non-English versions of Wikipedia: in the English article "Vienna", many pictures are included in the epub, but not all of them:
https://en.wikipedia.org/wiki/Vienna#Culinary_specialities

The photo "Sachertorte", for example, is missing in the epub:
[Screenshot, 2022-10-17 09:23:52]

In fact, the debug log does not mention this photo's filename either; for whatever reason, the photo is skipped during the download (https://upload.wikimedia.org/wikipedia/commons/b/b8/Sachertorte_DSC03027.JPG)

danburzo reopened this Oct 17, 2022
@danburzo
Owner

Thanks for pointing out the broken pages, it will help out with debugging. This is mostly Readability removing the images; I will investigate how to prevent that from happening.

@danburzo
Owner

Seems that the HTML markup for images in Wikipedia is going to change soon: https://diff.wikimedia.org/2022/11/28/tech-news-2022-48/ (via @simevidas), so that may make handling them a bit easier.

danburzo added a commit that referenced this issue Mar 7, 2023
…e images to not be fetched for EPUB archive (Re: #141)
@danburzo
Owner

danburzo commented Mar 7, 2023

It turns out that there was more than one issue at play preventing one image or the other from being properly fetched/bundled:

  • on non-English Wikipedia pages, URLs pointing to what look like images but are in fact HTML pages were not excluded, due to the assumption they'd match wiki/File:. In fact, the File: part of the URL is localized, so you could have Fișier: or Datei:. Thanks @vongrad for investigating and submitting a patch!
  • additionally, regexes for matching image URLs were scattered in the codebase, and one of them was unintentionally case-sensitive, meaning it didn't match uppercase filenames such as Sachertorte_DSC03027.JPG.
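
To illustrate the two points above, here is a minimal sketch of a single, centralized check. It is not percollate's actual code: the names `IMAGE_EXT_RE`, `WIKI_NAMESPACE_RE` and `isImageURL` are hypothetical, the extension list is an assumption, and treating any `/wiki/<Namespace>:` path as an HTML page is a heuristic that may not hold for every wiki.

```javascript
// Hypothetical helper for illustration only, not percollate's implementation.

// One shared, case-insensitive pattern, so .JPG and .jpg are treated alike.
// The list of extensions is an assumption.
const IMAGE_EXT_RE = /\.(?:jpe?g|png|gif|svg|webp)$/i;

// File description pages live under /wiki/<Namespace>:<name>, and the
// namespace is localized (File:, Datei:, Fișier:, ...), so match any
// namespace instead of hard-coding "File:".
const WIKI_NAMESPACE_RE = /^\/wiki\/[^/:]+:/;

function isImageURL(href) {
  let pathname;
  try {
    // Decode percent-escapes so localized namespaces read naturally.
    pathname = decodeURIComponent(new URL(href).pathname);
  } catch (err) {
    return false; // not an absolute URL, or malformed escapes
  }
  if (WIKI_NAMESPACE_RE.test(pathname)) {
    return false; // an HTML page about the image, not the image itself
  }
  return IMAGE_EXT_RE.test(pathname);
}

console.log(isImageURL('https://upload.wikimedia.org/wikipedia/commons/b/b8/Sachertorte_DSC03027.JPG'));
console.log(isImageURL('https://de.wikipedia.org/wiki/Datei:Sachertorte_DSC03027.JPG'));
```

With this sketch, the first call reports the uppercase `.JPG` upload as an image, while the second rejects the German `Datei:` description page even though its URL ends in an image extension.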

There may be additional issues with Readability, as mentioned in earlier comments, but I'm confident that upgrading to the newest percollate release will fix a lot of Wikipedia images.

@WolfgangDpunkt
Author

Dear @danburzo and @vongrad ,

I am very grateful for your work and your attention to my questions. This will help me a lot. I will test the new version as soon as possible. I appreciate your dedication very much.
Kudos for how patiently you troubleshot this issue.
Thank you very much!
