Images requiring a Referer header are not fetched #172

Closed
jinshuqishi2019 opened this issue May 24, 2024 · 4 comments · Fixed by #173

@jinshuqishi2019

Environment

  • Operating System: Debian 10
  • node --version: v20.12.2
  • npm --version: 10.7.0
  • percollate --version: v4.1.1

Description

Hello!
When using percollate epub to generate EPUB files, I sometimes notice missing images or text. I found that some websites use hotlink protection and require the Referer header to be set on image requests; otherwise, downloading the images fails with a 403 error. The missing text is caused by content that is generated dynamically by JavaScript. Do you have any good solutions for these two situations? Thank you.

@danburzo
Owner

Do you happen to have an example command for a page whose images have hotlink prevention? There may be something we can do about it, if we fetch them like the browser would, by respecting the Referrer Policy.

As for content generated with JavaScript, percollate does not run the original webpage in Puppeteer (Chromium), so you must fetch the page externally. For example, monolith suggests using chromium:

chromium --headless --incognito --dump-dom https://github.com | monolith - -I -b https://github.com -o github.html

You could do something similar with Percollate:

chromium --headless --incognito --dump-dom https://github.com | percollate --url https://github.com -

@danburzo danburzo changed the title Image and text missing Images not fetched for EPUB; handling dynamic content May 25, 2024
@jinshuqishi2019
Author

jinshuqishi2019 commented May 26, 2024

Example 1 of missing images:

percollate html https://sspai.com/post/88977

Perhaps percollate could support adding custom request headers.
If the header Referer: https://sspai.com/ is added to the image requests, the images display normally.
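
The Referer behaviour described above can be checked outside percollate. Below is a minimal sketch using curl; the helper names origin_of and image_status are made up for illustration, and IMAGE_URL stands for an image address copied from the page (none is given in this thread). It assumes the site only checks the origin of the Referer, which matches the header value reported above.

```shell
#!/bin/sh
# origin_of URL (hypothetical helper): print scheme://host/ for a URL,
# e.g. the origin of the page that embeds the image.
origin_of() {
  printf '%s\n' "$1" | sed -E 's#^([A-Za-z][A-Za-z0-9+.-]*://[^/]+).*#\1/#'
}

# image_status IMAGE_URL [REFERER] (hypothetical helper): print the HTTP
# status code returned for the image, optionally sending a Referer header.
# This performs a real network request.
image_status() {
  if [ -n "${2:-}" ]; then
    curl -s -o /dev/null -w '%{http_code}\n' -H "Referer: $2" "$1"
  else
    curl -s -o /dev/null -w '%{http_code}\n' "$1"
  fi
}

# Usage (IMAGE_URL is a placeholder, not a real address from this thread):
#   image_status "$IMAGE_URL"
#   image_status "$IMAGE_URL" "$(origin_of https://sspai.com/post/88977)"
```

If hotlink protection is the cause, the first call should report the 403 mentioned in the description and the second should succeed.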

Example 2 of missing images:
I tried to follow the instructions above, but there were still problems. A more detailed explanation would help.

apt-get install chromium
chromium --headless --incognito --dump-dom --no-sandbox https://mp.weixin.qq.com/s/B_-_5k0JMiDoWWTAv7aSbw | percollate html https://mp.weixin.qq.com/s/B_-_5k0JMiDoWWTAv7aSbw

I can see the content, but the images are still missing.

@danburzo
Owner

Thanks for the test case. It seems that PDF generation has the same issue with images, because of the lack of Referer. I will look into what can be done.

As for the chromium command, the arguments to percollate are a bit obscure so let me unpack them a bit:

chromium --headless --incognito --dump-dom https://github.com | percollate --url https://github.com -

Notice the - argument at the very end: this makes Percollate read the HTML from the standard input instead of from a URL. This way it loads the result of the chromium command. Since we get the raw HTML, we must then provide the original URL manually, with the --url option, so that relative links (and other features) work correctly. See also this README section.
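
Applied to the earlier test case, the explanation above could be wrapped like this. This is only a sketch: render_and_convert is a made-up helper name, and the choice of the html command is taken from the earlier example in this thread. When running as root, chromium may also need the --no-sandbox flag used in that example.

```shell
#!/bin/sh
# render_and_convert FORMAT URL (hypothetical helper):
# let Chromium run the page's JavaScript and dump the resulting DOM,
# then hand that HTML to percollate on standard input.
render_and_convert() {
  format=$1
  url=$2
  # "-" tells percollate to read HTML from stdin;
  # --url supplies the original address so relative links resolve.
  chromium --headless --incognito --dump-dom "$url" |
    percollate "$format" --url "$url" -
}

# Usage:
#   render_and_convert html https://mp.weixin.qq.com/s/B_-_5k0JMiDoWWTAv7aSbw
```

The key difference from the earlier attempt is that the page URL is passed through --url and the trailing - replaces it as the input, so percollate does not fetch the page a second time by itself.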

@danburzo danburzo changed the title Images not fetched for EPUB; handling dynamic content Images requiring a Referer header are not fetched May 27, 2024
danburzo added a commit that referenced this issue May 27, 2024
* Set 'referrer' and 'referrerPolicy' when fetching inline images, re: #172

* Also send referrer when fetching images for EPUB.
@danburzo
Owner

Released a fix in a new version of percollate. For EPUBs, you can run the command as usual. For HTML and PDF output, you’ll need to use the --inline flag for images that require a Referer header.
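
As a rough illustration of the note above, the per-format difference could be captured in a small wrapper. The function name convert_after_fix is made up for this sketch, and it assumes a percollate version that includes the fix is installed.

```shell
#!/bin/sh
# convert_after_fix FORMAT URL (hypothetical helper):
# EPUB needs no extra flag; HTML and PDF need --inline so that images
# are fetched by percollate itself with the Referer set.
convert_after_fix() {
  format=$1
  url=$2
  case "$format" in
    epub) percollate epub "$url" ;;
    html|pdf) percollate "$format" --inline "$url" ;;
    *) echo "unsupported format: $format" >&2; return 1 ;;
  esac
}

# Usage, with the test case from this thread:
#   convert_after_fix html https://sspai.com/post/88977
#   convert_after_fix epub https://sspai.com/post/88977
```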
