Images requiring a Referer header are not fetched #172

jinshuqishi2019 · 2024-05-24T04:22:51Z

Environment

Operating System: Debian 10
node --version: v20.12.2
npm --version: 10.7.0
percollate --version: v4.1.1

Description

Hello!
When using percollate epub to generate EPUB files, I sometimes notice missing images or text. I found that some websites require a Referer header on image requests to prevent hotlinking; without it, the image downloads fail with a 403 error. The missing text is due to the content being generated dynamically by JavaScript. Do you have any good solutions for these two situations? Thank you.


danburzo · 2024-05-25T06:38:57Z

Do you happen to have an example command for a page whose images have hotlink prevention? There may be something we can do about it, if we fetch them like the browser would, by respecting the Referrer Policy.

As for content generated with JavaScript, percollate does not run the original webpage in Puppeteer (Chromium), so you must fetch the page externally. For example, monolith suggests using chromium:

chromium --headless --incognito --dump-dom https://github.com | monolith - -I -b https://github.com -o github.html

You could do something similar with Percollate:

chromium --headless --incognito --dump-dom https://github.com | percollate --url https://github.com -

jinshuqishi2019 · 2024-05-26T02:29:35Z

Example 1 of missing images:

percollate html https://sspai.com/post/88977

Perhaps percollate could support adding custom request headers.
If Referer: https://sspai.com/ is added to the image request headers, the images display normally.
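
One way to confirm that an image host enforces hotlink protection is to compare the HTTP status codes returned with and without a Referer header. The sketch below is a minimal check using curl. The function name check_hotlink and the image URL in the usage comment are hypothetical placeholders and are not taken from the page above.

```shell
# check_hotlink IMAGE_URL REFERER
# Hypothetical helper: prints the HTTP status code of an image request
# made first without and then with a Referer header.
check_hotlink() {
  img_url="$1"
  referer="$2"
  # -s silences progress output, -o /dev/null discards the body,
  # -w '%{http_code}' prints only the response status code.
  without=$(curl -s -o /dev/null -w '%{http_code}' "$img_url")
  # -e (--referer) sets the Referer request header.
  with=$(curl -s -o /dev/null -w '%{http_code}' -e "$referer" "$img_url")
  printf 'without Referer: %s\n' "$without"
  printf 'with Referer: %s\n' "$with"
}

# Usage (placeholder image URL; substitute one copied from the article):
# check_hotlink 'https://cdn.example.com/image.png' 'https://sspai.com/'
```

If the first status is 403 and the second is 200, the host is most likely filtering on the Referer header.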

Example 2 of missing images:
Following the instructions above, I tried the chromium approach, but there were still problems. A more detailed explanation would help.

apt-get install chromium
chromium --headless --incognito --dump-dom --no-sandbox https://mp.weixin.qq.com/s/B_-_5k0JMiDoWWTAv7aSbw | percollate html https://mp.weixin.qq.com/s/B_-_5k0JMiDoWWTAv7aSbw

I can see the content, but the images are still missing.

danburzo · 2024-05-27T05:41:42Z

Thanks for the test case. It seems that PDF generation has the same issue with images, because of the lack of Referer. I will look into what can be done.

As for the chromium command, the arguments to percollate are a bit obscure so let me unpack them a bit:

chromium --headless --incognito --dump-dom https://github.com | percollate --url https://github.com -

Notice the - argument at the very end: this makes Percollate read the HTML from the standard input instead of from a URL. This way it loads the result of the chromium command. Since we get the raw HTML, we must then provide the original URL manually, with the --url option, so that relative links (and other features) work correctly. See also this README section.
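
The two pieces described above (render the page in headless Chromium, then pipe the result into Percollate with - and --url) can be wrapped in a small shell function. This is a sketch rather than an official recipe: the function name render_and_convert is hypothetical, and it assumes an explicit percollate subcommand (html, pdf, or epub) is given, as in the reporter's percollate html example.

```shell
# render_and_convert SUBCOMMAND URL
# Hypothetical helper: renders URL in headless Chromium so that
# JavaScript-generated content is present, then hands the resulting
# DOM to percollate on standard input.
render_and_convert() {
  subcommand="$1"   # assumed to be one of: html, pdf, epub
  page_url="$2"
  # The trailing "-" tells percollate to read HTML from stdin;
  # --url supplies the original address so relative links resolve.
  chromium --headless --incognito --dump-dom "$page_url" |
    percollate "$subcommand" --url "$page_url" -
}

# Usage:
# render_and_convert html 'https://github.com'
```

Passing the same URL to both programs keeps the address used for rendering and the address used for resolving links in sync.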


danburzo · 2024-05-27T07:29:10Z

Released a fix in the latest percollate release. For EPUBs, you can run the command as usual. For HTML and PDF output, you’ll need to use the --inline flag for images that require a Referer header.
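
For reference, the invocations described in this comment might look like the following, using the reporter's sspai.com test page. This is a sketch that assumes a percollate release containing the fix is installed. The wrapper function convert_all is hypothetical and exists only to group the three commands.

```shell
# convert_all URL
# Hypothetical helper: runs the three percollate output commands
# with the flags described in the comment above.
convert_all() {
  page_url="$1"
  # EPUB: no extra flag is needed.
  percollate epub "$page_url"
  # HTML and PDF: pass --inline for pages whose images
  # require a Referer header.
  percollate html --inline "$page_url"
  percollate pdf --inline "$page_url"
}

# Usage:
# convert_all 'https://sspai.com/post/88977'
```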

danburzo added the Needs investigation label May 25, 2024

danburzo changed the title ~~Image and text missing~~ Images not fetched for EPUB; handling dynamic content May 25, 2024

danburzo changed the title ~~Images not fetched for EPUB; handling dynamic content~~ Images requiring a Referer header are not fetched May 27, 2024

danburzo added a commit that referenced this issue May 27, 2024

Set 'referrer' and 'referrerPolicy' when fetching inline images, re: #172

cf4d959

danburzo mentioned this issue May 27, 2024

Set 'referrer' and 'referrerPolicy' when fetching images #173

Merged

danburzo closed this as completed in #173 May 27, 2024

danburzo added a commit that referenced this issue May 27, 2024

Set 'referrer' and 'referrerPolicy' when fetching images (#173)

c8d4eb3

* Set 'referrer' and 'referrerPolicy' when fetching inline images, re: #172
* Also send referrer when fetching images for EPUB.

jinshuqishi2019 mentioned this issue May 27, 2024

Web pages cannot correctly identify and download image links. #174

Closed
