
How does the Reader bypass Cloudflare's protection? #66

Closed
backrunner opened this issue May 22, 2024 · 5 comments

Comments

@backrunner

I noticed that the Reader can read pages from https://openai.com, which is heavily protected by Cloudflare. Using 'puppeteer-extra-plugin-stealth' alone is not enough to bypass Cloudflare's protection.

In the source code, there's nothing that solves the captcha automatically, and nothing else related to bypassing the protection.

What I'd like to ask is whether you have other, more under-the-hood changes to Puppeteer that keep the Reader from being detected by Cloudflare.

We're trying to deploy a similar service privately, but we're having trouble getting close in terms of accessing page content, mainly because we have no way to get around the protection.

@nashdean

I am still getting stuck with Cloudflare on certain sites. This site, for example, still gets detected and stops the Reader from accessing the content: https://www.podbean.com/site/search/index?v={SEARCH+QUERY+HERE}. Before I found Jina-AI, I was able to bypass this using the seleniumbase BaseCase class. If you are trying a custom solution to avoid detection, I suggest checking out seleniumbase. The creator has good documentation and many YouTube tutorials on its use (it's a wrapper around Selenium).

https://seleniumbase.io/help_docs/uc_mode/#uc-mode

Does the Reader use something similar? It would be nice if it could avoid detection on sites like the example above, which is still getting caught.

@nashdean

Looking at the code, I see that puppeteer-extra-plugin-stealth is being used by the Reader. Is the team planning to also add an option for SeleniumBase? It would be nice to have a UC-mode option, which is what currently works for me with Cloudflare.

@backrunner
Author

backrunner commented May 24, 2024

puppeteer-extra-plugin-stealth can be detected by Cloudflare. References: https://github.com/berstend/puppeteer-extra/issues?q=is%3Aissue+is%3Aopen+cloudflare

I'm not convinced the Reader bypasses Cloudflare's protection with this plugin alone; the plugin's last commit was at least a year ago.

I just built a very similar reader with all the evasions from plugin-stealth, and it doesn't work.

SeleniumBase seems like a good solution, but for now I'd prefer one for Puppeteer, so I can deploy it to the edge alongside my TypeScript code.

@nomagick
Member

That might be the simple salvage function, which queries the Google web cache:

```ts
async salvage(url: string, page: Page) {
    this.logger.info(`Salvaging ${url}`);
    const googleArchiveUrl = `https://webcache.googleusercontent.com/search?q=cache:${encodeURIComponent(url)}`;
    const resp = await fetch(googleArchiveUrl, {
        headers: {
            'User-Agent': `Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.0; +https://openai.com/gptbot)`
        }
    });
    resp.body?.cancel().catch(() => void 0);
    if (!resp.ok) {
        this.logger.warn(`No salvation found for url: ${url}`, { status: resp.status, url });
        return null;
    }
    await page.goto(googleArchiveUrl, { waitUntil: ['load', 'domcontentloaded', 'networkidle0'], timeout: 15_000 }).catch((err) => {
        this.logger.warn(`Page salvation did not fully succeed.`, { err: marshalErrorLike(err) });
    });
    this.logger.info(`Salvation completed.`);
    return true;
}
```

It's not guaranteed to work, though.

Alternative approaches may also include querying from the Web Archive.
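For the Web Archive route, the Wayback Machine exposes a public Availability API at https://archive.org/wayback/available. A minimal TypeScript sketch of querying it, with helper names of my own choosing:

```typescript
// Build the Wayback Machine Availability API URL for a target page.
function waybackAvailabilityUrl(url: string): string {
    return `https://archive.org/wayback/available?url=${encodeURIComponent(url)}`;
}

// Ask the API for the closest archived snapshot of `url`; returns the
// snapshot URL, or null when nothing is archived or the request fails.
async function findWaybackSnapshot(url: string): Promise<string | null> {
    const resp = await fetch(waybackAvailabilityUrl(url));
    if (!resp.ok) {
        return null;
    }
    const data = await resp.json() as {
        archived_snapshots?: { closest?: { url: string; available: boolean } };
    };
    const closest = data.archived_snapshots?.closest;
    return closest?.available ? closest.url : null;
}
```

A snapshot URL found this way could then be passed to page.goto, the same way the salvage function above navigates to the Google cache URL.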

However, puppeteer-extra-plugin-stealth somehow doesn't fully work with the Web Archive.

Setting the UA to one of the famous bots, like Slackbot, GPTBot, or even Googlebot, sometimes also works because the site owner accepts them; in other cases, though, it triggers the site to block access directly.
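To illustrate that last point: in Puppeteer, switching the UA is a single page.setUserAgent call. A sketch of the idea; the GPTBot string matches the one in the salvage function above, while the Googlebot and Slackbot strings are the commonly published ones and should be verified against each vendor's documentation:

```typescript
// Well-known crawler user-agent strings. The GPTBot value matches the one
// used in the salvage function above; double-check the others against each
// vendor's published documentation before relying on them.
const BOT_USER_AGENTS: Record<string, string> = {
    gptbot: 'Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.0; +https://openai.com/gptbot)',
    googlebot: 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)',
    slackbot: 'Slackbot-LinkExpanding 1.0 (+https://api.slack.com/robots)',
};

// Look up a bot UA by name, failing loudly on unknown names.
function botUserAgent(name: string): string {
    const ua = BOT_USER_AGENTS[name];
    if (!ua) {
        throw new Error(`Unknown bot: ${name}`);
    }
    return ua;
}

// With a Puppeteer `page` in scope, applying it is one call:
//   await page.setUserAgent(botUserAgent('gptbot'));
```

Whether a site then serves content or blocks outright depends entirely on how its owner treats that bot.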

@backrunner
Author

I found the issue: I'm deploying to edge compute, and it seems too many people are requesting Google from there, so it hits the rate limit. The salvage itself works fine.

Thanks for your reply :D
