Using proxies
Using a proxy service is essential when scraping sites. Datacenter IPs are often blocked, so you're usually best off with a residential proxy service. See here for a list of proxy services.
Since proxy services require authentication, this is not as straightforward as passing a startup flag to Chrome. There are some helper libraries out there:
- https://github.com/Cuadrix/puppeteer-page-proxy
- https://github.com/gajus/puppeteer-proxy
- https://github.com/apify/proxy-chain
All of these have some kind of issue related to stealthiness though (headers being removed or wrongly capitalized, DNS leaks, etc.). The stealthiest approach is to use 3proxy instead, as a layer between the browser and the proxy service.
Download the latest 3proxy from https://3proxy.ru/download/stable/ and use the following config file as a starting point:
```
# run in the background and write the PID to a file
daemon
pidfile /tmp/3proxy.pid
maxconn 2048
# log traffic to a file
log /tmp/3proxy.log
logformat "L%O %I %T"
# no username/password needed for local clients, filter by source IP only
auth iponly
# don't resolve hostnames locally, let the parent proxy do it (avoids DNS leaks)
fakeresolve
# only accept connections from localhost
allow * 127.0.0.1 * *
# forward all traffic to the upstream proxy service
parent 1000 http IP PORT USER PASS
# listen on 127.0.0.1:23001 as an anonymous HTTP proxy
proxy -p23001 -i127.0.0.1 -a
```
Replace `IP` with the IP address of the proxy server, e.g. resolved with `dig +short proxy.server.com` (the reason we do the lookup before writing the config file is to avoid DNS leaks). `PORT` is the proxy server port, and `USER` and `PASS` are the username and password. Start 3proxy with `3proxy /path/to/config-file.cfg`; you can then start your Puppeteer browser with the launch flag `--proxy-server=localhost:23001`.
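For reference, here is a minimal Node.js sketch of the whole flow. `proxy.server.com`, port `8080`, and the `myuser`/`mypass` credentials are placeholders for your proxy service's details, and the config template mirrors the one above:

```js
const dns = require('dns').promises;
const fs = require('fs');
const { spawn } = require('child_process');
const puppeteer = require('puppeteer');

(async () => {
  // Resolve the proxy hostname ourselves so the config contains a raw IP
  // and no local DNS lookup for it can leak later.
  const [ip] = await dns.resolve4('proxy.server.com'); // placeholder hostname

  const config = `daemon
pidfile /tmp/3proxy.pid
maxconn 2048
log /tmp/3proxy.log
logformat "L%O %I %T"
auth iponly
fakeresolve
allow * 127.0.0.1 * *
parent 1000 http ${ip} 8080 myuser mypass
proxy -p23001 -i127.0.0.1 -a
`;
  fs.writeFileSync('/tmp/3proxy.cfg', config);

  // 3proxy daemonizes itself, so the spawned process returns immediately;
  // give it a moment to start listening before launching the browser.
  spawn('3proxy', ['/tmp/3proxy.cfg']);
  await new Promise((resolve) => setTimeout(resolve, 500));

  const browser = await puppeteer.launch({
    args: ['--proxy-server=localhost:23001'],
  });
  const page = await browser.newPage();
  await page.goto('https://example.com');
  await browser.close();
})();
```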
To watch your logfile, run `tail -f /tmp/3proxy.log`. To change your proxy settings (e.g. change your IP by changing a session identifier in the username), edit the config file, then send the `SIGUSR1` signal to the 3proxy PID in `/tmp/3proxy.pid` to make it reload its configuration.
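As an illustration, a small Node.js helper that rewrites the `parent` line with a new session id and signals 3proxy to reload. The `user-session-XXX` username scheme is a hypothetical placeholder; each proxy service uses its own format for session identifiers:

```js
const fs = require('fs');

// Hypothetical helper: swap the session identifier in the upstream username
// so the proxy service hands out a new IP, then ask 3proxy to reload.
function rotateSession(configPath, sessionId) {
  const config = fs.readFileSync(configPath, 'utf8').replace(
    /^parent 1000 http (\S+) (\S+) \S+ (\S+)$/m,
    (_, ip, port, pass) =>
      `parent 1000 http ${ip} ${port} user-session-${sessionId} ${pass}`
  );
  fs.writeFileSync(configPath, config);

  // 3proxy reloads its configuration on SIGUSR1.
  const pid = parseInt(fs.readFileSync('/tmp/3proxy.pid', 'utf8'), 10);
  process.kill(pid, 'SIGUSR1');
}

rotateSession('/tmp/3proxy.cfg', Date.now().toString());
```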