Firefox does not work with proxy. #320

Open
bboyadao opened this issue Sep 20, 2024 · 5 comments

@bboyadao

I just created an example spider.
Chromium works well, but with the setup below Firefox raises playwright._impl._errors.Error: Page.goto: NS_ERROR_PROXY_CONNECTION_REFUSED

I debugged ScrapyPlaywrightDownloadHandler._maybe_launch_browser and captured the launch_options passed there.

async def _maybe_launch_browser(self) -> None:
    async with self.browser_launch_lock:
        if not hasattr(self, "browser"):
            logger.info("Launching browser %s", self.browser_type.name)
            self.browser = await self.browser_type.launch(**self.config.launch_options)
            logger.info("Browser %s launched", self.browser_type.name)
            self.stats.inc_value("playwright/browser_count")
            self.browser.on("disconnected", self._browser_disconnected_callback)

I then copied those options into a plain Playwright script to test, and it works.

example_spider.py

import scrapy
from rich import print


class ExampleSpider(scrapy.Spider):
    name = "ex"
    start_urls = ["https://httpbin.org/get"]
    custom_settings = {
        "DOWNLOAD_HANDLERS": {
            "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
            "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
        },
        "PLAYWRIGHT_BROWSER_TYPE": "firefox",
        "PLAYWRIGHT_LAUNCH_OPTIONS": {
            "headless": False,
            "timeout": 20 * 1000,
            'proxy': {
                'server': '127.0.0.1:8888',
                'username': 'username',
                'password': 'password'
            }
        },
    }
    
    def start_requests(self):
        yield scrapy.Request(
            url=self.start_urls[0],
            callback=self.parse_detail,
            meta=dict(
                playwright=True,
                playwright_include_page=True,
                playwright_context_kwargs=dict(
                    java_script_enabled=True,
                    ignore_https_errors=True,
                ),
            
            )
        )
    
    async def parse_detail(self, response):
        print(f"Received response from {response.url}")
        yield {}

test_with_playwright.py

import asyncio

from playwright.async_api import async_playwright


async def run_playwright_with_proxy():
    kwargs = {
        'headless': False, 
        'timeout': 20000,
        'proxy': {
            'server': '127.0.0.1:8888',
            'username': 'username',
            'password': 'password'
        }
    }
    
    async with async_playwright() as p:
        browser = await p.firefox.launch(**kwargs)
        page = await browser.new_page()
        await page.goto("https://httpbin.org/get")
        await asyncio.sleep(100)
        print("Page Title:", await page.title())
        await browser.close()


if __name__ == "__main__":
    asyncio.run(run_playwright_with_proxy())
@elacuesta
Member

I can not reproduce with mitmproxy:

$ mitmproxy --proxyauth "user:pass"

Screenshot at 2024-09-23 10-21-46

Slightly adapted sample spider:

import scrapy

class ExampleSpider(scrapy.Spider):
    name = "ex"
    start_urls = ["https://httpbin.org/get"]
    custom_settings = {
        "DOWNLOAD_HANDLERS": {
            "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
            "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
        },
        "TWISTED_REACTOR": "twisted.internet.asyncioreactor.AsyncioSelectorReactor",
        "PLAYWRIGHT_BROWSER_TYPE": "firefox",
        "PLAYWRIGHT_LAUNCH_OPTIONS": {
            "headless": False,
            "timeout": 20 * 1000,
            'proxy': {
                "server": "127.0.0.1:8080",
                "username": "user",
                "password": "pass",
            }
        },
    }

    def start_requests(self):
        yield scrapy.Request(
            url=self.start_urls[0],
            callback=self.parse_detail,
            meta=dict(
                playwright=True,
                playwright_include_page=True,
                playwright_context_kwargs=dict(
                    java_script_enabled=True,
                    ignore_https_errors=True,
                ),

            )
        )

    async def parse_detail(self, response):
        print(f"Received response from {response.url}")
        page = response.meta["playwright_page"]
        await page.close()
$ scrapy runspider proxy.py
(...)
2024-09-23 10:21:22 [scrapy.core.engine] INFO: Spider opened
2024-09-23 10:21:22 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2024-09-23 10:21:22 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2024-09-23 10:21:22 [scrapy-playwright] INFO: Starting download handler
2024-09-23 10:21:22 [scrapy-playwright] INFO: Starting download handler
2024-09-23 10:21:27 [scrapy-playwright] INFO: Launching browser firefox
2024-09-23 10:21:27 [scrapy-playwright] INFO: Browser firefox launched
2024-09-23 10:21:27 [scrapy-playwright] DEBUG: Browser context started: 'default' (persistent=False, remote=False)
2024-09-23 10:21:28 [scrapy-playwright] DEBUG: [Context=default] New page created, page count is 1 (1 for all contexts)
2024-09-23 10:21:28 [scrapy-playwright] DEBUG: [Context=default] Request: <GET https://httpbin.org/get> (resource type: document)
2024-09-23 10:21:28 [scrapy-playwright] DEBUG: [Context=default] Response: <407 https://httpbin.org/get>
2024-09-23 10:21:28 [scrapy-playwright] DEBUG: [Context=default] Request: <GET https://httpbin.org/get> (resource type: document)
2024-09-23 10:21:28 [scrapy-playwright] DEBUG: [Context=default] Response: <200 https://httpbin.org/get>
2024-09-23 10:21:29 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://httpbin.org/get> (referer: None) ['playwright']
Received response from https://httpbin.org/get
2024-09-23 10:21:29 [scrapy.core.engine] INFO: Closing spider (finished)
(...)

Which proxy are you using? Perhaps this is an interaction with that specific provider.
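
The 407 followed by a 200 in the log above is consistent with a standard proxy authentication handshake: the first request goes out without credentials, the proxy answers 407 Proxy Authentication Required, and the browser retries with a Proxy-Authorization header. A minimal sketch of how that header value is built, assuming the proxy uses the HTTP Basic scheme (the function name is illustrative and not part of scrapy-playwright or Playwright):

```python
import base64


def basic_proxy_authorization(username: str, password: str) -> str:
    """Build the value of a Proxy-Authorization header for the Basic scheme."""
    # Basic auth joins user and password with a colon and base64-encodes the result.
    raw = f"{username}:{password}".encode("utf-8")
    token = base64.b64encode(raw).decode("ascii")
    return f"Basic {token}"


# Same credentials as in the mitmproxy example above.
print("Proxy-Authorization:", basic_proxy_authorization("user", "pass"))
```

If the proxy rejects this value, it answers 407 again, which matches the failing run with incorrect credentials reported later in this thread.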

@bboyadao
Author


2024-09-23 10:21:28 [scrapy-playwright] DEBUG: [Context=default] Request: <GET https://httpbin.org/get> (resource type: document)
2024-09-23 10:21:28 [scrapy-playwright] DEBUG: [Context=default] Response: <407 https://httpbin.org/get>
2024-09-23 10:21:28 [scrapy-playwright] DEBUG: [Context=default] Request: <GET https://httpbin.org/get> (resource type: document)
2024-09-23 10:21:28 [scrapy-playwright] DEBUG: [Context=default] Response: <200 https://httpbin.org/get>
2024-09-23 10:21:29 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://httpbin.org/get> (referer: None) ['playwright']
Received response from https://httpbin.org/get

I have some thoughts:

  • It looks like Scrapy got a 407 at first.
  • The next request was handled by Playwright.

In my case Scrapy gets the 407 and then marks the request as failed.

I use https://scrapoxy.io to manage proxies.

@elacuesta
Member

  • It looks like Scrapy got a 407 at first.
  • The next request was handled by Playwright.

All requests were routed through Playwright, notice the "scrapy-playwright" logger name:

2024-09-23 10:21:28 [scrapy-playwright] DEBUG: [Context=default] Response: <407 https://httpbin.org/get>
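
To make the point about logger names concrete, here is a small sketch that pulls the logger name and status code out of log lines in the format shown in this thread. The regular expression and the helper name are assumptions for illustration, not part of Scrapy or scrapy-playwright:

```python
import re

# Matches lines such as:
# 2024-09-23 10:21:28 [scrapy-playwright] DEBUG: [Context=default] Response: <407 https://httpbin.org/get>
LOG_LINE = re.compile(
    r"^\S+ \S+ \[(?P<logger>[^\]]+)\] \w+: .*Response: <(?P<status>\d{3}) (?P<url>[^>]+)>"
)


def responses_by_logger(lines):
    """Return (logger name, status code) for every 'Response:' log line."""
    found = []
    for line in lines:
        match = LOG_LINE.match(line)
        if match:
            found.append((match["logger"], int(match["status"])))
    return found


sample = [
    "2024-09-23 10:21:28 [scrapy-playwright] DEBUG: [Context=default] Response: <407 https://httpbin.org/get>",
    "2024-09-23 10:21:28 [scrapy-playwright] DEBUG: [Context=default] Response: <200 https://httpbin.org/get>",
    "2024-09-23 10:21:29 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://httpbin.org/get> (referer: None) ['playwright']",
]
print(responses_by_logger(sample))
```

Both the 407 and the 200 come back under the scrapy-playwright logger; the scrapy.core.engine line only reports the final crawled response.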

The provided spider works correctly with Scrapoxy. I've started it as indicated in their docs and I'm getting the following logs. There is a failure downloading the response, but that's reasonable because I did not add an actual proxy provider in the Scrapoxy configuration site.

2024-09-24 10:53:10 [scrapy.core.engine] INFO: Spider opened
2024-09-24 10:53:10 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2024-09-24 10:53:10 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2024-09-24 10:53:10 [scrapy-playwright] INFO: Starting download handler
2024-09-24 10:53:10 [scrapy-playwright] INFO: Starting download handler
2024-09-24 10:53:15 [scrapy-playwright] INFO: Launching browser firefox
2024-09-24 10:53:16 [scrapy-playwright] INFO: Browser firefox launched
2024-09-24 10:53:16 [scrapy-playwright] DEBUG: Browser context started: 'default' (persistent=False, remote=False)
2024-09-24 10:53:17 [scrapy-playwright] DEBUG: [Context=default] New page created, page count is 1 (1 for all contexts)
2024-09-24 10:53:17 [scrapy-playwright] DEBUG: [Context=default] Request: <GET https://httpbin.org/get> (resource type: document)
2024-09-24 10:53:17 [scrapy-playwright] DEBUG: [Context=default] Response: <407 https://httpbin.org/get>
2024-09-24 10:53:17 [scrapy-playwright] DEBUG: [Context=default] Request: <GET https://httpbin.org/get> (resource type: document)
2024-09-24 10:53:17 [scrapy-playwright] DEBUG: [Context=default] Response: <557 https://httpbin.org/get>
2024-09-24 10:53:17 [scrapy.core.engine] DEBUG: Crawled (557) <GET https://httpbin.org/get> (referer: None) ['playwright']
2024-09-24 10:53:17 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <557 https://httpbin.org/get>: HTTP status code is not handled or not allowed
2024-09-24 10:53:17 [scrapy.core.engine] INFO: Closing spider (finished)

However, if I pass incorrect credentials I do get the reported message:

2024-09-24 10:53:37 [scrapy.core.engine] INFO: Spider opened
2024-09-24 10:53:37 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2024-09-24 10:53:37 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2024-09-24 10:53:37 [scrapy-playwright] INFO: Starting download handler
2024-09-24 10:53:37 [scrapy-playwright] INFO: Starting download handler
2024-09-24 10:53:42 [scrapy-playwright] INFO: Launching browser firefox
2024-09-24 10:53:42 [scrapy-playwright] INFO: Browser firefox launched
2024-09-24 10:53:43 [scrapy-playwright] DEBUG: Browser context started: 'default' (persistent=False, remote=False)
2024-09-24 10:53:43 [scrapy-playwright] DEBUG: [Context=default] New page created, page count is 1 (1 for all contexts)
2024-09-24 10:53:43 [scrapy-playwright] DEBUG: [Context=default] Request: <GET https://httpbin.org/get> (resource type: document)
2024-09-24 10:53:43 [scrapy-playwright] DEBUG: [Context=default] Response: <407 https://httpbin.org/get>
2024-09-24 10:53:43 [scrapy-playwright] DEBUG: [Context=default] Request: <GET https://httpbin.org/get> (resource type: document)
2024-09-24 10:53:43 [scrapy-playwright] DEBUG: [Context=default] Response: <407 https://httpbin.org/get>
2024-09-24 10:53:43 [scrapy.core.scraper] ERROR: Error downloading <GET https://httpbin.org/get>
Traceback (most recent call last):
  File "/.../scrapy-playwright/venv-scrapy-playwright/lib/python3.10/site-packages/twisted/internet/defer.py", line 1999, in _inlineCallbacks
    result = context.run(
  File "/.../scrapy-playwright/venv-scrapy-playwright/lib/python3.10/site-packages/twisted/python/failure.py", line 519, in throwExceptionIntoGenerator
    return g.throw(self.value.with_traceback(self.tb))
  File "/.../scrapy-playwright/venv-scrapy-playwright/lib/python3.10/site-packages/scrapy/core/downloader/middleware.py", line 54, in process_request
    return (yield download_func(request=request, spider=spider))
  File "/.../scrapy-playwright/venv-scrapy-playwright/lib/python3.10/site-packages/twisted/internet/defer.py", line 1251, in adapt
    extracted: _SelfResultT | Failure = result.result()
  File "/.../scrapy-playwright/scrapy_playwright/handler.py", line 378, in _download_request
    return await self._download_request_with_retry(request=request, spider=spider)
  File "/.../scrapy-playwright/scrapy_playwright/handler.py", line 431, in _download_request_with_retry
    return await self._download_request_with_page(request, page, spider)
  File "/.../scrapy-playwright/scrapy_playwright/handler.py", line 460, in _download_request_with_page
    response, download = await self._get_response_and_download(request, page, spider)
  File "/.../scrapy-playwright/scrapy_playwright/handler.py", line 560, in _get_response_and_download
    response = await page.goto(url=request.url, **page_goto_kwargs)
  File "/.../scrapy-playwright/venv-scrapy-playwright/lib/python3.10/site-packages/playwright/async_api/_generated.py", line 8805, in goto
    await self._impl_obj.goto(
  File "/.../scrapy-playwright/venv-scrapy-playwright/lib/python3.10/site-packages/playwright/_impl/_page.py", line 524, in goto
    return await self._main_frame.goto(**locals_to_params(locals()))
  File "/.../scrapy-playwright/venv-scrapy-playwright/lib/python3.10/site-packages/playwright/_impl/_frame.py", line 145, in goto
    await self._channel.send("goto", locals_to_params(locals()))
  File "/.../scrapy-playwright/venv-scrapy-playwright/lib/python3.10/site-packages/playwright/_impl/_connection.py", line 59, in send
    return await self._connection.wrap_api_call(
  File "/.../scrapy-playwright/venv-scrapy-playwright/lib/python3.10/site-packages/playwright/_impl/_connection.py", line 514, in wrap_api_call
    raise rewrite_error(error, f"{parsed_st['apiName']}: {error}") from None
playwright._impl._errors.Error: Page.goto: NS_ERROR_PROXY_CONNECTION_REFUSED
Call log:
navigating to "https://httpbin.org/get", waiting until "load"

2024-09-24 10:53:43 [scrapy.core.engine] INFO: Closing spider (finished)
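
Since wrong credentials produce the same NS_ERROR_PROXY_CONNECTION_REFUSED message, it may be worth checking the credentials against the proxy outside of the browser first. A possible sketch, assuming the proxy dict has the same shape as the one passed in PLAYWRIGHT_LAUNCH_OPTIONS and that a server given without a scheme is an HTTP proxy (which is how Playwright documents the short form); the helper name is hypothetical:

```python
from urllib.parse import quote, urlsplit


def proxy_url_from_playwright(proxy: dict) -> str:
    """Turn a Playwright-style proxy dict into a single proxy URL with credentials."""
    server = proxy["server"]
    # Assumption: a bare "host:port" means an HTTP proxy.
    if "://" not in server:
        server = "http://" + server
    parts = urlsplit(server)

    credentials = ""
    if proxy.get("username"):
        # Percent-encode so characters like "@" or ":" cannot break the URL.
        credentials = quote(proxy["username"], safe="")
        if proxy.get("password"):
            credentials += ":" + quote(proxy["password"], safe="")
        credentials += "@"
    return f"{parts.scheme}://{credentials}{parts.netloc}"


print(proxy_url_from_playwright(
    {"server": "127.0.0.1:8888", "username": "username", "password": "password"}
))
```

The resulting URL can be handed to a command-line client (for example curl -x) or to any HTTP library that accepts a proxy URL; a 407 there would point at the credentials rather than at the browser.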

@honzajavorek

honzajavorek commented Oct 8, 2024

I also experienced NS_ERROR_PROXY_CONNECTION_REFUSED with Firefox. I'm pretty sure my proxy settings were right, but given the task at hand, my hunch is that this happens when the target blocks the proxy. I switched to Chromium just to test if the same scraper works better, and I get no errors. It's quite slow though, so superficially it seems that when the proxy gets blocked, scrapy-playwright knows how to recover and retry in case of Chromium, but fails with NS_ERROR_PROXY_CONNECTION_REFUSED in case of Firefox.

Update: With Chromium I get playwright._impl._errors.Error: Page.goto: net::ERR_INVALID_ARGUMENT instead 🤷‍♂️ Doesn't help me then to switch browsers, but perhaps this helps with figuring out what's the actual underlying problem.
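
Because each browser reports its own error code for a failed navigation, comparing runs is easier if the code is logged separately from the rest of the message. A rough sketch, assuming the exception text has the "Page.goto: <code>" shape seen in this thread; the helper is hypothetical and could be called from a request errback with the exception it receives:

```python
import re
from typing import Optional

# Assumption: the browser-specific code directly follows "Page.goto: ".
ERROR_CODE = re.compile(r"Page\.goto: (?P<code>[A-Za-z_:]+)")


def goto_error_code(exc: BaseException) -> Optional[str]:
    """Extract the browser-specific error code from a Page.goto failure message."""
    match = ERROR_CODE.search(str(exc))
    return match["code"] if match else None


firefox_error = Exception("Page.goto: NS_ERROR_PROXY_CONNECTION_REFUSED\nCall log:\n...")
chromium_error = Exception("Page.goto: net::ERR_INVALID_ARGUMENT")
print(goto_error_code(firefox_error))
print(goto_error_code(chromium_error))
```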

@sailod

sailod commented Jan 4, 2025

Quoting @elacuesta:

I can not reproduce with mitmproxy:
(...)
Which proxy are you using? Perhaps this is an interaction with that specific provider.

Did you try headless mode? I reproduced the error with the same config you specified, except with headless mode enabled (headless: True).
I have also been running it inside a container.
It may be related to this case:
microsoft/playwright#33663
even though I didn't set a specific UA or any other config that would interfere with the headers.
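
To compare headed and headless runs without changing anything else, the launch options can be derived from one base dict. A minimal sketch, assuming the same option names used for PLAYWRIGHT_LAUNCH_OPTIONS earlier in this thread; the helper name is made up for illustration:

```python
import copy


def launch_option_variants(base: dict) -> dict:
    """Return two copies of the launch options that differ only in 'headless'."""
    variants = {}
    for headless in (False, True):
        options = copy.deepcopy(base)  # keep the nested proxy dict independent
        options["headless"] = headless
        variants["headless" if headless else "headed"] = options
    return variants


base_options = {
    "timeout": 20 * 1000,
    "proxy": {"server": "127.0.0.1:8080", "username": "user", "password": "pass"},
}
for name, options in launch_option_variants(base_options).items():
    print(name, options)
```

Each variant can then be used as the PLAYWRIGHT_LAUNCH_OPTIONS value for a separate run of the same spider.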
