Description
I created an example spider. Chromium works well, but with the setup below, Firefox raises `NS_ERROR_PROXY_CONNECTION_REFUSED`:
`playwright._impl._errors.Error: Page.goto: NS_ERROR_PROXY_CONNECTION_REFUSED`
I debugged into `ScrapyPlaywrightDownloadHandler._maybe_launch_browser` and captured the `launch_options`:
```python
async def _maybe_launch_browser(self) -> None:
    async with self.browser_launch_lock:
        if not hasattr(self, "browser"):
            logger.info("Launching browser %s", self.browser_type.name)
            self.browser = await self.browser_type.launch(**self.config.launch_options)
            logger.info("Browser %s launched", self.browser_type.name)
            self.stats.inc_value("playwright/browser_count")
            self.browser.on("disconnected", self._browser_disconnected_callback)
```
I copied those launch options into a plain Playwright script to test, and there it works.
example_spider.py
```python
import scrapy
from rich import print


class ExampleSpider(scrapy.Spider):
    name = "ex"
    start_urls = ["https://httpbin.org/get"]
    custom_settings = {
        "DOWNLOAD_HANDLERS": {
            "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
            "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
        },
        "PLAYWRIGHT_BROWSER_TYPE": "firefox",
        "PLAYWRIGHT_LAUNCH_OPTIONS": {
            "headless": False,
            "timeout": 20 * 1000,
            "proxy": {
                "server": "127.0.0.1:8888",
                "username": "username",
                "password": "password",
            },
        },
    }

    def start_requests(self):
        yield scrapy.Request(
            url=self.start_urls[0],
            callback=self.parse_detail,
            meta=dict(
                playwright=True,
                playwright_include_page=True,
                playwright_context_kwargs=dict(
                    java_script_enabled=True,
                    ignore_https_errors=True,
                ),
            ),
        )

    async def parse_detail(self, response):
        print(f"Received response from {response.url}")
        yield {}
```
test_with_playwright.py
```python
import asyncio

from playwright.async_api import async_playwright


async def run_playwright_with_proxy():
    kwargs = {
        "headless": False,
        "timeout": 20000,
        "proxy": {
            "server": "127.0.0.1:8888",
            "username": "username",
            "password": "password",
        },
    }
    async with async_playwright() as p:
        browser = await p.firefox.launch(**kwargs)
        page = await browser.new_page()
        await page.goto("https://httpbin.org/get")
        await asyncio.sleep(100)
        print("Page Title:", await page.title())
        await browser.close()


if __name__ == "__main__":
    asyncio.run(run_playwright_with_proxy())
```
elacuesta commented on Sep 23, 2024
I cannot reproduce this with mitmproxy:
Slightly adapted sample spider:
Which proxy are you using? Perhaps this is an interaction with that specific provider.
bboyadao commented on Sep 24, 2024
I have some thoughts. In my case, Scrapy got a 407 response and marked the request as a failure.
I use https://scrapoxy.io to manage proxies.
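One way to narrow down a 407 like this is to check whether the proxy itself accepts the credentials, independently of Playwright and Scrapy. The sketch below uses hypothetical helper names and assumes a local forward proxy (such as Scrapoxy) listening on `127.0.0.1:8888`; it sends a `CONNECT` through the proxy with basic auth, so a tunnel failure with the same credentials Firefox uses would point at the proxy side rather than at scrapy-playwright:

```python
import base64
import http.client


def basic_proxy_auth(username: str, password: str) -> str:
    # Build the Proxy-Authorization header value for HTTP basic auth.
    token = base64.b64encode(f"{username}:{password}".encode()).decode()
    return f"Basic {token}"


def proxy_accepts_connect(host: str, port: int, username: str, password: str) -> bool:
    # Try to establish a CONNECT tunnel through the proxy to httpbin.org:443.
    # http.client raises OSError ("Tunnel connection failed: 407 ...") when
    # the proxy rejects the tunnel, e.g. on bad credentials.
    conn = http.client.HTTPConnection(host, port, timeout=10)
    conn.set_tunnel(
        "httpbin.org",
        443,
        headers={"Proxy-Authorization": basic_proxy_auth(username, password)},
    )
    try:
        conn.connect()
        return True
    except OSError:
        return False
    finally:
        conn.close()
```

For example, `proxy_accepts_connect("127.0.0.1", 8888, "username", "password")` returning `False` while a browser configured with the same proxy works would suggest the credentials or proxy setup are at fault.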
elacuesta commented on Sep 24, 2024
The provided spider works correctly with Scrapoxy. I've started it as indicated in their docs and I'm getting the following logs. All requests were routed through Playwright; notice the "scrapy-playwright" logger name. There is a failure downloading the response, but that's reasonable because I did not add an actual proxy provider in the Scrapoxy configuration site.
However, if I pass incorrect credentials I do get the reported message:
honzajavorek commented on Oct 8, 2024
I also experienced `NS_ERROR_PROXY_CONNECTION_REFUSED` with Firefox. I'm pretty sure my proxy settings were right, but given the task at hand, my hunch is that this happens when the target blocks the proxy. I switched to Chromium just to test whether the same scraper works better, and I get no errors. It's quite slow though, so superficially it seems that when the proxy gets blocked, scrapy-playwright knows how to recover and retry in the case of Chromium, but fails with `NS_ERROR_PROXY_CONNECTION_REFUSED` in the case of Firefox.
Update: With Chromium I get `playwright._impl._errors.Error: Page.goto: net::ERR_INVALID_ARGUMENT` instead 🤷‍♂️ Switching browsers doesn't help me then, but perhaps this helps with figuring out the actual underlying problem.
sailod commented on Jan 4, 2025
Did you try headless mode? I reproduced this with the same config you specified, except in headless mode (`headless: True`). Also, I've been running it inside a container.
Maybe it's related to this case:
microsoft/playwright#33663
even though I didn't set any specific UA or other config that should mess with the headers.
elacuesta commented on Jan 10, 2025
Indeed, as of today the spider from my previous comment is failing for me with mitmproxy:
I didn't record which versions I was using back then; now I have:
However, setting `PLAYWRIGHT_PROCESS_REQUEST_HEADERS=None` works for both `headless=False` and `headless=True`.
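For reference, a minimal settings sketch of that workaround, reusing the values from the spider in the issue (the proxy address and credentials are the placeholders from above). Per the scrapy-playwright docs, `PLAYWRIGHT_PROCESS_REQUEST_HEADERS = None` means request headers are not overridden and the browser sends its own headers:

```python
# settings.py sketch of the workaround, assuming the setup from the issue.
DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}
PLAYWRIGHT_BROWSER_TYPE = "firefox"
PLAYWRIGHT_LAUNCH_OPTIONS = {
    "headless": True,
    "proxy": {
        "server": "127.0.0.1:8888",
        "username": "username",
        "password": "password",
    },
}
# Let the browser send its own headers instead of Scrapy's; with Firefox this
# avoids interfering with the proxy authentication handshake.
PLAYWRIGHT_PROCESS_REQUEST_HEADERS = None
```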