
Proxy removes cookies #117

Open

jayavinothmoorthy opened this issue Aug 9, 2022 · 5 comments
Labels
bug Something isn't working upstream issue

Comments

jayavinothmoorthy commented Aug 9, 2022

Without a proxy, the cookie is applied correctly. But when I use a proxy (Bright Data), the cookie is not applied. Did I miss anything?

import scrapy


class ScrapyTest(scrapy.Spider):
    name = 'scrapy_test'

    def start_requests(self):

        cookies = {
            'cookieconsent_dismissed': 'yes'
        }

        url = 'https://example.com'

        yield scrapy.Request(url, cookies=cookies, meta={"playwright": True}, callback=self.parse)

settings.py

PLAYWRIGHT_LAUNCH_OPTIONS = {
    "proxy": {
        'server': 'http://zproxy.lum-superproxy.io:22225',
        'username': 'lum-customer-user',
        'password': 'password'
    }
}

COOKIES_ENABLED = True
elacuesta (Member) commented Aug 10, 2022

Thanks for the report, I can reproduce with the following spider and a mitmproxy instance running locally:

from scrapy import Request, Spider

class PlaywrightSpiderWithProxy(Spider):
    name = "proxy-spider"
    custom_settings = {
        "LOG_LEVEL": "INFO",
        "TWISTED_REACTOR": "twisted.internet.asyncioreactor.AsyncioSelectorReactor",
        "DOWNLOAD_HANDLERS": {
            "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
            # "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
        },
        "PLAYWRIGHT_LAUNCH_OPTIONS": {
            "proxy": {
                # on a separate terminal:
                # ./mitmproxy --proxyauth "user:pass"
                "server": "http://127.0.0.1:8080",
                "username": "user",
                "password": "pass",
            },
        },
    }

    def start_requests(self):
        yield Request(
            url="http://httpbin.org/headers",
            meta={"playwright": True},
            cookies={"foo": "bar"},
        )

    def parse(self, response):
        print(response.request.headers["Cookie"])
        print(response.text)

The cookie is in the request headers; however, no "Cookie" header was received by the server:

b'foo=bar'
<html><head></head><body><pre style="word-wrap: break-word; white-space: pre-wrap;">{
  "headers": {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8", 
    "Accept-Encoding": "gzip, deflate", 
    "Accept-Language": "en", 
    "Cache-Control": "no-cache", 
    "Content-Length": "0", 
    "Host": "httpbin.org", 
    "Pragma": "no-cache", 
    "Proxy-Connection": "keep-alive", 
    "User-Agent": "Scrapy/2.6.0 (+https://scrapy.org)", 
    "X-Amzn-Trace-Id": "Root=1-62f39a25-2d7f27dc07329815654c8f19"
  }
}
</pre></body></html>

Without configuring the proxy:

b'foo=bar'
<html><head></head><body><pre style="word-wrap: break-word; white-space: pre-wrap;">{
  "headers": {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8", 
    "Accept-Encoding": "gzip, deflate", 
    "Accept-Language": "en", 
    "Cache-Control": "no-cache", 
    "Content-Length": "0", 
    "Cookie": "foo=bar", 
    "Host": "httpbin.org", 
    "Pragma": "no-cache", 
    "User-Agent": "Scrapy/2.6.0 (+https://scrapy.org)", 
    "X-Amzn-Trace-Id": "Root=1-62f39baa-6929799a29e23a171baf57f3"
  }
}
</pre></body></html>

Interestingly, at this point overrides["headers"]["cookie"] is foo=bar in both cases, and printing the request headers in the callback shows the expected value, as in the output I posted above.
I'll need to investigate further to determine whether there's anything else the handler is doing that might cause this, or whether this is an upstream issue.

@elacuesta elacuesta added bug Something isn't working upstream issue labels Aug 10, 2022
elacuesta (Member)

This seems to be an upstream thing; I just opened microsoft/playwright#16439 to ask about it.
There might be a way to work around this by setting cookies in the context before sending the request. However, I'm not sure: cookies are set for the whole context and apply to multiple requests, which isn't necessarily what we want here (clearing and repopulating the context cookies after each request seems like overkill). Let's wait and see what the Playwright team says.
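A minimal sketch of that context-level workaround (untested against this issue; the cookie name/value are placeholders, and the helper assumes the callback has access to a Playwright page object, e.g. via scrapy-playwright's playwright_include_page meta key):

```python
# Sketch of a possible workaround, not a confirmed fix: add the cookie to the
# browser context instead of relying on the per-request "Cookie" header.
# Cookie names, values, and the URL below are placeholders.
CONTEXT_COOKIES = [
    {"name": "foo", "value": "bar", "url": "http://httpbin.org"},
]


async def seed_context_cookies(page):
    # Cookies added here apply to every request made in this context,
    # which is exactly the caveat mentioned above.
    await page.context.add_cookies(CONTEXT_COOKIES)
```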

@blacksteel1288

@elacuesta I'm experiencing the same issue. What's the proper way to set cookies for the whole context?

@elacuesta (Member)

To set cookies for a whole context at the Playwright level I'd say there are at least 3 ways:

  1. request the page object in the callback with playwright_include_page, access its context, and call BrowserContext.add_cookies on it
  2. specifying storage_state in the PLAYWRIGHT_CONTEXTS setting
  3. specifying storage_state in the playwright_context_kwargs request meta key

Examples for 2 & 3 can be found in the contexts.py file within the examples directory. There's also an example on accessing the context in a callback for (1) in these lines.

To be clear, I don't know whether these methods avoid the cookie-dropping when using proxies; please report back your findings if you can.
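As a rough illustration of option 2 (a sketch only, not verified to fix this issue; the context name, cookie, and domain are placeholders), a storage_state entry in the PLAYWRIGHT_CONTEXTS setting might look like:

```python
# Hypothetical settings.py fragment: seed a context with a cookie via
# storage_state (option 2 above). All names and values are placeholders;
# storage_state cookies use "domain"/"path" rather than "url".
PLAYWRIGHT_CONTEXTS = {
    "default": {
        "storage_state": {
            "cookies": [
                {
                    "name": "cookieconsent_dismissed",
                    "value": "yes",
                    "domain": "example.com",
                    "path": "/",
                },
            ],
        },
    },
}
```

Requests would then opt into that context with meta={"playwright": True, "playwright_context": "default"}; option 3 passes the same kind of dict via the playwright_context_kwargs meta key instead.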

AdilKhan000 commented Nov 3, 2024

Is this issue similar to #4717?
