
Proxy removes cookies #117

Open

jayavinothmoorthy opened this issue Aug 9, 2022 · 5 comments
Labels
bug Something isn't working upstream issue

Comments

jayavinothmoorthy commented Aug 9, 2022

Without a proxy, the cookie is applied correctly. But when I use a proxy (Bright Data), the cookie is not applied. Did I miss anything?

import scrapy


class ScrapyTest(scrapy.Spider):
    name = 'scrapy_test'

    def start_requests(self):

        cookies = {
            'cookieconsent_dismissed': 'yes'
        }

        url = 'https://example.com'

        yield scrapy.Request(url, cookies=cookies, meta={"playwright": True}, callback=self.parse)

settings.py

PLAYWRIGHT_LAUNCH_OPTIONS = {
    "proxy": {
        'server': 'http://zproxy.lum-superproxy.io:22225',
        'username': 'lum-customer-user',
        'password': 'password'
    }
}

COOKIES_ENABLED = True
elacuesta (Member) commented Aug 10, 2022

Thanks for the report, I can reproduce with the following spider and a mitmproxy instance running locally:

from scrapy import Request, Spider

class PlaywrightSpiderWithProxy(Spider):
    name = "proxy-spider"
    custom_settings = {
        "LOG_LEVEL": "INFO",
        "TWISTED_REACTOR": "twisted.internet.asyncioreactor.AsyncioSelectorReactor",
        "DOWNLOAD_HANDLERS": {
            "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
            # "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
        },
        "PLAYWRIGHT_LAUNCH_OPTIONS": {
            "proxy": {
                # on a separate terminal:
                # ./mitmproxy --proxyauth "user:pass"
                "server": "http://127.0.0.1:8080",
                "username": "user",
                "password": "pass",
            },
        },
    }

    def start_requests(self):
        yield Request(
            url="http://httpbin.org/headers",
            meta={"playwright": True},
            cookies={"foo": "bar"},
        )

    def parse(self, response):
        print(response.request.headers["Cookie"])
        print(response.text)

The cookie is in the request headers; however, no "Cookie" header was received by the server:

b'foo=bar'
<html><head></head><body><pre style="word-wrap: break-word; white-space: pre-wrap;">{
  "headers": {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8", 
    "Accept-Encoding": "gzip, deflate", 
    "Accept-Language": "en", 
    "Cache-Control": "no-cache", 
    "Content-Length": "0", 
    "Host": "httpbin.org", 
    "Pragma": "no-cache", 
    "Proxy-Connection": "keep-alive", 
    "User-Agent": "Scrapy/2.6.0 (+https://scrapy.org)", 
    "X-Amzn-Trace-Id": "Root=1-62f39a25-2d7f27dc07329815654c8f19"
  }
}
</pre></body></html>

Without configuring the proxy:

b'foo=bar'
<html><head></head><body><pre style="word-wrap: break-word; white-space: pre-wrap;">{
  "headers": {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8", 
    "Accept-Encoding": "gzip, deflate", 
    "Accept-Language": "en", 
    "Cache-Control": "no-cache", 
    "Content-Length": "0", 
    "Cookie": "foo=bar", 
    "Host": "httpbin.org", 
    "Pragma": "no-cache", 
    "User-Agent": "Scrapy/2.6.0 (+https://scrapy.org)", 
    "X-Amzn-Trace-Id": "Root=1-62f39baa-6929799a29e23a171baf57f3"
  }
}
</pre></body></html>

Interestingly, at this point overrides["headers"]["cookie"] is foo=bar in both cases, and printing the request headers in the callback shows the expected value, as in the output I posted above.
I'll need to investigate further to determine whether there's anything else the handler is doing that might cause this, or whether this is an upstream issue.

@elacuesta elacuesta added bug Something isn't working upstream issue labels Aug 10, 2022
elacuesta (Member)

This seems to be an upstream thing; I just opened microsoft/playwright#16439 to ask about it.
There might be a way to work around this by setting cookies in the context before sending the request. However, I'm not sure: cookies are set for the whole context and apply to multiple requests, which isn't necessarily what we want here (clearing and repopulating the context cookies after each request seems like overkill). Let's wait and see what the Playwright team says.
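A minimal sketch of that context-level workaround (untested against this issue; the cookie name/value are placeholders, and the helper assumes the callback has access to a Playwright page object, e.g. via scrapy-playwright's playwright_include_page meta key):

```python
# Sketch of a possible workaround, not a confirmed fix: add the cookie to the
# browser context instead of relying on the per-request "Cookie" header.
# Cookie names, values, and the URL below are placeholders.
CONTEXT_COOKIES = [
    {"name": "foo", "value": "bar", "url": "http://httpbin.org"},
]


async def seed_context_cookies(page):
    # Cookies added here apply to every request made in this context,
    # which is exactly the caveat mentioned above.
    await page.context.add_cookies(CONTEXT_COOKIES)
```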

@blacksteel1288

@elacuesta I'm experiencing the same issue. What's the proper way to set cookies for the whole context?

@elacuesta (Member)

To set cookies for a whole context at the Playwright level I'd say there are at least 3 ways:

  1. request the page object in the callback with playwright_include_page, access its context, and call BrowserContext.add_cookies on it
  2. specifying storage_state in the PLAYWRIGHT_CONTEXTS setting
  3. specifying storage_state in the playwright_context_kwargs request meta key

Examples for 2 & 3 can be found in the contexts.py file within the examples directory. There's also an example on accessing the context in a callback for (1) in these lines.

To be clear, I don't know whether these methods avoid the cookie-dropping when using proxies; please report back your findings if you can.
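As a rough illustration of option 2 (a sketch only, not verified to fix this issue; the context name, cookie, and domain are placeholders), a storage_state entry in the PLAYWRIGHT_CONTEXTS setting might look like:

```python
# Hypothetical settings.py fragment: seed a context with a cookie via
# storage_state (option 2 above). All names and values are placeholders;
# storage_state cookies use "domain"/"path" rather than "url".
PLAYWRIGHT_CONTEXTS = {
    "default": {
        "storage_state": {
            "cookies": [
                {
                    "name": "cookieconsent_dismissed",
                    "value": "yes",
                    "domain": "example.com",
                    "path": "/",
                },
            ],
        },
    },
}
```

Requests would then opt into that context with meta={"playwright": True, "playwright_context": "default"}; option 3 passes the same kind of dict via the playwright_context_kwargs meta key instead.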

AdilKhan000 commented Nov 3, 2024

Is this issue similar to #4717?
