Cannot download binary file (PDF) with Chromium headless=new mode #243

Open
tommylge opened this issue Nov 15, 2023 · 13 comments

Comments

@tommylge

I am facing an issue when trying to download a PDF file with Chromium: response.body is the viewer plugin HTML, not the PDF bytes.

There's already a related fix here: 0140b90

It worked for a month, but not anymore; I'm getting the issue again :/

My code hasn't changed since your fix that worked.

The related issue: #184

@elacuesta
Member

Please provide a minimal, reproducible example.

@tommylge
Author

import scrapy

class AwesomeSpider(scrapy.Spider):
    name = "test_dl"
    handle_httpstatus_list = [403]

    def start_requests(self):
        # GET request
        yield scrapy.Request(
            "https://defret.in/assets/certificates/attestation_secnumacademie.pdf",
            meta={
                "playwright": True,
                "playwright_page_goto_kwargs": {
                    "wait_until": "networkidle",
                },
            },
            callback=self.parse,
        )

    async def parse(self, response):
        print(response.body)

output:

<!DOCTYPE html><html><head></head><body style="height: 100%; width: 100%; overflow: hidden; margin:0px; background-color: rgb(38, 38, 38);"><embed name="4C80DFDA2738145655DE7937BDA51A0F" style="position:absolute; left: 0; top: 0;" width="100%" height="100%" src="about:blank" type="application/pdf" internalid="4C80DFDA2738145655DE7937BDA51A0F"></body></html>

instead of bytes

@elacuesta here is the minimal, reproducible example.

@elacuesta
Member

elacuesta commented Nov 16, 2023

Sorry, I cannot reproduce with scrapy-playwright 0.0.33 (3122f9c).

import scrapy

class AwesomeSpider(scrapy.Spider):
    name = "test_dl"
    custom_settings = {
        "TWISTED_REACTOR": "twisted.internet.asyncioreactor.AsyncioSelectorReactor",
        "DOWNLOAD_HANDLERS": {
            "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
        },
        # "PLAYWRIGHT_BROWSER_TYPE": "firefox",  # same result with chromium and firefox
    }

    def start_requests(self):
        yield scrapy.Request(
            "https://defret.in/assets/certificates/attestation_secnumacademie.pdf",
            meta={
                "playwright": True,
                "playwright_page_goto_kwargs": {
                    "wait_until": "networkidle",
                },
            },
        )

    async def parse(self, response):
        print("Response body size:", len(response.body))
        print("First bytes:")
        print(response.body[:200])

2023-11-16 13:46:09 [scrapy.core.engine] INFO: Spider opened
2023-11-16 13:46:09 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2023-11-16 13:46:09 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2023-11-16 13:46:09 [scrapy-playwright] INFO: Starting download handler
2023-11-16 13:46:14 [scrapy-playwright] INFO: Launching browser chromium
2023-11-16 13:46:14 [scrapy-playwright] INFO: Browser chromium launched
2023-11-16 13:46:14 [scrapy-playwright] DEBUG: Browser context started: 'default' (persistent=False, remote=False)
2023-11-16 13:46:14 [scrapy-playwright] DEBUG: [Context=default] New page created, page count is 1 (1 for all contexts)
2023-11-16 13:46:14 [scrapy-playwright] DEBUG: [Context=default] Request: <GET https://defret.in/assets/certificates/attestation_secnumacademie.pdf> (resource type: document)
2023-11-16 13:46:14 [scrapy-playwright] DEBUG: [Context=default] Response: <200 https://defret.in/assets/certificates/attestation_secnumacademie.pdf>
2023-11-16 13:46:15 [scrapy-playwright] WARNING: Navigating to <GET https://defret.in/assets/certificates/attestation_secnumacademie.pdf> returned None, the response will have empty headers and status 200
2023-11-16 13:46:15 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://defret.in/assets/certificates/attestation_secnumacademie.pdf> (referer: None) ['playwright']
Response body size: 1868169
First bytes:
b"%PDF-1.3\n%\xe2\xe3\xcf\xd3\n9 0 obj\n<< /Type /Page /Parent 1 0 R /LastModified (D:20200619180943+02'00') /Resources 2 0 R /MediaBox [0.000000 0.000000 841.890000 595.276000] /CropBox [0.000000 0.000000 841.890000 "
2023-11-16 13:46:15 [scrapy.core.engine] INFO: Closing spider (finished)

$ scrapy version -v
Scrapy       : 2.11.0
lxml         : 4.9.3.0
libxml2      : 2.10.3
cssselect    : 1.2.0
parsel       : 1.8.1
w3lib        : 2.1.2
Twisted      : 22.10.0
Python       : 3.10.0 (default, Oct  8 2021, 09:55:22) [GCC 7.5.0]
pyOpenSSL    : 23.2.0 (OpenSSL 3.1.2 1 Aug 2023)
cryptography : 41.0.3
Platform     : Linux-5.15.0-79-generic-x86_64-with-glibc2.35

@tommylge
Author

tommylge commented Nov 16, 2023

Okay, thanks for your fast answer. Pretty strange though: I tried many different versions and always get the issue.
I guess I haven't debugged enough yet, so it seems like it doesn't come from scrapy-playwright.

Could you tell us your playwright version please?
I'll keep you up to date.

@elacuesta
Member

Could you tell us your playwright version please?

$ playwright --version
Version 1.39.0

@kinoute

kinoute commented Nov 18, 2023

@elacuesta We were able to narrow down the problem to two settings. First, using the new headless mode of Chrome, like this:

PLAYWRIGHT_LAUNCH_OPTIONS = {
    'args': [
        '--headless=new',
    ],
    'ignore_default_args': [
        '--headless',
    ],
}

Removing this alone doesn't fix the problem. We also had to roll back to the default value of the Scrapy setting REQUEST_FINGERPRINTER_IMPLEMENTATION, which is 2.6: https://docs.scrapy.org/en/latest/topics/request-response.html#request-fingerprinter-implementation

Setting it to 2.7, which seems to be recommended for new projects, makes the problem appear again, whether the new headless Chrome mode is enabled or not.

@elacuesta
Member

elacuesta commented Nov 18, 2023

The REQUEST_FINGERPRINTER_IMPLEMENTATION setting is not relevant here: I tried several combinations of settings and it did not change the results. The relevant part is the new Chromium headless mode, enabled as you mentioned:

PLAYWRIGHT_LAUNCH_OPTIONS = {
    'args': ['--headless=new'],
    'ignore_default_args': ['--headless'],
}

This looks like an upstream bug: the download event is not being fired with the new headless mode. I've opened an upstream Playwright issue (microsoft/playwright-python#2169), although I suspect this is actually a Chromium issue.

@kinoute

kinoute commented Nov 21, 2023

I just saw the update on your Playwright issue: do you think there is a chance you could integrate one of the posted workarounds into your plug-in to handle this? There are also other workarounds in the issues listed there.

@elacuesta changed the title from "Binary file (PDF) download" to "Cannot download binary file (PDF) with Chromium headless=new mode" on Nov 21, 2023

@elacuesta
Member

I will have to take a look to see if the workaround applies in this case, as it was suggested way before the introduction of the new Chromium headless mode.

@kinoute

kinoute commented Nov 23, 2023

Thanks for your help. For now, we try to detect the PDF viewer code when using Chromium and we redirect the download to a non-Playwright spider.

We basically compare the content-type returned by the response headers with the real content-type by analyzing the response.body. If the headers say application/pdf but the body says text/html, we redirect.
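The check described above can be sketched roughly as follows. This is only an illustration, not the code we actually run: the helper names `sniff_content_type` and `is_pdf_viewer_page` are hypothetical, and the sniffing is deliberately coarse (PDF magic bytes versus an HTML prologue).

```python
PDF_MAGIC = b"%PDF-"


def sniff_content_type(body: bytes) -> str:
    """Guess a coarse content type from the first bytes of the body."""
    head = body[:1024].lstrip()
    if head.startswith(PDF_MAGIC):
        return "application/pdf"
    if head[:15].lower().startswith((b"<!doctype html", b"<html")):
        return "text/html"
    return "application/octet-stream"


def is_pdf_viewer_page(declared_content_type: str, body: bytes) -> bool:
    """True when the headers announce a PDF but the body is actually HTML.

    Hypothetical helper: `declared_content_type` is the Content-Type header
    value as a string, `body` is the raw response body.
    """
    # Drop parameters such as "; charset=utf-8" before comparing.
    declared = declared_content_type.split(";", 1)[0].strip().lower()
    return declared == "application/pdf" and sniff_content_type(body) == "text/html"
```

In a spider, a response for which `is_pdf_viewer_page(...)` returns True would be the one to re-request without Playwright.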

@elacuesta
Member

I'm a bit hesitant to include the mentioned workaround in the main package for now, but I realized it's possible to implement it with the existing API through the playwright_page_init_callback meta key. Hope that helps.

import re
import scrapy


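async def init_page(page, request):
    # Invoked by scrapy-playwright for newly created pages, before any request is made.
    async def handle_pdf(route):
        # Fetch the PDF through the context's API request client, then serve it
        # with "Content-Disposition: attachment" so that the browser treats it
        # as a download instead of rendering it in the PDF viewer.
        response = await page.context.request.get(route.request)
        await route.fulfill(
            response=response,
            headers={**response.headers, "Content-Disposition": "attachment"},
        )

    # Only requests whose URL matches this pattern are intercepted.
    await page.route(re.compile(r".*\.pdf"), lambda route: handle_pdf(route))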


class PdfSpider(scrapy.Spider):
    name = "pdf"
    custom_settings = {
        "TWISTED_REACTOR": "twisted.internet.asyncioreactor.AsyncioSelectorReactor",
        "DOWNLOAD_HANDLERS": {
            "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
        },
        "PLAYWRIGHT_LAUNCH_OPTIONS": {
            "args": ["--headless=new"],
            "ignore_default_args": ["--headless"],
        },
    }

    def start_requests(self):
        yield scrapy.Request(
            "https://defret.in/assets/certificates/attestation_secnumacademie.pdf",
            meta={
                "playwright": True,
                "playwright_page_init_callback": init_page,
            },
        )

    async def parse(self, response):
        print("Response body size:", len(response.body))
        print("First bytes:")
        print(response.body[:200])

@kinoute

kinoute commented Nov 29, 2023

Thanks for the code snippet! Unfortunately, it will not work for URLs that don't end with ".pdf" such as "?download=true" etc. We will try to figure something out and keep you updated.
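To illustrate the limitation, here is a small sketch that applies the same regular expression as the snippet above to a few URLs. The example.com URLs and the helper `would_be_intercepted` are made up for the example, and it assumes the route pattern is applied as a regex search against the full URL.

```python
import re

# Same pattern as in the init_page workaround.
PDF_URL_PATTERN = re.compile(r".*\.pdf")


def would_be_intercepted(url: str) -> bool:
    """Hypothetical helper: does the route pattern match this URL?"""
    return PDF_URL_PATTERN.search(url) is not None


for url in (
    "https://example.com/files/report.pdf",
    "https://example.com/files/report.pdf?download=true",
    "https://example.com/export?id=42&download=true",
):
    print(url, would_be_intercepted(url))
```

The last URL can still return a PDF, but since nothing in it contains ".pdf" the route handler never sees it.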

@elacuesta
Member

Yes, that's exactly why I don't want to add the workaround to the main package 😔
