Cannot download binary file (PDF) with Chromium headless=new mode #243

tommylge · 2023-11-15T09:14:47Z

I am facing an issue when using Chromium: when trying to download a PDF file, response.body contains the viewer plugin HTML, not the PDF bytes.

There's already a related fix here: 0140b90

It worked for a month, but not anymore: I'm getting the issue again :/

My code hasn't changed since your fix that worked.

The related issue: #184

elacuesta · 2023-11-15T12:54:59Z

Please provide a minimal, reproducible example.

tommylge · 2023-11-15T16:59:20Z

import scrapy

class AwesomeSpider(scrapy.Spider):
    name = "test_dl"
    handle_httpstatus_list = [403]

    def start_requests(self):
        # GET request
        yield scrapy.Request(
            "https://defret.in/assets/certificates/attestation_secnumacademie.pdf",
            meta={
                "playwright": True,
                "playwright_page_goto_kwargs": {
                    "wait_until": "networkidle",
                },
            },
            callback=self.parse,
        )

    async def parse(self, response):
        print(response.body)

output:

<!DOCTYPE html><html><head></head><body style="height: 100%; width: 100%; overflow: hidden; margin:0px; background-color: rgb(38, 38, 38);"><embed name="4C80DFDA2738145655DE7937BDA51A0F" style="position:absolute; left: 0; top: 0;" width="100%" height="100%" src="about:blank" type="application/pdf" internalid="4C80DFDA2738145655DE7937BDA51A0F"></body></html>

instead of bytes

@elacuesta here is the minimal, reproducible example.

elacuesta · 2023-11-16T16:50:29Z

Sorry, I cannot reproduce with scrapy-playwright 0.0.33 (3122f9c).

import scrapy

class AwesomeSpider(scrapy.Spider):
    name = "test_dl"
    custom_settings = {
        "TWISTED_REACTOR": "twisted.internet.asyncioreactor.AsyncioSelectorReactor",
        "DOWNLOAD_HANDLERS": {
            "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
        },
        # "PLAYWRIGHT_BROWSER_TYPE": "firefox",  # same result with chromium and firefox
    }

    def start_requests(self):
        yield scrapy.Request(
            "https://defret.in/assets/certificates/attestation_secnumacademie.pdf",
            meta={
                "playwright": True,
                "playwright_page_goto_kwargs": {
                    "wait_until": "networkidle",
                },
            },
        )

    async def parse(self, response):
        print("Response body size:", len(response.body))
        print("First bytes:")
        print(response.body[:200])

2023-11-16 13:46:09 [scrapy.core.engine] INFO: Spider opened
2023-11-16 13:46:09 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2023-11-16 13:46:09 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2023-11-16 13:46:09 [scrapy-playwright] INFO: Starting download handler
2023-11-16 13:46:14 [scrapy-playwright] INFO: Launching browser chromium
2023-11-16 13:46:14 [scrapy-playwright] INFO: Browser chromium launched
2023-11-16 13:46:14 [scrapy-playwright] DEBUG: Browser context started: 'default' (persistent=False, remote=False)
2023-11-16 13:46:14 [scrapy-playwright] DEBUG: [Context=default] New page created, page count is 1 (1 for all contexts)
2023-11-16 13:46:14 [scrapy-playwright] DEBUG: [Context=default] Request: <GET https://defret.in/assets/certificates/attestation_secnumacademie.pdf> (resource type: document)
2023-11-16 13:46:14 [scrapy-playwright] DEBUG: [Context=default] Response: <200 https://defret.in/assets/certificates/attestation_secnumacademie.pdf>
2023-11-16 13:46:15 [scrapy-playwright] WARNING: Navigating to <GET https://defret.in/assets/certificates/attestation_secnumacademie.pdf> returned None, the response will have empty headers and status 200
2023-11-16 13:46:15 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://defret.in/assets/certificates/attestation_secnumacademie.pdf> (referer: None) ['playwright']
Response body size: 1868169
First bytes:
b"%PDF-1.3\n%\xe2\xe3\xcf\xd3\n9 0 obj\n<< /Type /Page /Parent 1 0 R /LastModified (D:20200619180943+02'00') /Resources 2 0 R /MediaBox [0.000000 0.000000 841.890000 595.276000] /CropBox [0.000000 0.000000 841.890000 "
2023-11-16 13:46:15 [scrapy.core.engine] INFO: Closing spider (finished)

$ scrapy version -v
Scrapy       : 2.11.0
lxml         : 4.9.3.0
libxml2      : 2.10.3
cssselect    : 1.2.0
parsel       : 1.8.1
w3lib        : 2.1.2
Twisted      : 22.10.0
Python       : 3.10.0 (default, Oct  8 2021, 09:55:22) [GCC 7.5.0]
pyOpenSSL    : 23.2.0 (OpenSSL 3.1.2 1 Aug 2023)
cryptography : 41.0.3
Platform     : Linux-5.15.0-79-generic-x86_64-with-glibc2.35

tommylge · 2023-11-16T17:35:49Z

Okay, thanks for your fast answer. Pretty strange though: I tried many different versions and always got the issue.
I guess I haven't debugged enough yet, so it seems like it doesn't come from scrapy-playwright.

Could you tell us your playwright version please?
I'll keep you up to date.

elacuesta · 2023-11-16T19:04:52Z

Could you tell us your playwright version please?

$ playwright --version
Version 1.39.0

kinoute · 2023-11-18T17:45:24Z

@elacuesta We were able to narrow down the problem to two settings. First, using the new headless mode of Chrome, like this:

PLAYWRIGHT_LAUNCH_OPTIONS = {
      'args': [
          '--headless=new',
      ],
      'ignore_default_args': [
          '--headless',
      ],
}

Removing this doesn't fix the problem on its own. We also had to roll back to the default value of the Scrapy setting REQUEST_FINGERPRINTER_IMPLEMENTATION, which is 2.6: https://docs.scrapy.org/en/latest/topics/request-response.html#request-fingerprinter-implementation

Setting it to 2.7, which seems to be recommended for new projects, makes the problem appear again, whether the new headless Chrome mode is enabled or not.

elacuesta · 2023-11-18T18:57:16Z

The REQUEST_FINGERPRINTER_IMPLEMENTATION setting is not relevant here, I tried several settings combinations and it did not change the results. The relevant part is the new Chromium headless mode, enabled as you mentioned:

PLAYWRIGHT_LAUNCH_OPTIONS = {
    'args': ['--headless=new'],
    'ignore_default_args': ['--headless'],
}

This looks like an upstream bug, the download event is not being fired with the new headless mode. I've opened an upstream Playwright issue (microsoft/playwright-python#2169), although I suspect this is actually a Chromium issue.

kinoute · 2023-11-21T12:20:06Z

I just saw the update on your Playwright issue: do you think there is a chance you could integrate one of the workarounds posted there into your plugin to handle this? There are also other workarounds in the issues listed.

elacuesta · 2023-11-21T13:00:38Z

I will have to take a look to see if the workaround applies in this case, as it was suggested way before the introduction of the new Chromium headless mode.

kinoute · 2023-11-23T16:42:21Z

Thanks for your help. For now, we try to detect the PDF viewer code when using Chromium and we redirect the download to a non-Playwright spider.

We basically compare the content-type returned by the response headers with the real content-type by analyzing the response.body. If the headers say application/pdf but the body says text/html, we redirect.
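
The check described above could look roughly like the sketch below. This is only an illustration, not the code actually used here: the function names (declared_type, sniff_type, is_pdf_viewer_page) are hypothetical, and the body sniffing is a simple prefix test rather than a full content-type detector.

```python
def declared_type(content_type_header: str) -> str:
    """Return the bare media type from a Content-Type header value."""
    return content_type_header.split(";", 1)[0].strip().lower()


def sniff_type(body: bytes) -> str:
    """Guess the real content type from the first bytes of the body."""
    head = body[:1024].lstrip().lower()
    if head.startswith(b"%pdf-"):
        # Real PDF files start with the "%PDF-" magic bytes.
        return "application/pdf"
    if head.startswith(b"<!doctype html") or head.startswith(b"<html"):
        return "text/html"
    return "application/octet-stream"


def is_pdf_viewer_page(content_type_header: str, body: bytes) -> bool:
    """True when the headers claim a PDF but the body is actually HTML."""
    return (
        declared_type(content_type_header) == "application/pdf"
        and sniff_type(body) == "text/html"
    )
```

In a spider callback, a True result would be the signal to re-issue the request without the "playwright" meta key, assuming the Content-Type header is available on the response.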

elacuesta · 2023-11-28T23:54:31Z

I'm a bit hesitant to include the mentioned workaround in the main package for now, but I realized it's possible to implement it with the existing API through the playwright_page_init_callback meta key. Hope that helps.

import re
import scrapy


async def init_page(page, request):
    async def handle_pdf(route):
        response = await page.context.request.get(route.request)
        await route.fulfill(
            response=response,
            headers={**response.headers, "Content-Disposition": "attachment"},
        )

    await page.route(re.compile(r".*\.pdf"), lambda route: handle_pdf(route))


class PdfSpider(scrapy.Spider):
    name = "pdf"
    custom_settings = {
        "TWISTED_REACTOR": "twisted.internet.asyncioreactor.AsyncioSelectorReactor",
        "DOWNLOAD_HANDLERS": {
            "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
        },
        "PLAYWRIGHT_LAUNCH_OPTIONS": {
            "args": ["--headless=new"],
            "ignore_default_args": ["--headless"],
        },
    }

    def start_requests(self):
        yield scrapy.Request(
            "https://defret.in/assets/certificates/attestation_secnumacademie.pdf",
            meta={
                "playwright": True,
                "playwright_page_init_callback": init_page,
            },
        )

    async def parse(self, response):
        print("Response body size:", len(response.body))
        print("First bytes:")
        print(response.body[:200])

kinoute · 2023-11-29T06:26:12Z

Thanks for the code snippet! Unfortunately, it will not work for URLs that don't end with ".pdf" such as "?download=true" etc. We will try to figure something out and keep you updated.
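
One possible direction, as an untested sketch: intercept every request instead of matching on the URL, and decide based on the Content-Type of the fetched response. The helper name force_attachment is hypothetical, and I have not verified that this behaves correctly with headless=new. It also sends every request through route.fetch(), which may slow pages down.

```python
def force_attachment(headers: dict) -> dict:
    """Return a copy of the headers, adding Content-Disposition: attachment
    when the Content-Type is application/pdf (header names matched
    case-insensitively). Other responses keep their headers unchanged."""
    content_type = next(
        (v for k, v in headers.items() if k.lower() == "content-type"), ""
    )
    if content_type.split(";", 1)[0].strip().lower() != "application/pdf":
        return dict(headers)
    # Drop any existing Content-Disposition so only one value is sent.
    cleaned = {k: v for k, v in headers.items() if k.lower() != "content-disposition"}
    return {**cleaned, "Content-Disposition": "attachment"}


async def handle_route(route):
    # Perform the request ourselves, then hand the (possibly modified)
    # response back to the page.
    response = await route.fetch()
    await route.fulfill(response=response, headers=force_attachment(response.headers))


async def init_page(page, request):
    # Match all URLs; the PDF decision is made from the response headers.
    await page.route("**/*", handle_route)
```

It would be passed the same way as before, via the "playwright_page_init_callback" meta key.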

elacuesta · 2023-11-29T14:40:45Z

Yes, that's exactly why I don't want to add the workaround to the main package 😔

elacuesta added the needs more info label Nov 15, 2023

elacuesta added could not reproduce and removed needs more info labels Nov 16, 2023

elacuesta added upstream issue and removed could not reproduce labels Nov 18, 2023

elacuesta changed the title ~~Binary file (PDF) download~~ Cannot download binary file (PDF) with Chromium headless=new mode Nov 21, 2023
