Cannot download binary file (PDF) with Chromium headless=new mode #243
Please provide a minimal, reproducible example. |
@elacuesta here is a minimal, reproducible example:

```python
import scrapy


class AwesomeSpider(scrapy.Spider):
    name = "test_dl"
    handle_httpstatus_list = [403]

    def start_requests(self):
        # GET request
        yield scrapy.Request(
            "https://defret.in/assets/certificates/attestation_secnumacademie.pdf",
            meta={
                "playwright": True,
                "playwright_page_goto_kwargs": {
                    "wait_until": "networkidle",
                },
            },
            callback=self.parse,
        )

    async def parse(self, response):
        print(response.body)
```

Output:

```html
<!DOCTYPE html><html><head></head><body style="height: 100%; width: 100%; overflow: hidden; margin:0px; background-color: rgb(38, 38, 38);"><embed name="4C80DFDA2738145655DE7937BDA51A0F" style="position:absolute; left: 0; top: 0;" width="100%" height="100%" src="about:blank" type="application/pdf" internalid="4C80DFDA2738145655DE7937BDA51A0F"></body></html>
```

instead of bytes. |
Sorry, I cannot reproduce with scrapy-playwright 0.0.33 (3122f9c):

```python
import scrapy


class AwesomeSpider(scrapy.Spider):
    name = "test_dl"
    custom_settings = {
        "TWISTED_REACTOR": "twisted.internet.asyncioreactor.AsyncioSelectorReactor",
        "DOWNLOAD_HANDLERS": {
            "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
        },
        # "PLAYWRIGHT_BROWSER_TYPE": "firefox",  # same result with chromium and firefox
    }

    def start_requests(self):
        yield scrapy.Request(
            "https://defret.in/assets/certificates/attestation_secnumacademie.pdf",
            meta={
                "playwright": True,
                "playwright_page_goto_kwargs": {
                    "wait_until": "networkidle",
                },
            },
        )

    async def parse(self, response):
        print("Response body size:", len(response.body))
        print("First bytes:")
        print(response.body[:200])
```
|
Okay, thanks for your fast answer. Pretty strange though, I tried with many different versions and always got the issue. Could you tell us your Playwright version, please? |
@elacuesta We were able to narrow down the problem to two settings. First, using the new headless mode of Chrome, like this:

```python
PLAYWRIGHT_LAUNCH_OPTIONS = {
    'args': [
        '--headless=new',
    ],
    'ignore_default_args': [
        '--headless',
    ],
}
```

Removing this alone doesn't fix the problem. We also had to roll back to the default value of the Scrapy setting. Setting it to 2.7, which seems to be recommended for new projects, makes the problem appear again, whether the new headless Chrome mode is enabled or not. |
The `PLAYWRIGHT_LAUNCH_OPTIONS` seem to be the key:

```python
PLAYWRIGHT_LAUNCH_OPTIONS = {
    'args': ['--headless=new'],
    'ignore_default_args': ['--headless'],
}
```

This looks like an upstream bug: the download event is not being fired with the new headless mode. I've opened an upstream Playwright issue (microsoft/playwright-python#2169), although I suspect this is actually a Chromium issue. |
I just saw the update on your Playwright issue: do you think there is a chance you could integrate one of the posted workarounds into your plugin to handle this? There are also other workarounds in the linked issues. |
I will have to take a look to see if the workaround applies in this case, as it was suggested way before the introduction of the new Chromium headless mode. |
Thanks for your help. For now, we try to detect the PDF viewer code when using Chromium, and we redirect the download to a non-Playwright spider. We basically compare the content type declared in the response headers with the real content type determined by analyzing the response.body. If the headers say |
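The mismatch check described above (headers declare a PDF, but the body is actually the viewer HTML) can be sketched with a couple of helpers. This is only a sketch of the idea; the function names are hypothetical and not part of scrapy-playwright:

```python
def declared_content_type(headers: dict) -> str:
    # Strip any "; charset=..." suffix and normalize case.
    return headers.get("Content-Type", "").split(";")[0].strip().lower()


def body_is_pdf(body: bytes) -> bool:
    # Real PDF files start with the %PDF- magic bytes; the Chromium
    # viewer page starts with "<!DOCTYPE html>" instead.
    return body.lstrip().startswith(b"%PDF-")


def got_viewer_instead_of_pdf(headers: dict, body: bytes) -> bool:
    # The case worth retrying without Playwright: the server says PDF,
    # but the body we received is not one.
    return declared_content_type(headers) == "application/pdf" and not body_is_pdf(body)
```

In a spider callback, a positive result could trigger re-yielding the same request without the `playwright` meta key, so plain Scrapy downloads the raw file.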
I'm a bit hesitant to include the mentioned workaround in the main package for now, but I realized it's possible to implement it with the existing API, through the `playwright_page_init_callback` meta key:

```python
import re

import scrapy


async def init_page(page, request):
    async def handle_pdf(route):
        response = await page.context.request.get(route.request)
        await route.fulfill(
            response=response,
            headers={**response.headers, "Content-Disposition": "attachment"},
        )

    await page.route(re.compile(r".*\.pdf"), handle_pdf)


class PdfSpider(scrapy.Spider):
    name = "pdf"
    custom_settings = {
        "TWISTED_REACTOR": "twisted.internet.asyncioreactor.AsyncioSelectorReactor",
        "DOWNLOAD_HANDLERS": {
            "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
        },
        "PLAYWRIGHT_LAUNCH_OPTIONS": {
            "args": ["--headless=new"],
            "ignore_default_args": ["--headless"],
        },
    }

    def start_requests(self):
        yield scrapy.Request(
            "https://defret.in/assets/certificates/attestation_secnumacademie.pdf",
            meta={
                "playwright": True,
                "playwright_page_init_callback": init_page,
            },
        )

    async def parse(self, response):
        print("Response body size:", len(response.body))
        print("First bytes:")
        print(response.body[:200])
```
|
Thanks for the code snippet! Unfortunately, it will not work for URLs that don't end with ".pdf", such as those ending with "?download=true", etc. We will try to figure something out and keep you updated. |
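One way around the URL-pattern limitation is to intercept every navigation and decide based on the response's own Content-Type header rather than the URL. The following is an untested sketch using the same `playwright_page_init_callback` mechanism shown above, not something the maintainers have endorsed:

```python
async def init_page(page, request):
    async def handle_route(route):
        # Only intercept top-level navigations; let subresources pass through.
        if route.request.resource_type != "document":
            await route.continue_()
            return
        # Fetch the resource ourselves so we can inspect the real headers.
        response = await page.context.request.get(route.request)
        content_type = response.headers.get("content-type", "")
        if "application/pdf" in content_type:
            # Force a download instead of the built-in PDF viewer,
            # regardless of what the URL looks like.
            await route.fulfill(
                response=response,
                headers={**response.headers, "Content-Disposition": "attachment"},
            )
        else:
            await route.fulfill(response=response)

    # Match all URLs, not just those ending in ".pdf".
    await page.route("**/*", handle_route)
```

The trade-off: re-fetching every document via `context.request` means a second request for non-PDF pages, so this buys correctness at the cost of overhead.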
Yes, that's exactly why I don't want to add the workaround to the main package 😔 |
I am facing an issue when using Chromium to download a PDF file: response.body is the viewer plugin HTML, not the PDF bytes.
There's already a related fix here: 0140b90
It worked for a month, but not anymore; I'm still getting the issue :/
My code hasn't changed since your fix that worked.
The related issue: #184