Description
I'm having trouble getting Scrapy + Playwright to respect caches when crawling, when using a persistent context. I've tried to get it down to a minimal example, which you can see here:
https://github.com/pjlsergeant/scrapy-playwright-cache-bug
app.py is a minimal Flask app to demonstrate; if you start it (`flask run`) and then run the scrape (`scrapy crawl crawl`), you can see that the PNG at `/pixel` doesn't get cached, both from the Flask logs and from the final body output (`<html><head></head><body>count:6</body></html>`, signifying 6 hits).
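For readers who don't want to open the repo, a minimal sketch of what such a test app might look like (the real app.py is in the linked repo; the route names, counter, and markup here are assumptions, and the pixel bytes are a placeholder rather than a real PNG):

```python
from flask import Flask, Response

app = Flask(__name__)
hit_count = 0

# Placeholder bytes standing in for a real 1x1 PNG.
PIXEL = b"\x89PNG fake-pixel"

@app.route("/")
def index():
    # Page that embeds the pixel and reports how many times it has been fetched.
    return f"<html><head></head><body>count:{hit_count}<img src='/pixel'></body></html>"

@app.route("/pixel")
def pixel():
    global hit_count
    hit_count += 1
    resp = Response(PIXEL, mimetype="image/png")
    # Long max-age: a client that honours the HTTP cache should only
    # request this once per cache lifetime.
    resp.headers["Cache-Control"] = "public, max-age=86400"
    return resp
```

With headers like these, every extra `count` increment means the browser went back to the network instead of serving the pixel from its cache.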
Interestingly, if you then manually load Playwright using the persistent config (something like `browser_context = chromium.launch_persistent_context(userDataDir)`), you'll see the image is already cached. So the image is being written to the cache during the Scrapy + Playwright run; it's just not being read back from the cache while Playwright is driven by Scrapy.
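You can also corroborate the "written to the cache" half without launching a browser at all, by looking inside the persistent user data dir. A small sketch (the `Default/Cache` layout is an assumption; Chromium's on-disk cache location can vary by platform and version):

```python
from pathlib import Path

def cache_entries(user_data_dir: str) -> list[Path]:
    """List files under Chromium's on-disk HTTP cache inside a
    persistent user data dir. Returns [] if the cache dir is absent."""
    cache_dir = Path(user_data_dir) / "Default" / "Cache"
    if not cache_dir.is_dir():
        return []
    return sorted(p for p in cache_dir.rglob("*") if p.is_file())
```

If the list grows after a Scrapy + Playwright run, responses are being stored; the bug is purely on the read path.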
Any help gratefully received
Activity
elacuesta commented on Sep 4, 2023
It looks like this is caused by the use of `Page.route`; the Playwright docs note that enabling routing disables the HTTP cache. Unfortunately, routing is necessary for some of the functionality of this integration, as I've explained elsewhere.
Seems like this is a known limitation and a lot of people are eager to have it removed from upstream Playwright: microsoft/playwright#7220.
alembiewski commented on Dec 9, 2024
It's been over a year since this issue was opened - is it still impossible to enable caching of static resources like JS or CSS to speed up scraping? Are there any workarounds to allow this with scrapy-playwright?
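One workaround pattern that's sometimes used (not a scrapy-playwright feature, just a general technique): since `Page.route` already intercepts every request, the route handler can serve repeat requests for static assets from its own in-memory store instead of going back to the network. A framework-agnostic sketch of that idea; the `ResponseCache` name and API are hypothetical:

```python
from urllib.parse import urlparse

# Extensions treated as static assets; adjust to taste.
CACHEABLE_EXTENSIONS = {".js", ".css", ".png", ".jpg", ".gif", ".woff2"}

class ResponseCache:
    """In-memory store for static-asset responses, keyed by URL."""

    def __init__(self):
        self._store = {}

    def is_cacheable(self, url: str) -> bool:
        path = urlparse(url).path
        return any(path.endswith(ext) for ext in CACHEABLE_EXTENSIONS)

    def get(self, url: str):
        # Returns (status, headers, body) or None on a miss.
        return self._store.get(url)

    def put(self, url: str, status: int, headers: dict, body: bytes) -> None:
        if self.is_cacheable(url):
            self._store[url] = (status, headers, body)

# Inside a Playwright route handler one would then, roughly:
#   hit = cache.get(request.url)
#   if hit is not None: fulfill the route from the stored tuple
#   else: fetch over the network, cache.put(...), and fulfill with the response
```

This doesn't restore real HTTP cache semantics (no revalidation, no expiry), but for immutable JS/CSS bundles it avoids re-downloading them on every page.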
elacuesta commented on Dec 28, 2024
No progress here, my previous comment still applies.