I'm having trouble getting Scrapy + Playwright to respect caches when crawling with a persistent context. I've tried to reduce it to a minimal example, which you can see here: https://github.com/pjlsergeant/scrapy-playwright-cache-bug
`app.py` is a minimal Flask app to demonstrate the problem. If you start it (`flask run`) and then run the scrape (`scrapy crawl crawl`), you can see that the PNG at `/pixel` doesn't get cached, both from the Flask logs and from the final body output, `<html><head></head><body>count:6</body></html>`, which signifies 6 hits.
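For readers without the repository at hand, the shape of such a test server can be sketched roughly as follows. This is not the repository's `app.py`: it is a standard-library stand-in (so it runs without Flask), and the `/count` route, the `Cache-Control` value, the `make_server` helper, and the placeholder PNG bytes are all assumptions made for illustration.

```python
import base64
import sys
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

# Placeholder image bytes, intended as a 1x1 PNG; substitute any image file.
PIXEL_PNG = base64.b64decode(
    "iVBORw0KGgoAAAANSUhEUgAAAAEAAAABCAYAAAAfFcSJAAAADUlEQVR42mNkYPhfDwAChwGA"
    "60e6kgAAAABJRU5ErkJggg=="
)


class CountingHandler(BaseHTTPRequestHandler):
    pixel_hits = 0            # number of times /pixel reached the server
    _lock = threading.Lock()  # the server is threaded, so guard the counter

    def do_GET(self):
        if self.path == "/pixel":
            with CountingHandler._lock:
                CountingHandler.pixel_hits += 1
            # Assumed header: marks the image as cacheable for an hour.
            self._send(PIXEL_PNG, "image/png", "public, max-age=3600")
        elif self.path == "/count":
            body = f"count:{CountingHandler.pixel_hits}".encode()
            self._send(body, "text/plain", "no-store")
        else:
            self.send_error(404)

    def _send(self, body, content_type, cache_control):
        self.send_response(200)
        self.send_header("Content-Type", content_type)
        self.send_header("Content-Length", str(len(body)))
        self.send_header("Cache-Control", cache_control)
        self.end_headers()
        self.wfile.write(body)


def make_server(port=0):
    """Bind to 127.0.0.1; port 0 lets the OS pick a free port."""
    return ThreadingHTTPServer(("127.0.0.1", port), CountingHandler)


if __name__ == "__main__" and sys.argv[1:] == ["serve"]:
    make_server(5000).serve_forever()
```

Running the file with the argument `serve` starts it on port 5000. With a server of this kind, a browser that honoured its cache would be expected to fetch `/pixel` once across repeated page loads, so a count of 6 suggests every load went back to the server.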
Interestingly, if you then manually load up Playwright using the persistent config (something like `browser_context = chromium.launch_persistent_context(userDataDir)`), you'll see the image is already cached. So the image is being written to the cache during the Playwright + Scrapy run; it just isn't being loaded from the cache when Playwright is driven by Scrapy.
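A minimal sketch of that manual check, using Playwright's synchronous Python API, might look like the following. The function names, the command-line arguments, and the idea of parsing the `count:N` body are illustrative assumptions; the profile directory should be the same one the Scrapy run used, and the URL is whichever page of the test app embeds `/pixel`.

```python
import re
import sys


def parse_count(html: str) -> int:
    """Extract N from a body containing 'count:N', as in the output quoted above."""
    match = re.search(r"count:(\d+)", html)
    if match is None:
        raise ValueError("no 'count:N' marker found in page body")
    return int(match.group(1))


def fetch_with_persistent_context(user_data_dir: str, url: str) -> str:
    """Load `url` in Chromium with an on-disk profile and return the page HTML."""
    # Imported lazily so parse_count() is usable without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        # Reuses the profile (and therefore the HTTP cache) stored in user_data_dir.
        context = p.chromium.launch_persistent_context(user_data_dir, headless=True)
        try:
            page = context.new_page()
            page.goto(url)
            return page.content()
        finally:
            context.close()


if __name__ == "__main__" and len(sys.argv) == 3:
    # Usage: python check_cache.py <user_data_dir> <url>
    profile_dir, target_url = sys.argv[1], sys.argv[2]
    print(parse_count(fetch_with_persistent_context(profile_dir, target_url)))
```

If the reported count does not increase between runs of this script, the image is presumably being served from the profile's cache, which is the behaviour described above.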
Any help gratefully received.
It's been over a year since this issue was opened - is it still impossible to enable caching of static resources like JS or CSS to speed up scraping? Are there any workarounds to allow this with scrapy-playwright?