images don't appear to get read from the persistent context properly / cached #198

Open
pjlsergeant opened this issue May 10, 2023 · 3 comments

@pjlsergeant

I'm having trouble getting Scrapy + Playwright to use the browser's HTTP cache when crawling with a persistent context. I've reduced the problem to a minimal example, which you can see here:

https://github.com/pjlsergeant/scrapy-playwright-cache-bug
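For readers who don't want to open the repository, a persistent context in scrapy-playwright is typically configured along these lines. This is only a sketch of the relevant settings, not a copy of the linked project's configuration: the context name `persistent` and the profile path are placeholders chosen for illustration.

```python
# settings.py (sketch; values marked "hypothetical" are not from the linked repo)

# Route http/https downloads through scrapy-playwright.
DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"

PLAYWRIGHT_BROWSER_TYPE = "chromium"

# Passing user_data_dir makes scrapy-playwright launch the context as persistent.
PLAYWRIGHT_CONTEXTS = {
    "persistent": {  # hypothetical context name
        "user_data_dir": "/tmp/scrapy-playwright-profile",  # hypothetical path
    },
}
```

A request then opts into that context with `meta={"playwright": True, "playwright_context": "persistent"}`.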

app.py is a minimal Flask app that demonstrates the problem. If you start it (flask run) and then run the scrape (scrapy crawl crawl), you can see that the PNG at /pixel doesn't get cached, both from the Flask logs and from the final body output, <html><head></head><body>count:6</body></html>, which signifies 6 hits.
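To make the idea concrete without depending on the repository, here is a rough stand-in for such a hit-counting server. It is not the app.py from the linked repo: it uses only the standard library instead of Flask, and the `/count` path, the `Cache-Control` value, and the helper names are assumptions made for this sketch.

```python
import struct
import threading
import zlib
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


def _png_chunk(kind: bytes, data: bytes) -> bytes:
    """Build one PNG chunk: length, type, data, CRC over type + data."""
    crc = zlib.crc32(kind + data) & 0xFFFFFFFF
    return struct.pack(">I", len(data)) + kind + data + struct.pack(">I", crc)


# A 1x1, 8-bit grayscale PNG built by hand, so no image file is needed.
PIXEL = (
    b"\x89PNG\r\n\x1a\n"
    + _png_chunk(b"IHDR", struct.pack(">IIBBBBB", 1, 1, 8, 0, 0, 0, 0))
    + _png_chunk(b"IDAT", zlib.compress(b"\x00\x00"))  # filter byte + one pixel
    + _png_chunk(b"IEND", b"")
)


class Handler(BaseHTTPRequestHandler):
    hits = 0  # number of times /pixel actually reached the server
    lock = threading.Lock()

    def do_GET(self):
        if self.path == "/pixel":
            with Handler.lock:
                Handler.hits += 1
            # Cacheable response: a browser honouring its HTTP cache
            # should request this only once.
            self._send(PIXEL, "image/png", "max-age=3600")
        elif self.path == "/count":  # hypothetical path for this sketch
            self._send(f"count:{Handler.hits}".encode(), "text/plain", "no-store")
        else:
            self.send_error(404)

    def _send(self, body: bytes, content_type: str, cache_control: str) -> None:
        self.send_response(200)
        self.send_header("Content-Type", content_type)
        self.send_header("Content-Length", str(len(body)))
        self.send_header("Cache-Control", cache_control)
        self.end_headers()
        self.wfile.write(body)


def start_server(port: int = 0) -> ThreadingHTTPServer:
    """Start the server on a background thread; port 0 picks a free port."""
    server = ThreadingHTTPServer(("127.0.0.1", port), Handler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

If the cache were being used, repeated page loads would leave the counter at 1; a counter that grows with every load is the symptom described above.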

Interestingly, if you then launch Playwright manually with the same persistent config (something like browser_context = chromium.launch_persistent_context(userDataDir)), you'll see the image is already cached. So the image is being written to the cache during the Playwright + Scrapy run; it's just not being read from the cache when Playwright is driven by Scrapy.
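The manual check can be sketched roughly as follows, using Playwright's synchronous Python API. The function name and the idea of returning the page HTML are choices made for this sketch; the profile directory should be the same one the Scrapy run used.

```python
def load_with_persistent_context(user_data_dir: str, url: str) -> str:
    """Open `url` in a persistent Chromium context and return the page HTML."""
    # Imported lazily so this module can be imported without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        # Reuses the on-disk profile (cookies, HTTP cache) in user_data_dir.
        context = p.chromium.launch_persistent_context(user_data_dir, headless=True)
        try:
            page = context.new_page()
            page.goto(url)
            return page.content()
        finally:
            context.close()
```

Calling it with the crawl's profile directory and the test page URL, then watching the server logs, shows whether the image is requested again or served from the cache.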

Any help gratefully received.

@elacuesta
Member

It looks like this is caused by the use of Page.route. In their docs it says:

Enabling routing disables http cache.
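To illustrate what that means in practice, here is a minimal sketch of a route that does nothing except forward each request unchanged. It is not scrapy-playwright's actual handler (which uses the async API), and the helper name is made up for this example; the point is only that, per the quoted docs, registering any route on a page is enough to turn the HTTP cache off for it.

```python
def add_passthrough_route(page) -> None:
    """Register a catch-all route on a Playwright sync-API page.

    The handler forwards every request untouched, yet per the Playwright
    docs the mere presence of a route disables the HTTP cache.
    """
    page.route("**/*", lambda route: route.continue_())
```

Calling this on a page before `page.goto(...)` and reloading a few times should reproduce the repeated hits even in a persistent context with a warm cache.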

Unfortunately, this is necessary for some of the functionality of this integration, as I've explained elsewhere.

This seems to be a known limitation, and a lot of people are eager to have it removed from upstream Playwright: microsoft/playwright#7220.

@alembiewski

It's been over a year since this issue was opened - is it still impossible to enable caching of static resources like JS or CSS to speed up scraping? Are there any workarounds to allow this with scrapy-playwright?

@elacuesta
Member

No progress here, my previous comment still applies.
