images don't appear to get read from the persistent context properly / cached #198

Open
@pjlsergeant

I'm having trouble getting Scrapy + Playwright to use the browser's HTTP cache when crawling with a persistent context. I've reduced it to a minimal example, which you can see here:

https://github.com/pjlsergeant/scrapy-playwright-cache-bug
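For readers who don't want to open the repository, a persistent-context setup in scrapy-playwright looks roughly like the sketch below. This is not copied from the linked repo: the context name "persistent", the user data directory path, and the playwright_meta helper are all assumptions made for illustration.

```python
# settings.py (illustrative sketch; not taken from the linked repository)

# Route http/https downloads through scrapy-playwright.
DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
PLAYWRIGHT_BROWSER_TYPE = "chromium"

# Giving a context a user_data_dir makes scrapy-playwright launch it as a
# persistent context. The name and path here are placeholders.
PLAYWRIGHT_CONTEXTS = {
    "persistent": {"user_data_dir": "/tmp/playwright-user-data"},
}


def playwright_meta(context_name: str = "persistent") -> dict:
    """Hypothetical helper: request meta that sends a request through the named context."""
    return {"playwright": True, "playwright_context": context_name}
```

A spider would then pass meta=playwright_meta() on each scrapy.Request it yields.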

app.py is a minimal Flask app that demonstrates the problem. If you start it (flask run) and then run the scrape (scrapy crawl crawl), you can see that the PNG at /pixel is never served from the cache: both the Flask logs and the final body output, <html><head></head><body>count:6</body></html>, show 6 hits.
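To make the test setup concrete, here is a standard-library stand-in for that kind of hit-counting server. It is not the app.py from the repo (which uses Flask); the /count route, the Cache-Control value, and all class and function names are assumptions chosen for the sketch.

```python
import struct
import threading
import zlib
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


def _png_chunk(kind: bytes, data: bytes) -> bytes:
    """Frame one PNG chunk: length, type, data, CRC over type + data."""
    crc = zlib.crc32(kind + data)
    return struct.pack(">I", len(data)) + kind + data + struct.pack(">I", crc)


def make_pixel_png() -> bytes:
    """Build a valid 1x1 8-bit grayscale PNG."""
    ihdr = struct.pack(">IIBBBBB", 1, 1, 8, 0, 0, 0, 0)
    idat = zlib.compress(b"\x00\x00")  # filter byte 0, then one black pixel
    return (b"\x89PNG\r\n\x1a\n" + _png_chunk(b"IHDR", ihdr)
            + _png_chunk(b"IDAT", idat) + _png_chunk(b"IEND", b""))


class PixelCounter:
    """Counts how often /pixel is actually requested from the server."""

    def __init__(self) -> None:
        self.hits = 0
        self._lock = threading.Lock()
        self._pixel = make_pixel_png()

    def respond(self, path: str) -> tuple[dict[str, str], bytes]:
        if path == "/pixel":
            with self._lock:
                self.hits += 1
            # Cacheable for an hour, so a working browser cache should stop re-requesting it.
            headers = {"Content-Type": "image/png",
                       "Cache-Control": "public, max-age=3600"}
            return headers, self._pixel
        if path == "/count":
            html = f"<html><head></head><body>count:{self.hits}</body></html>"
        else:
            html = '<html><head></head><body><img src="/pixel"></body></html>'
        return {"Content-Type": "text/html", "Cache-Control": "no-store"}, html.encode()


def make_handler(counter: PixelCounter) -> type:
    """Wrap a PixelCounter in an http.server request handler class."""
    class Handler(BaseHTTPRequestHandler):
        def do_GET(self) -> None:
            headers, body = counter.respond(self.path)
            self.send_response(200)
            for name, value in headers.items():
                self.send_header(name, value)
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
    return Handler


def serve(port: int = 5000) -> None:
    """Serve on localhost until interrupted."""
    handler = make_handler(PixelCounter())
    ThreadingHTTPServer(("127.0.0.1", port), handler).serve_forever()
```

Calling serve() and pointing the crawl at http://127.0.0.1:5000/ gives the same kind of signal as the Flask logs: every uncached fetch of /pixel bumps the counter reported at /count.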

Interestingly, if you then load Playwright manually with the same persistent config (something like browser_context = chromium.launch_persistent_context(userDataDir)), you'll see the image is already cached. So the image is written to the cache during the Playwright+Scrapy run; it just isn't read back from the cache when Playwright is driven by Scrapy.

Any help gratefully received

elacuesta (Member) commented on Sep 4, 2023

It looks like this is caused by the use of Page.route. The Playwright docs say:

Enabling routing disables http cache.

Unfortunately, this is necessary for some of the functionality of this integration, as I've explained elsewhere.

This seems to be a known limitation, and many people are eager to have it removed in upstream Playwright: microsoft/playwright#7220.
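One way to check this explanation outside Scrapy is to load the same page in a persistent context with and without a pass-through route and compare the server's hit counts. The sketch below uses Playwright's sync API; the function name and its parameters are made up for illustration, and it is not how scrapy-playwright itself registers routes.

```python
def load_with_optional_route(user_data_dir: str, url: str, with_route: bool) -> str:
    """Load url in a persistent Chromium context, optionally with a no-op route."""
    # Imported here so this module can be imported without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        context = p.chromium.launch_persistent_context(user_data_dir, headless=True)
        try:
            page = context.new_page()
            if with_route:
                # The handler changes nothing about the request; the point is
                # only that a route is registered on the page.
                page.route("**/*", lambda route: route.continue_())
            page.goto(url, wait_until="networkidle")
            return page.content()
        finally:
            context.close()
```

If the docs note quoted above applies, repeated calls with with_route=True against the same user_data_dir should keep hitting the server for the image, while calls with with_route=False should be able to reuse the cached copy. That is an expectation to verify, not a measured result.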

alembiewski commented on Dec 9, 2024

It's been over a year since this issue was opened - is it still impossible to enable caching of static resources like JS or CSS to speed up scraping? Are there any workarounds to allow this with scrapy-playwright?

elacuesta (Member) commented on Dec 28, 2024

No progress here, my previous comment still applies.

images don't appear to get read from the persistent context properly / cached · Issue #198 · scrapy-plugins/scrapy-playwright