Open WACZ files using the Scrapy stores #11

Open
leewesleyv opened this issue Oct 22, 2024 · 2 comments
Labels
cleanup/optimisation Refactoring or other code improvements

Comments

@leewesleyv
Collaborator

Currently we create the clients for fetching files from cloud providers ourselves (in utils.py/wacz.py). Ideally, we would re-use the functionality Scrapy already has for this, to reduce the complexity this brings in testing and maintenance.
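A minimal sketch of what re-using Scrapy's machinery could look like, assuming the FilesPipeline.STORE_SCHEMES mapping (scheme → FSFilesStore / S3FilesStore / GCSFilesStore / FTPFilesStore) that current Scrapy versions expose stays available; the get_store helper is a made-up name for illustration:

```python
from pathlib import Path
from urllib.parse import urlparse

from scrapy.pipelines.files import FilesPipeline


def get_store(uri: str):
    """Resolve a files store from a URI the way FilesPipeline does,
    instead of constructing boto3/GCS/FTP clients ourselves."""
    # Absolute local paths carry no scheme; treat them as file:// URIs.
    scheme = "file" if Path(uri).is_absolute() else urlparse(uri).scheme
    store_cls = FilesPipeline.STORE_SCHEMES[scheme]
    return store_cls(uri)
```

A store obtained this way would give us stat_file for free, but (as noted below) nothing for actually retrieving the file body.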

@leewesleyv
Collaborator Author

It seems like the file stores do not implement methods for downloading/retrieving a file. The closest thing they offer (and what the pipelines use) is stat_file, which retrieves a file's metadata (checksum, last modified). For example, S3FilesStore.stat_file calls s3_client.head_object. From the AWS docs:

The HEAD operation retrieves metadata from an object without returning the object itself. This operation is useful if you’re interested only in an object’s metadata.

Similar logic is implemented for the other stores.
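For S3, that roughly boils down to the following (a paraphrase, not Scrapy's exact code; the bucket and key below are placeholders):

```python
import boto3

s3_client = boto3.client("s3")

# A HEAD request: only metadata comes back, never the object body.
response = s3_client.head_object(Bucket="my-bucket", Key="archives/example.wacz")
checksum = response["ETag"].strip('"')    # used as the file checksum
last_modified = response["LastModified"]  # datetime of the last modification
```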

This means that we will be responsible for implementing this functionality ourselves. The only question that remains is whether we should extend the current stores or keep the approach we have now with helper functions. One way the first option could look is sketched below.
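A sketch of extending the S3 store, assuming S3FilesStore keeps its s3_client, bucket, and prefix attributes; read_file is a hypothetical method name, and the other stores would need equivalents:

```python
from scrapy.pipelines.files import S3FilesStore
from twisted.internet import threads


class DownloadableS3FilesStore(S3FilesStore):
    """Hypothetical subclass that adds retrieval next to stat_file."""

    def read_file(self, path):
        key_name = f"{self.prefix}{path}"

        def _download():
            # get_object fetches the body, unlike the head_object call
            # backing stat_file.
            response = self.s3_client.get_object(Bucket=self.bucket, Key=key_name)
            return response["Body"].read()

        # Run the blocking boto3 call in a thread, mirroring how
        # stat_file wraps its boto3 call.
        return threads.deferToThread(_download)
```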

@wvengen
Member

wvengen commented Oct 28, 2024

It sounds sensible to take the shortest route (in terms of effort needed) to get this working in a reasonable time. Eventually, this would best be added to Scrapy itself, so that other extensions/middlewares can make use of it too; either way the work remains useful.

Perhaps an order of things could be:

  1. Get clarity on what we'd need from an upstream change.
  2. Get clarity on whether stat_file and download_file (and whatever the previous step surfaces) would be welcome in Scrapy, e.g. by opening an issue there to consider their inclusion.
  3. If yes, and it looks doable without a long review process and many technical intricacies, try to get it upstream. If no, document our findings in the upstream issue and solve the issue in this project. Ideally the functionality would still move to Scrapy eventually (unless upstream decides it is out of scope).

leewesleyv mentioned this issue Oct 28, 2024