Open WACZ files using the Scrapy stores #11

Open
leewesleyv opened this issue Oct 22, 2024 · 2 comments
Labels
cleanup/optimisation Refactoring or other code improvements

Comments

@leewesleyv
Collaborator

Currently we create the clients for fetching files from cloud providers ourselves (in utils.py/wacz.py). Ideally, we would re-use the functionality Scrapy already has for this, to reduce the complexity this brings in testing and maintenance.
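A minimal sketch of what re-using Scrapy's machinery could look like, assuming the FilesPipeline.STORE_SCHEMES mapping (scheme → FSFilesStore / S3FilesStore / GCSFilesStore / FTPFilesStore) that current Scrapy versions expose stays available; the get_store helper is a made-up name for illustration:

```python
from pathlib import Path
from urllib.parse import urlparse

from scrapy.pipelines.files import FilesPipeline


def get_store(uri: str):
    """Resolve a files store from a URI the way FilesPipeline does,
    instead of constructing boto3/GCS/FTP clients ourselves."""
    # Absolute local paths carry no scheme; treat them as file:// URIs.
    scheme = "file" if Path(uri).is_absolute() else urlparse(uri).scheme
    store_cls = FilesPipeline.STORE_SCHEMES[scheme]
    return store_cls(uri)
```

A store obtained this way would give us stat_file for free, but (as noted below) nothing for actually retrieving the file body.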

@leewesleyv
Collaborator Author

It seems like the file stores do not implement methods for downloading/retrieving a file. The closest thing they offer (and what the pipelines use) is stat_file, which retrieves a file's metadata (checksum, last modified). For example, S3FilesStore.stat_file calls s3_client.head_object. From the AWS docs:

The HEAD operation retrieves metadata from an object without returning the object itself. This operation is useful if you’re interested only in an object’s metadata.

Similar logic is implemented for the other stores.
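For S3, that roughly boils down to the following (a paraphrase, not Scrapy's exact code; the bucket and key below are placeholders):

```python
import boto3

s3_client = boto3.client("s3")

# A HEAD request: only metadata comes back, never the object body.
response = s3_client.head_object(Bucket="my-bucket", Key="archives/example.wacz")
checksum = response["ETag"].strip('"')    # used as the file checksum
last_modified = response["LastModified"]  # datetime of the last modification
```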

This means that we will be responsible for implementing this functionality ourselves. The only question that remains is whether we should extend the current stores or keep the approach we have now with helper functions. One way the first option could look is sketched below.
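A sketch of extending the S3 store, assuming S3FilesStore keeps its s3_client, bucket, and prefix attributes; read_file is a hypothetical method name, and the other stores would need equivalents:

```python
from scrapy.pipelines.files import S3FilesStore
from twisted.internet import threads


class DownloadableS3FilesStore(S3FilesStore):
    """Hypothetical subclass that adds retrieval next to stat_file."""

    def read_file(self, path):
        key_name = f"{self.prefix}{path}"

        def _download():
            # get_object fetches the body, unlike the head_object call
            # backing stat_file.
            response = self.s3_client.get_object(Bucket=self.bucket, Key=key_name)
            return response["Body"].read()

        # Run the blocking boto3 call in a thread, mirroring how
        # stat_file wraps its boto3 call.
        return threads.deferToThread(_download)
```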

@wvengen
Member

wvengen commented Oct 28, 2024

It sounds sensible to take the shortest route (in terms of effort needed) to get this working in a reasonable time. Eventually, this would best be added to Scrapy itself, so that other extensions/middlewares can make use of it too; either way the work remains useful.

Perhaps an order of things could be:

  1. Get clarity on what we'd need from an upstream change.
  2. Get clarity on whether stat_file and download_file (and whatever the previous step surfaces) would be welcome in Scrapy, e.g. by opening an issue there to consider their inclusion.
  3. If yes, and it looks doable without a long review process and many technical intricacies, try to get it upstream. If no, document our findings in the upstream issue and solve the issue in this project. Ideally the functionality would still move to Scrapy eventually (unless upstream decides it is out of scope).

leewesleyv mentioned this issue Oct 28, 2024