
Browsertrix: support multi-wacz crawls #59

Open
makew0rld opened this issue Jul 22, 2024 · 1 comment
Assignees
Labels
enhancement New feature or request

Comments

makew0rld commented Jul 22, 2024

Multiple WACZs are created for crawls every 10 GB, and also if there are multiple crawler instances. This scenario needs to be tested to see what the webhook request looks like and how to handle it. Currently the code will definitely not handle it correctly.

What it should look like is multiple entries in the resources array, each with its own download link.
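A minimal sketch of what handling that payload could look like, assuming the webhook's resources array uses `name`/`path`/`size` fields as in single-WACZ payloads (the field names here are assumptions, not confirmed against a real multi-WACZ webhook):

```python
# Hypothetical handler for a Browsertrix crawl-finished webhook whose
# "resources" array may contain more than one WACZ entry. Field names
# ("resources", "path") are assumed from single-WACZ payloads.

def wacz_download_urls(payload: dict) -> list[str]:
    """Return every WACZ download URL found in the webhook payload."""
    resources = payload.get("resources", [])
    return [r["path"] for r in resources if "path" in r]

# Example payload with two WACZ parts (crawl split at the 10 GB boundary).
payload = {
    "event": "crawlFinished",
    "resources": [
        {"name": "part-0.wacz", "path": "https://example.com/part-0.wacz"},
        {"name": "part-1.wacz", "path": "https://example.com/part-1.wacz"},
    ],
}
print(wacz_download_urls(payload))
```

The point is just that the consumer must loop over all entries instead of assuming `resources[0]` is the whole crawl.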

@makew0rld makew0rld added the enhancement New feature or request label Jul 22, 2024
@makew0rld makew0rld self-assigned this Jul 22, 2024
makew0rld commented Aug 15, 2024

After chatting on Discord, it looks like using the download API instead of the resource URL gives us a single-file "multi-WACZ". This should solve the underlying semantic problem of how to deal with multiple WACZ files for one crawl. Code changes are still needed to actually use this API.
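As a rough sketch, the code change would mean building the download-API URL from the crawl ID rather than reading a resource URL out of the webhook. The route below is an assumption for illustration; the actual endpoint should be checked against the Browsertrix API docs for the deployed version:

```python
# Sketch: construct the download-API URL for a crawl's combined multi-WACZ.
# The "/api/orgs/{org}/crawls/{crawl}/download" route is a hypothetical
# placeholder -- verify against the real Browsertrix API before using.
from urllib.parse import urljoin


def multi_wacz_url(base: str, org_id: str, crawl_id: str) -> str:
    """Return the (assumed) download endpoint for a crawl's multi-WACZ."""
    return urljoin(base, f"/api/orgs/{org_id}/crawls/{crawl_id}/download")


print(multi_wacz_url("https://app.browsertrix.com", "my-org", "crawl-123"))
```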


The problem is that for a single-WACZ crawl, this means we are downloading an unnecessary wrapper file, the multi-WACZ: a ZIP containing a WACZ. ReplayWeb.page supports this format, but is this what we want to standardize on internally? Idk...
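If we did want to avoid storing the wrapper for single-WACZ crawls, one option is to unwrap it on download. A sketch, assuming the wrapper is a ZIP whose top-level entries are the `.wacz` files (entry layout is an assumption; adjust if the format nests them in a directory):

```python
# Sketch: if downloaded bytes are a multi-WACZ wrapper (a ZIP containing
# .wacz entries) holding exactly one WACZ, return the inner WACZ instead.
# Otherwise return the bytes unchanged. Note a plain WACZ is itself a ZIP,
# but its entries (datapackage.json, archive/, pages/) don't end in .wacz.
import io
import zipfile


def unwrap_single_wacz(data: bytes) -> bytes:
    """Return the inner WACZ if `data` wraps exactly one; else `data`."""
    try:
        with zipfile.ZipFile(io.BytesIO(data)) as zf:
            inner = [n for n in zf.namelist() if n.endswith(".wacz")]
            if len(inner) == 1:
                return zf.read(inner[0])
    except zipfile.BadZipFile:
        pass  # not a ZIP at all; pass through untouched
    return data


# Toy demonstration: build a wrapper ZIP holding one fake WACZ.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("crawl.wacz", b"fake wacz bytes")
print(unwrap_single_wacz(buf.getvalue()))
```

This keeps a single `.wacz` as the internal standard while still using the download API for the multi-WACZ case.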
