
Browsertrix: support multi-wacz crawls #59

Open
makew0rld opened this issue Jul 22, 2024 · 1 comment
Assignees
Labels
enhancement New feature or request

Comments

makew0rld commented Jul 22, 2024

Multiple WACZs are created for crawls every 10 GB, and also if there are multiple crawler instances. This scenario needs to be tested to see what the webhook request looks like and how to handle it. Currently the code will definitely not handle it correctly.

What it should look like is multiple entries in the resources array, each with its own download link.
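A minimal sketch of what handling that payload could look like, assuming the webhook's resources array uses `name`/`path`/`size` fields as in single-WACZ payloads (the field names here are assumptions, not confirmed against a real multi-WACZ webhook):

```python
# Hypothetical handler for a Browsertrix crawl-finished webhook whose
# "resources" array may contain more than one WACZ entry. Field names
# ("resources", "path") are assumed from single-WACZ payloads.

def wacz_download_urls(payload: dict) -> list[str]:
    """Return every WACZ download URL found in the webhook payload."""
    resources = payload.get("resources", [])
    return [r["path"] for r in resources if "path" in r]

# Example payload with two WACZ parts (crawl split at the 10 GB boundary).
payload = {
    "event": "crawlFinished",
    "resources": [
        {"name": "part-0.wacz", "path": "https://example.com/part-0.wacz"},
        {"name": "part-1.wacz", "path": "https://example.com/part-1.wacz"},
    ],
}
print(wacz_download_urls(payload))
```

The point is just that the consumer must loop over all entries instead of assuming `resources[0]` is the whole crawl.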

@makew0rld makew0rld added the enhancement New feature or request label Jul 22, 2024
@makew0rld makew0rld self-assigned this Jul 22, 2024
makew0rld commented Aug 15, 2024

After chatting on Discord, it looks like using the download API instead of the resource URL gives us a single-file "multi-WACZ". This should solve the underlying semantic problem of how to deal with multiple WACZ files for one crawl. Code changes are still needed to actually use this API.
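As a rough sketch, the code change would mean building the download-API URL from the crawl ID rather than reading a resource URL out of the webhook. The route below is an assumption for illustration; the actual endpoint should be checked against the Browsertrix API docs for the deployed version:

```python
# Sketch: construct the download-API URL for a crawl's combined multi-WACZ.
# The "/api/orgs/{org}/crawls/{crawl}/download" route is a hypothetical
# placeholder -- verify against the real Browsertrix API before using.
from urllib.parse import urljoin


def multi_wacz_url(base: str, org_id: str, crawl_id: str) -> str:
    """Return the (assumed) download endpoint for a crawl's multi-WACZ."""
    return urljoin(base, f"/api/orgs/{org_id}/crawls/{crawl_id}/download")


print(multi_wacz_url("https://app.browsertrix.com", "my-org", "crawl-123"))
```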


The problem is that for a single-WACZ crawl, this means we are downloading an unnecessary wrapper file, the multi-WACZ: a ZIP containing a WACZ. ReplayWeb.page supports this format, but is this what we want to standardize on internally? Idk...
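If we did want to avoid storing the wrapper for single-WACZ crawls, one option is to unwrap it on download. A sketch, assuming the wrapper is a ZIP whose top-level entries are the `.wacz` files (entry layout is an assumption; adjust if the format nests them in a directory):

```python
# Sketch: if downloaded bytes are a multi-WACZ wrapper (a ZIP containing
# .wacz entries) holding exactly one WACZ, return the inner WACZ instead.
# Otherwise return the bytes unchanged. Note a plain WACZ is itself a ZIP,
# but its entries (datapackage.json, archive/, pages/) don't end in .wacz.
import io
import zipfile


def unwrap_single_wacz(data: bytes) -> bytes:
    """Return the inner WACZ if `data` wraps exactly one; else `data`."""
    try:
        with zipfile.ZipFile(io.BytesIO(data)) as zf:
            inner = [n for n in zf.namelist() if n.endswith(".wacz")]
            if len(inner) == 1:
                return zf.read(inner[0])
    except zipfile.BadZipFile:
        pass  # not a ZIP at all; pass through untouched
    return data


# Toy demonstration: build a wrapper ZIP holding one fake WACZ.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("crawl.wacz", b"fake wacz bytes")
print(unwrap_single_wacz(buf.getvalue()))
```

This keeps a single `.wacz` as the internal standard while still using the download API for the multi-WACZ case.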
