Update Readd Pages Background Job #2396

ikreymer · 2025-02-15T19:35:12Z

The re-add pages background job should be updated in the following ways, to ensure all pages are in the right format for use with #2347.

Based on issues running migration with background jobs, here's a list of things that should be improved:

- The readd pages job should have more memory to avoid OOM errors, perhaps as much as the backend pod.
- It should handle existing data from QA review, perhaps either by upserting, or by storing the data separately and re-adding it after import?
- Handle any WACZ loading errors and retry.
- ~~Handle duplicate pages somehow: if the same WACZ is uploaded multiple times, generate a unique id for the page and store page id in a pageId field?~~
- Other issues observed: `distinct()` in Mongo fails due to large size; not sure if there's a good solution yet.
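For the "handle any WACZ loading errors and retry" item, a minimal sketch of retrying with exponential backoff might look like the following. The names `WaczLoadError` and `load_with_retry` are hypothetical and not taken from the Browsertrix codebase, and the retry count and delays are placeholders.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class WaczLoadError(Exception):
    """Hypothetical error type raised when a WACZ file cannot be read."""


def load_with_retry(
    load: Callable[[], T],
    retries: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call load(), retrying on WaczLoadError with exponential backoff.

    Makes up to retries + 1 attempts in total; `sleep` is injectable so the
    delays can be observed or skipped in tests.
    """
    for attempt in range(retries + 1):
        try:
            return load()
        except WaczLoadError:
            if attempt == retries:
                raise  # out of attempts: surface the error to the caller
            sleep(base_delay * 2**attempt)
    # Only reachable if retries is negative, which is a caller error.
    raise ValueError("retries must be >= 0")
```

Errors other than `WaczLoadError` are deliberately not retried here, so an unexpected failure still surfaces right away.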

ikreymer · 2025-02-15T20:18:11Z

Jotting down some more impl thoughts:

- Create a temporary collection per crawl / org, `qa_data_<crawl_id>`, for any pages that have QA data (`approved`, `notes`, `userid`, `modified`), then re-add that data after the pages are reimported. This could be done with two aggregations, I think.
- Uploads avoid duplicate pages by always using a new page id, via Upload Fixes: #2397
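The "two aggregations" idea could be sketched roughly as below: one pipeline that copies QA fields into the temporary collection with `$out`, and one that is run on the temporary collection to fold them back into the pages with `$merge`. This is only an illustration. The `crawl_id` field name, the `pages` collection name, and the assumption that a page keeps the same `_id` after being reimported are all assumptions, not confirmed by this issue.

```python
from typing import Any, Dict, List

# QA-related fields named in the comment above.
QA_FIELDS = ("approved", "notes", "userid", "modified")


def qa_collection_name(crawl_id: str) -> str:
    """Temporary collection name, following the qa_data_<crawl_id> idea."""
    return f"qa_data_{crawl_id}"


def stash_qa_pipeline(crawl_id: str) -> List[Dict[str, Any]]:
    """Pipeline to run on the pages collection BEFORE pages are re-added."""
    return [
        # Keep only this crawl's pages where at least one QA field is present
        # and not null (an empty notes list would also match).
        {
            "$match": {
                "crawl_id": crawl_id,  # assumed field name
                "$or": [{field: {"$ne": None}} for field in QA_FIELDS],
            }
        },
        # Keep _id (included by default) plus the QA fields only.
        {"$project": {field: 1 for field in QA_FIELDS}},
        # Write the result to the temporary collection, replacing its contents.
        {"$out": qa_collection_name(crawl_id)},
    ]


def restore_qa_pipeline(pages_collection: str = "pages") -> List[Dict[str, Any]]:
    """Pipeline to run on the temporary collection AFTER pages are re-added."""
    return [
        {
            "$merge": {
                "into": pages_collection,  # assumed collection name
                "on": "_id",  # assumes page ids are stable across reimport
                "whenMatched": "merge",  # copy QA fields onto the page
                "whenNotMatched": "discard",  # page no longer exists: skip it
            }
        }
    ]
```

Each pipeline would be passed to the driver's `aggregate()` call on the corresponding collection, and the temporary collection dropped afterwards.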

- Ensure upload pages are always added with a new UUID, to avoid any duplicates with existing uploads, even if the upload WACZ is actually a crawl from a different Browsertrix instance, etc.
- Clean up upload names with slugify, which also replaces spaces; this fixes uploading WACZ filenames with spaces in them.
- Part of the fix for #2396.
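As a rough illustration of the two upload fixes, the sketch below assigns a fresh UUID to each uploaded page and cleans a filename so it contains no spaces. `simple_slugify` is a simplified stand-in written with the standard library, not the slugify implementation the PR uses, and the `id` key and the choice to preserve the file extension are assumptions.

```python
import re
import unicodedata
from typing import Any, Dict
from uuid import uuid4


def simple_slugify(text: str) -> str:
    """Simplified stand-in for a slugify helper (assumption, not the real one)."""
    # Drop accents and any other non-ASCII characters.
    ascii_text = (
        unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")
    )
    # Collapse every run of non-alphanumerics (including spaces) into one hyphen.
    return re.sub(r"[^a-zA-Z0-9]+", "-", ascii_text).strip("-").lower()


def clean_upload_name(filename: str) -> str:
    """Slugify the name but keep the extension (keeping it is an assumption)."""
    stem, dot, ext = filename.rpartition(".")
    if not dot:
        return simple_slugify(filename)
    return f"{simple_slugify(stem)}.{ext.lower()}"


def with_new_page_id(page: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of the page with a fresh UUID, ignoring any id in the WACZ."""
    return {**page, "id": uuid4()}
```

Because the id is always regenerated, uploading the same WACZ twice yields two distinct sets of page ids rather than colliding ones.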

tw4l · 2025-02-17T19:44:26Z

Related PR #2400 moves re-adding pages from background jobs into the migration to avoid later migrations running before pages have been re-added.

- Ensure upload pages are always added with a new UUID, to avoid any duplicates with existing uploads, even if the upload WACZ is actually a crawl from a different Browsertrix instance, etc.
- Clean up upload names with slugify, which also replaces spaces; this fixes uploading WACZ filenames with spaces in them.
- Part of the fix for #2396.

Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>

Related to #2396

Changes to migration 0037:

- Re-adds pages in the migration rather than in a background job, to avoid a race condition with later migrations
- Re-adds pages for all uploads in all orgs

Fix for readd pages for org:

- Ensure org filter is applied!
- Fix wrong type
- Remove distinct, use iterator to iterate over crawls faster.

Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
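The "remove distinct, use iterator" change could look something like the sketch below: instead of collecting every id into one large result, the crawls are streamed one at a time, with the org filter built into the query. The function names, the `oid` field, and the `readd` callback are hypothetical. With a real driver, the query would be passed to `find()` and the resulting cursor iterated in the same way.

```python
from typing import Any, Callable, Dict, Iterable, Optional


def build_crawl_query(oid: Optional[str] = None) -> Dict[str, Any]:
    """Build the crawl filter; the org filter is applied whenever an org is given."""
    query: Dict[str, Any] = {}
    if oid is not None:
        query["oid"] = oid  # assumed field name for the org id
    return query


def readd_pages_for_crawls(
    crawls: Iterable[Dict[str, Any]],
    readd: Callable[[str], None],
) -> int:
    """Stream over crawls one at a time and re-add pages for each.

    `crawls` can be any iterable, for example a database cursor, so the full
    list of ids is never held in memory or in a single response document.
    """
    count = 0
    for crawl in crawls:
        readd(crawl["_id"])
        count += 1
    return count
```

A usage example with an in-memory list standing in for a cursor:

```python
docs = [{"_id": "c1", "oid": "o1"}, {"_id": "c2", "oid": "o2"}]
query = build_crawl_query("o1")
matching = (d for d in docs if all(d.get(k) == v for k, v in query.items()))
readd_pages_for_crawls(matching, readd=print)
```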

ikreymer · 2025-02-25T07:25:17Z

Addressed via #2400, #2406

ikreymer assigned tw4l Feb 15, 2025

github-project-automation bot added this to Webrecorder Projects Feb 15, 2025

github-project-automation bot moved this to Triage in Webrecorder Projects Feb 15, 2025

ikreymer moved this from Triage to Todo in Webrecorder Projects Feb 15, 2025

ikreymer mentioned this issue Feb 16, 2025

Upload Fixes: #2397

Merged

tw4l mentioned this issue Feb 17, 2025

Modify page upload migration #2400

Merged

ikreymer closed this as completed Feb 25, 2025

github-project-automation bot moved this from Todo to Done! in Webrecorder Projects Feb 25, 2025
