Update Readd Pages Background Job #2396

Closed
ikreymer opened this issue Feb 15, 2025 · 3 comments

ikreymer commented Feb 15, 2025

The re-add pages background job should be updated in the following ways, to ensure all pages are in the right format for use with #2347.

Based on issues running the migration with background jobs, here's a list of things that should be improved:

  • The re-add pages job should have more memory to avoid OOM errors, perhaps as much as the backend pod
  • It should handle existing data from QA review, perhaps by upserting, or by storing the data separately and re-adding it after import?
  • Handle any WACZ loading errors and retry
  • Handle duplicate pages somehow: if the same WACZ is uploaded multiple times, generate a unique id for the page and store the page id in a pageId field?
  • Other issues observed: distinct() in mongo fails due to large size; not sure if there's a good solution yet.
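
For the "handle any WACZ loading errors and retry" item, a minimal sketch of a retry wrapper with exponential backoff. The name `load_with_retry`, the default attempt count and delays, and the choice of `IOError` as the retryable error are all assumptions for illustration, not taken from the Browsertrix code:

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def load_with_retry(
    load: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 1.0,
    retry_on: Tuple[Type[BaseException], ...] = (IOError,),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call load(), retrying on the given errors with exponential backoff."""
    for attempt in range(attempts):
        try:
            return load()
        except retry_on:
            # Out of attempts: let the last error propagate to the caller.
            if attempt == attempts - 1:
                raise
            # Wait base_delay, then 2x, 4x, ... before the next attempt.
            sleep(base_delay * 2**attempt)
    raise ValueError("attempts must be at least 1")
```

The `sleep` parameter is only there so the wait can be swapped out in tests; a job would normally leave it at the default.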

ikreymer commented Feb 15, 2025

Jotting down some more impl thoughts:

  • Create a temp collection per crawl / org, qa_data_<crawl_id>, for any pages that have approved, notes, userid, or modified set, then re-add that data after the pages are reimported. Could be done with two aggregations, I think.
  • Uploads avoid duplicate pages by always using a new page id, via Upload Fixes: #2397
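
A rough sketch of what the "two aggregations" idea could look like, written as plain pipeline definitions. The `crawl_id` field name, the `pages` collection name, and both helper functions are assumptions for illustration. The restore step also assumes reimported pages keep the same `_id`, which may not hold:

```python
from typing import Any, Dict, List

# Fields named in the comment above as the QA review data to preserve.
QA_FIELDS = ["approved", "notes", "userid", "modified"]


def qa_stash_pipeline(crawl_id: str) -> List[Dict[str, Any]]:
    """Pipeline to run on the pages collection before pages are re-added."""
    return [
        {
            "$match": {
                "crawl_id": crawl_id,  # assumed field name
                # Keep pages where at least one QA field is present.
                "$or": [{field: {"$exists": True}} for field in QA_FIELDS],
            }
        },
        # _id is included by default, so only the QA fields are listed.
        {"$project": {field: 1 for field in QA_FIELDS}},
        # Write the result to the temp collection for this crawl.
        {"$out": f"qa_data_{crawl_id}"},
    ]


def qa_restore_pipeline(pages_collection: str = "pages") -> List[Dict[str, Any]]:
    """Pipeline to run on the temp collection after pages are reimported."""
    return [
        {
            "$merge": {
                "into": pages_collection,
                "on": "_id",
                # Merge QA fields into the matching reimported page.
                "whenMatched": "merge",
                # Skip QA data whose page no longer exists.
                "whenNotMatched": "discard",
            }
        }
    ]
```

Each pipeline would be passed to the driver's aggregate call on the relevant collection, and the temp collection dropped once the restore has run.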

ikreymer added a commit that referenced this issue Feb 16, 2025
- ensure upload pages are always added with a new uuid, to avoid any duplicates with existing uploads, even if the upload wacz is actually a crawl from a different browsertrix instance, etc.
- clean up upload names with slugify, which also replaces spaces; fixes uploading wacz filenames with spaces in them
- part of fix for #2396
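
As an illustration of the two changes in this commit message, a minimal sketch. `page_for_upload` and `simple_slug` are hypothetical helpers, the `id` key is an assumed field name, and `simple_slug` is only a stand-in for whatever slugify function the actual code uses (how that code treats file extensions is not shown here):

```python
import re
import uuid
from typing import Any, Dict


def simple_slug(name: str) -> str:
    """Stand-in for slugify: lowercase, runs of other characters become '-'."""
    return re.sub(r"[^a-z0-9]+", "-", name.lower()).strip("-")


def page_for_upload(page: Dict[str, Any]) -> Dict[str, Any]:
    """Copy a page record read from a WACZ, replacing its id with a fresh UUID."""
    new_page = dict(page)
    # Never reuse the id from the WACZ: the same file may be uploaded twice,
    # or may be a crawl from another instance with ids already in use.
    new_page["id"] = uuid.uuid4()
    return new_page
```

With this approach, uploading the same WACZ twice yields two page records that share a URL but never an id.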

tw4l commented Feb 17, 2025

Related PR #2400 moves re-adding pages from background jobs into the migration to avoid later migrations running before pages have been re-added.

ikreymer added a commit that referenced this issue Feb 17, 2025
- ensure upload pages are always added with a new uuid, to avoid any
duplicates with existing uploads, even if the upload wacz is actually a
crawl from a different browsertrix instance, etc.
- clean up upload names with slugify, which also replaces spaces; fixes
uploading wacz filenames with spaces in them
- part of fix for #2396

---------

Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
ikreymer added a commit that referenced this issue Feb 18, 2025
Related to #2396 

Changes to migration 0037:
- Re-adds pages in migration rather than in background job to avoid race
condition with later migrations
- Re-adds pages for all uploads in all orgs

Fix for readd pages for org:
- Ensure org filter is applied!
- Fix wrong type
- Remove distinct, use iterator to iterate over crawls faster.

---------

Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
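
For the "remove distinct, use iterator" and "ensure org filter is applied" fixes listed in the commit message above, a small sketch of the general shape. `crawl_ids_for_org` and the `oid` field name are assumptions for illustration. One likely reason distinct() fails at scale is that it returns every value in a single result, which is subject to MongoDB's document size limit, whereas iterating handles one crawl at a time:

```python
from typing import Any, Dict, Iterable, Iterator


def crawl_ids_for_org(crawls: Iterable[Dict[str, Any]], oid: str) -> Iterator[str]:
    """Yield the ids of crawls belonging to one org, one at a time."""
    for crawl in crawls:
        # Apply the org filter so crawls from other orgs are never touched.
        if crawl.get("oid") == oid:
            yield crawl["_id"]
```

In practice the org filter would more likely go into the query itself, so that the database cursor only returns matching crawls; the in-Python check here just keeps the sketch self-contained.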
@ikreymer
Member Author

Addressed via #2400, #2406

@github-project-automation github-project-automation bot moved this from Todo to Done! in Webrecorder Projects Feb 25, 2025