[Feature]: Convert migration 0042 to background job / versioned crawl objects #2406

ikreymer · 2025-02-18T18:09:46Z

What change would you like to see?

Migration 0042 should instead be converted to start a background job. The background job would then migrate each crawl and set a version on the crawl object. Migrating a crawl means reimporting its pages with the proper data.

The background job would do something like the following:

for each crawl where version != 2 and not migrating:
   if pages.count(crawl_id == crawl.id, filename == null) == 0:
      crawl.set(version, 2) # no pages missing a filename: already migrated
   else:
      # reimport pages for crawl
      pages.re_add_crawl_pages(crawl.id)
      crawl.set(version, 2)
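
A minimal in-memory Python sketch of the loop above, for illustration only. Crawls and pages are modeled as plain dicts; the helper names, the `migrating` flag handling, and the placeholder filename are assumptions made for this sketch, not the actual backend code.

```python
TARGET_VERSION = 2  # version assigned to migrated crawls in this proposal


def count_pages_missing_filename(pages, crawl_id):
    """Count pages of a crawl that have no filename yet."""
    return sum(
        1 for p in pages
        if p["crawl_id"] == crawl_id and p.get("filename") is None
    )


def re_add_crawl_pages(pages, crawl_id):
    """Hypothetical stand-in for reimporting a crawl's pages with full data."""
    for p in pages:
        if p["crawl_id"] == crawl_id and p.get("filename") is None:
            p["filename"] = f"{crawl_id}.wacz"  # placeholder value for the sketch


def migrate_crawls(crawls, pages):
    """Bring every unmigrated crawl to TARGET_VERSION.

    Returns the ids of crawls whose pages had to be reimported.
    """
    reimported = []
    for crawl in crawls:
        # Skip crawls that are already migrated or currently being migrated.
        if crawl.get("version") == TARGET_VERSION or crawl.get("migrating"):
            continue
        crawl["migrating"] = True
        try:
            if count_pages_missing_filename(pages, crawl["id"]) > 0:
                re_add_crawl_pages(pages, crawl["id"])
                reimported.append(crawl["id"])
            # Either way, the crawl now counts as migrated.
            crawl["version"] = TARGET_VERSION
        finally:
            crawl["migrating"] = False
    return reimported
```

Running `migrate_crawls` a second time on the same data does nothing, since every crawl is already at version 2. That is what would make retrying a failed job safe in this sketch.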

In /replay.json API endpoints, the pagesQueryUrl and initialResources are included only if version == 2 for all crawls.
New crawl objects would have version set to 2.
Migration 0042 would start this background job.
A new endpoint, /jobs/migrateCrawls, might be added to also start this job.
Could the job retry endpoint be used to retry the job if it fails?
/crawls/migrationNeeded could be added to check whether any crawls still need migration.
Self-deployment docs should be updated to mention this migration and how to check whether it succeeded.
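
The version check behind /replay.json and /crawls/migrationNeeded could be sketched as follows. This is a hedged illustration: the function names, the `resources` key, and the dict-based crawl model are assumptions for the example, and only `pagesQueryUrl`, `initialResources`, and the version value 2 come from the proposal.

```python
TARGET_VERSION = 2  # version assigned to migrated crawls in this proposal


def migration_needed(crawls):
    """True if any crawl has not yet reached TARGET_VERSION."""
    return any(c.get("version") != TARGET_VERSION for c in crawls)


def build_replay_json(resources, crawls, pages_query_url, initial_resources):
    """Build a replay.json-style dict (hypothetical shape).

    The optimized fields are included only when all crawls are migrated.
    """
    replay = {"resources": resources}
    if not migration_needed(crawls):
        replay["pagesQueryUrl"] = pages_query_url
        replay["initialResources"] = initial_resources
    return replay
```

With this shape, a collection containing even one unmigrated crawl falls back to the response without the optimized fields, which matches the "version == 2 for all crawls" condition.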

Context

Since migration 0042 (adding filenames and other data to pages) may take a long time and is an optimization for collections, we can instead convert this slow-running migration into a background job.

This would allow the 1.14 migration to finish quickly while crawls are migrated in the background. Unmigrated crawls can still be added to collections; they will just load slightly more slowly until they are migrated.

This is a trade-off between faster migration to 1.14 and migration of existing crawls to take advantage of optimized replay.

ikreymer added the enhancement (New feature or request) label Feb 18, 2025

ikreymer assigned tw4l Feb 18, 2025

github-project-automation bot added this to Webrecorder Projects Feb 18, 2025

github-project-automation bot moved this to Triage in Webrecorder Projects Feb 18, 2025

ikreymer moved this from Triage to Todo in Webrecorder Projects Feb 18, 2025

tw4l moved this from Todo to Implementing in Webrecorder Projects Feb 18, 2025

tw4l mentioned this issue Feb 18, 2025

Rework crawl page migration #2412

Merged

2 tasks

tw4l moved this from Implementing to In Review in Webrecorder Projects Feb 18, 2025

ikreymer closed this as completed in #2412 Feb 20, 2025

ikreymer closed this as completed in f8fb2d2 Feb 20, 2025

github-project-automation bot moved this from In Review to Done! in Webrecorder Projects Feb 20, 2025

ikreymer mentioned this issue Feb 25, 2025

Update Readd Pages Background Job #2396

Closed
