
QA Backend: Add pages API endpoints, including support for manual review #1502

Closed
4 tasks done
tw4l opened this issue Jan 30, 2024 · 0 comments · Fixed by #1516
Labels: back end (Requires back end dev work)


tw4l commented Jan 30, 2024

- [x] Add `crawls/<crawl_id>/pages` API endpoints (list, get, patch for manual review)
- [x] Populate pages in database during crawl
- [x] Add migration to populate pages from previously finished crawls
- [x] Add ops method to add QA crawl data to page
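The list/get/patch semantics described in the task list can be sketched without the web framework. This is a framework-free illustration only: the real Browsertrix backend uses FastAPI with MongoDB, and the `PageStore`/`Page` names and fields here are assumptions, not the actual implementation.

```python
# Illustrative in-memory sketch of pages list/get/patch (manual review)
# semantics. Names and fields are assumptions, not the Browsertrix code.
from dataclasses import dataclass
from typing import Optional
from uuid import UUID, uuid4


@dataclass
class Page:
    id: UUID
    crawl_id: str
    url: str
    title: Optional[str] = None
    approved: Optional[bool] = None  # manual review verdict; None = unreviewed


class PageStore:
    """In-memory stand-in for the pages database collection."""

    def __init__(self) -> None:
        self._pages: dict[UUID, Page] = {}

    def add(self, crawl_id: str, url: str, title: Optional[str] = None) -> Page:
        page = Page(id=uuid4(), crawl_id=crawl_id, url=url, title=title)
        self._pages[page.id] = page
        return page

    def list_pages(self, crawl_id: str) -> list[Page]:
        # GET /crawls/<crawl_id>/pages
        return [p for p in self._pages.values() if p.crawl_id == crawl_id]

    def get(self, crawl_id: str, page_id: UUID) -> Page:
        # GET /crawls/<crawl_id>/pages/<page_id>
        page = self._pages[page_id]
        if page.crawl_id != crawl_id:
            raise KeyError(page_id)
        return page

    def patch_review(self, crawl_id: str, page_id: UUID, approved: bool) -> Page:
        # PATCH: record a manual review verdict on the page
        page = self.get(crawl_id, page_id)
        page.approved = approved
        return page

    def delete_crawl_pages(self, crawl_id: str) -> int:
        # Remove all pages for a crawl (e.g. when the crawl is deleted)
        ids = [pid for pid, p in self._pages.items() if p.crawl_id == crawl_id]
        for pid in ids:
            del self._pages[pid]
        return len(ids)
```

In the real service these operations sit behind authenticated HTTP routes; the sketch only shows the data flow the tasks describe.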
@tw4l tw4l added the back end Requires back end dev work label Jan 30, 2024
@tw4l tw4l moved this from Triage to Ready in Webrecorder Projects Jan 30, 2024
@tw4l tw4l moved this from Ready to Implementing in Webrecorder Projects Jan 30, 2024
@Shrinks99 Shrinks99 added this to the v1.10.0 milestone Jan 31, 2024
@ikreymer ikreymer moved this from Implementing to Todo in Webrecorder Projects Feb 21, 2024
@tw4l tw4l moved this from Todo to In Review in Webrecorder Projects Feb 21, 2024
@tw4l tw4l changed the title from "QA Backend: Add pages API endpoints" to "QA Backend: Add pages API endpoints, including support for manual review" Feb 21, 2024
tw4l added a commit that referenced this issue Feb 28, 2024
Fixes #1502 

- Adds pages to database as they are added to Redis during the crawl
- Adds migration to add pages to database for older crawls, from the pages.jsonl and extraPages.jsonl files in the WACZ
- Adds GET, list GET, and PATCH update endpoints for pages
- Adds POST (add), PATCH (update), and POST (delete) endpoints for page notes, each with their own id, timestamp, and user info in addition to text
- Adds page_ops methods for (1) adding resources/URLs to a page and (2) adding automated heuristics and supplemental info (mime, type, etc.) to a page, for use in the crawl QA job
- Modifies the `Migration` class to accept kwargs, so that ops classes can be passed in as needed for migrations
- Deletes WACZ files and pages from the database for failed crawls during the crawl_finished process
- Deletes crawl pages when a crawl is deleted
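The page-notes endpoints above attach per-note metadata (id, timestamp, user) alongside the text. A minimal sketch of that shape, under the assumption of illustrative field names (the actual Browsertrix models may differ):

```python
# Sketch of page notes with their own id, timestamp, and user info,
# as described in the commit message. Field names are assumptions.
from dataclasses import dataclass
from datetime import datetime, timezone
from uuid import UUID, uuid4


@dataclass
class PageNote:
    id: UUID
    created: datetime
    user_id: str
    user_name: str
    text: str


class PageNotes:
    """Add, update, and delete notes on a single page."""

    def __init__(self) -> None:
        self.notes: list[PageNote] = []

    def add(self, user_id: str, user_name: str, text: str) -> PageNote:
        # POST (add): each note gets a fresh id and UTC timestamp
        note = PageNote(
            id=uuid4(),
            created=datetime.now(timezone.utc),
            user_id=user_id,
            user_name=user_name,
            text=text,
        )
        self.notes.append(note)
        return note

    def update(self, note_id: UUID, text: str) -> PageNote:
        # PATCH (update): replace the text of an existing note by id
        for note in self.notes:
            if note.id == note_id:
                note.text = text
                return note
        raise KeyError(note_id)

    def delete(self, note_ids: list[UUID]) -> int:
        # POST (delete): remove notes by id, returning how many were removed
        wanted = set(note_ids)
        before = len(self.notes)
        self.notes = [n for n in self.notes if n.id not in wanted]
        return before - len(self.notes)
```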

Note: Requires crawler version 1.0.0 beta 3 or later, with support for
`--writePagesToRedis` to populate pages at crawl completion. Beta 4 is
configured in the test chart and should be upgraded to stable 1.0.0 when
it is released.

Connected to webrecorder/browsertrix-crawler#464

---------

Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
@github-project-automation github-project-automation bot moved this from In Review to Done! in Webrecorder Projects Feb 28, 2024