
QA Backend: Add pages API endpoints, including support for manual review #1502

Closed
4 tasks done
tw4l opened this issue Jan 30, 2024 · 0 comments · Fixed by #1516
Labels: back end (Requires back end dev work)


tw4l commented Jan 30, 2024

- [x] Add `crawls/<crawl_id>/pages` API endpoints (list, get, patch for manual review)
- [x] Populate pages in database during crawl
- [x] Add migration to populate pages from previously finished crawls
- [x] Add ops method to add QA crawl data to page
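The list/get/patch semantics described in the task list can be sketched without the web framework. This is a framework-free illustration only: the real Browsertrix backend uses FastAPI with MongoDB, and the `PageStore`/`Page` names and fields here are assumptions, not the actual implementation.

```python
# Illustrative in-memory sketch of pages list/get/patch (manual review)
# semantics. Names and fields are assumptions, not the Browsertrix code.
from dataclasses import dataclass
from typing import Optional
from uuid import UUID, uuid4


@dataclass
class Page:
    id: UUID
    crawl_id: str
    url: str
    title: Optional[str] = None
    approved: Optional[bool] = None  # manual review verdict; None = unreviewed


class PageStore:
    """In-memory stand-in for the pages database collection."""

    def __init__(self) -> None:
        self._pages: dict[UUID, Page] = {}

    def add(self, crawl_id: str, url: str, title: Optional[str] = None) -> Page:
        page = Page(id=uuid4(), crawl_id=crawl_id, url=url, title=title)
        self._pages[page.id] = page
        return page

    def list_pages(self, crawl_id: str) -> list[Page]:
        # GET /crawls/<crawl_id>/pages
        return [p for p in self._pages.values() if p.crawl_id == crawl_id]

    def get(self, crawl_id: str, page_id: UUID) -> Page:
        # GET /crawls/<crawl_id>/pages/<page_id>
        page = self._pages[page_id]
        if page.crawl_id != crawl_id:
            raise KeyError(page_id)
        return page

    def patch_review(self, crawl_id: str, page_id: UUID, approved: bool) -> Page:
        # PATCH: record a manual review verdict on the page
        page = self.get(crawl_id, page_id)
        page.approved = approved
        return page

    def delete_crawl_pages(self, crawl_id: str) -> int:
        # Remove all pages for a crawl (e.g. when the crawl is deleted)
        ids = [pid for pid, p in self._pages.items() if p.crawl_id == crawl_id]
        for pid in ids:
            del self._pages[pid]
        return len(ids)
```

In the real service these operations sit behind authenticated HTTP routes; the sketch only shows the data flow the tasks describe.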
@tw4l tw4l added the back end Requires back end dev work label Jan 30, 2024
@tw4l tw4l moved this from Triage to Ready in Webrecorder Projects Jan 30, 2024
@tw4l tw4l moved this from Ready to Implementing in Webrecorder Projects Jan 30, 2024
@Shrinks99 Shrinks99 added this to the v1.10.0 milestone Jan 31, 2024
@ikreymer ikreymer moved this from Implementing to Todo in Webrecorder Projects Feb 21, 2024
@tw4l tw4l moved this from Todo to In Review in Webrecorder Projects Feb 21, 2024
@tw4l tw4l changed the title from "QA Backend: Add pages API endpoints" to "QA Backend: Add pages API endpoints, including support for manual review" Feb 21, 2024
tw4l added a commit that referenced this issue Feb 28, 2024
Fixes #1502 

- Adds pages to database as they are added to Redis during the crawl
- Adds migration to add pages to database for older crawls, from the pages.jsonl and extraPages.jsonl files in the WACZ
- Adds GET, list GET, and PATCH update endpoints for pages
- Adds POST (add), PATCH (update), and POST (delete) endpoints for page notes, each with their own id, timestamp, and user info in addition to text
- Adds page_ops methods for (1) adding resources/URLs to a page and (2) adding automated heuristics and supplemental info (mime, type, etc.) to a page, for use in the crawl QA job
- Modifies the `Migration` class to accept kwargs, so that ops classes can be passed in as needed for migrations
- Deletes WACZ files and pages from the database for failed crawls during the crawl_finished process
- Deletes crawl pages when a crawl is deleted
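The page-notes endpoints above attach per-note metadata (id, timestamp, user) alongside the text. A minimal sketch of that shape, under the assumption of illustrative field names (the actual Browsertrix models may differ):

```python
# Sketch of page notes with their own id, timestamp, and user info,
# as described in the commit message. Field names are assumptions.
from dataclasses import dataclass
from datetime import datetime, timezone
from uuid import UUID, uuid4


@dataclass
class PageNote:
    id: UUID
    created: datetime
    user_id: str
    user_name: str
    text: str


class PageNotes:
    """Add, update, and delete notes on a single page."""

    def __init__(self) -> None:
        self.notes: list[PageNote] = []

    def add(self, user_id: str, user_name: str, text: str) -> PageNote:
        # POST (add): each note gets a fresh id and UTC timestamp
        note = PageNote(
            id=uuid4(),
            created=datetime.now(timezone.utc),
            user_id=user_id,
            user_name=user_name,
            text=text,
        )
        self.notes.append(note)
        return note

    def update(self, note_id: UUID, text: str) -> PageNote:
        # PATCH (update): replace the text of an existing note by id
        for note in self.notes:
            if note.id == note_id:
                note.text = text
                return note
        raise KeyError(note_id)

    def delete(self, note_ids: list[UUID]) -> int:
        # POST (delete): remove notes by id, returning how many were removed
        wanted = set(note_ids)
        before = len(self.notes)
        self.notes = [n for n in self.notes if n.id not in wanted]
        return before - len(self.notes)
```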

Note: Requires crawler version 1.0.0 beta 3 or later, with support for
`--writePagesToRedis` to populate pages at crawl completion. Beta 4 is
configured in the test chart and should be upgraded to stable 1.0.0 when
it is released.

Connected to webrecorder/browsertrix-crawler#464

---------

Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
@github-project-automation github-project-automation bot moved this from In Review to Done! in Webrecorder Projects Feb 28, 2024