[Feature]: Only Archive New URLs (Incremental crawling) #1372
Comments
From today's call:

Ilya: Instead of manually specifying URLs that should be crawled every time, could this be accomplished with an extra hops function instead?

Hank: This seems like it would have the tradeoff of only capturing new content if it is not very deep within the site, OR having to re-crawl a lot of pages to obtain that result. Part of the goal here is enabling the crawling of "large websites" without wasting execution time visiting lots of already-captured content that the user doesn't wish to re-capture. The upside to the extra-hops method might be that it's simpler than having to figure out all the pages that are "index pages" on a site where new content appears. I think the added complexity is advantageous.

Emma: Might it be useful to attach an expiration date, after which previously visited pages are re-crawled if they are found again?

Hank: I like this a lot!
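A minimal sketch of how Emma's expiration idea could work, assuming the crawler keeps a record of previously captured URLs and their last capture times (the names previously_captured, should_queue, and max_age are illustrative assumptions, not an existing Browsertrix API):

```python
from datetime import datetime, timedelta

# Hypothetical record of previously captured URLs -> last capture time,
# e.g. loaded from the page lists of earlier crawls in the same workflow.
previously_captured: dict[str, datetime] = {}

def should_queue(url: str, max_age: timedelta) -> bool:
    """Queue a URL if it was never captured, or if its last capture has expired."""
    last_capture = previously_captured.get(url)
    if last_capture is None:
        return True  # new URL: always capture
    return datetime.utcnow() - last_capture > max_age  # expired: re-capture

# Example: re-capture pages not seen in the last 90 days
if should_queue("https://example.com/blog/post-42", max_age=timedelta(days=90)):
    pass  # add to the crawl queue
```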
I think it would be advantageous to enable both options:
Blocked by #1502
@anjackson Pinging you here for (personal) thoughts on this method of enabling continuous crawling of sites that change content often with index pages that list new content.
I would like to expressly support this feature request. As a literature archive, we regularly crawl blogs and journal-type web sources 2-4 times a year, and limiting capture to new or updated pages would be a great help for large websites. As I understand it, this proposal is also about timestamps, whereas #1753 would only require a large number of URLs to be distributed across multiple crawls.
Context
Prior reading: https://anjackson.net/2023/06/09/what-makes-a-large-website-large/
Large websites are difficult to crawl completely on a regular schedule. Some websites are simply too large to capture in their entirety every time they are crawled. Large websites also contain lots of data that doesn't change, and when that is predictably the case, the value of re-capturing a given page many times is quite low.
Users have a limited amount of disk space and execution minutes, yet their crawl workflows often capture the same content multiple times; broad crawls waste both resources to achieve the narrow goal of capturing updated content.
What change would you like to see?
User stories

As a user, I want to be able to only capture new URLs, so my scheduled crawl workflows don't run for longer than they need to re-capturing content I have already archived completely.

As a user, when I need to edit a crawl workflow to add an additional page prefix in scope (for sites that might have example.com and cdn.example.com), I don't want to have to re-crawl all of example.com just to get the links I missed the first time, when I know the rest of those pages are good.
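To illustrate the second story, here is a rough sketch, not an existing Browsertrix option, of queueing only URLs that are in scope and not already captured, so that adding cdn.example.com to the scope does not force a full re-crawl of example.com (the names previously_captured and scope_prefixes are assumptions for the example):

```python
# URLs already archived in earlier runs of this workflow (hypothetical data)
previously_captured = {"https://example.com/", "https://example.com/about"}

# Scope prefixes for the workflow; the cdn prefix was added after the first crawl
scope_prefixes = ["https://example.com/", "https://cdn.example.com/"]

def in_scope(url: str) -> bool:
    return any(url.startswith(prefix) for prefix in scope_prefixes)

def should_queue(url: str) -> bool:
    # Only visit in-scope URLs that have not been captured before, so the
    # newly added cdn.example.com links get fetched without re-crawling
    # everything under example.com.
    return in_scope(url) and url not in previously_captured
```

Discovering the missed cdn.example.com links without re-fetching would presumably rely on link extraction from the already archived pages, which is part of what this issue asks for.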
Requirements
BONUS: Optimize the crawling algorithm to track which previously crawled URLs often have new URLs present on them, and prioritize those pages in the crawl queue to give time-limited workflows a better chance of capturing new content
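A sketch of what that prioritization might look like, assuming the crawler records per page how many previously unseen URLs each visit discovered; the yield statistics and scoring function here are illustrative assumptions, not an existing implementation:

```python
import heapq

# Hypothetical per-URL stats from earlier crawls: (visits, new URLs discovered)
new_url_yield = {
    "https://example.com/blog/": (12, 30),
    "https://example.com/about": (12, 0),
}

def priority(url: str) -> float:
    visits, new_found = new_url_yield.get(url, (0, 0))
    if visits == 0:
        return 1.0  # unknown pages get a neutral-high score
    return new_found / visits  # average new URLs discovered per visit

def build_queue(urls: list[str]) -> list[tuple[float, str]]:
    # heapq is a min-heap, so negate the score to pop high-yield pages first.
    heap = [(-priority(u), u) for u in urls]
    heapq.heapify(heap)
    return heap

queue = build_queue(["https://example.com/blog/", "https://example.com/about"])
while queue:
    _, url = heapq.heappop(queue)  # crawl high-yield index pages before static ones
```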
Todo