[Feature]: Only Archive New URLs (Incremental crawling) #1372
Comments
From today's call:

Ilya: Instead of manually specifying URLs that should be crawled every time, could this be accomplished with an extra hops function instead?

Hank: This seems like it would have the tradeoff of only capturing new content if it is not very deep within the site, OR having to re-crawl a lot of pages to obtain that result. Part of the goal here is enabling the crawling of "large websites" without wasting execution time visiting lots of already-captured content that the user doesn't wish to re-capture. The upside to the extra-hops method might be that it's simpler than having to figure out all the pages that are "index pages" on a site where new content appears. I think the added complexity is advantageous.

Emma: Might it be useful to attach an expiration date, after which previously visited pages are re-crawled if they are found again?

Hank: I like this a lot!
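A minimal sketch of how Emma's expiration idea could work, assuming the crawler keeps a record of previously captured URLs and their last capture times (the names previously_captured, should_queue, and max_age are illustrative assumptions, not an existing Browsertrix API):

```python
from datetime import datetime, timedelta

# Hypothetical record of previously captured URLs -> last capture time,
# e.g. loaded from the page lists of earlier crawls in the same workflow.
previously_captured: dict[str, datetime] = {}

def should_queue(url: str, max_age: timedelta) -> bool:
    """Queue a URL if it was never captured, or if its last capture has expired."""
    last_capture = previously_captured.get(url)
    if last_capture is None:
        return True  # new URL: always capture
    return datetime.utcnow() - last_capture > max_age  # expired: re-capture

# Example: re-capture pages not seen in the last 90 days
if should_queue("https://example.com/blog/post-42", max_age=timedelta(days=90)):
    pass  # add to the crawl queue
```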
I think it would be advantageous to enable both options:
Blocked by #1502
@anjackson Pinging you here for (personal) thoughts on this method of enabling continuous crawling of sites that change content often with index pages that list new content.
I would like to expressly support this feature request. As a literature archive, we regularly crawl blogs and journal-type web sources 2-4 times a year, and limiting capture to new or updated pages would be a great help for large websites. As I understand it, this proposal is also about timestamps, whereas #1753 would only require a large number of URLs to be distributed across multiple crawls.
Context
Prior reading: https://anjackson.net/2023/06/09/what-makes-a-large-website-large/
Large websites are difficult to crawl completely on a regular schedule. Some websites are simply too large to capture in their entirety every time they are crawled. Large websites also contain lots of data that doesn't change, and when that is predictably the case, the value of re-capturing a given page many times is quite low.
Users have a limited amount of disk space and execution minutes, yet their crawl workflows often capture the same content multiple times; broad crawls waste both resources to achieve the narrow goal of capturing updated content.
What change would you like to see?
User stories

As a user, I want to be able to only capture new URLs, so my scheduled crawl workflows don't run for longer than they need to re-capturing content I have already archived completely.

As a user, when I need to edit a crawl workflow to add an additional page prefix in scope (for sites that might have example.com and cdn.example.com), I don't want to have to re-crawl all of example.com just to get the links I missed the first time, when I know the rest of those pages are good.
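To illustrate the second story, here is a rough sketch, not an existing Browsertrix option, of queueing only URLs that are in scope and not already captured, so that adding cdn.example.com to the scope does not force a full re-crawl of example.com (the names previously_captured and scope_prefixes are assumptions for the example):

```python
# URLs already archived in earlier runs of this workflow (hypothetical data)
previously_captured = {"https://example.com/", "https://example.com/about"}

# Scope prefixes for the workflow; the cdn prefix was added after the first crawl
scope_prefixes = ["https://example.com/", "https://cdn.example.com/"]

def in_scope(url: str) -> bool:
    return any(url.startswith(prefix) for prefix in scope_prefixes)

def should_queue(url: str) -> bool:
    # Only visit in-scope URLs that have not been captured before, so the
    # newly added cdn.example.com links get fetched without re-crawling
    # everything under example.com.
    return in_scope(url) and url not in previously_captured
```

Discovering the missed cdn.example.com links without re-fetching would presumably rely on link extraction from the already archived pages, which is part of what this issue asks for.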
Requirements
BONUS: Optimize the crawling algorithm to track which previously crawled URLs often have new URLs present on them, and prioritize those pages in the crawl queue to give time-limited workflows a better chance of capturing new content
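A sketch of what that prioritization might look like, assuming the crawler records per page how many previously unseen URLs each visit discovered; the yield statistics and scoring function here are illustrative assumptions, not an existing implementation:

```python
import heapq

# Hypothetical per-URL stats from earlier crawls: (visits, new URLs discovered)
new_url_yield = {
    "https://example.com/blog/": (12, 30),
    "https://example.com/about": (12, 0),
}

def priority(url: str) -> float:
    visits, new_found = new_url_yield.get(url, (0, 0))
    if visits == 0:
        return 1.0  # unknown pages get a neutral-high score
    return new_found / visits  # average new URLs discovered per visit

def build_queue(urls: list[str]) -> list[tuple[float, str]]:
    # heapq is a min-heap, so negate the score to pop high-yield pages first.
    heap = [(-priority(u), u) for u in urls]
    heapq.heapify(heap)
    return heap

queue = build_queue(["https://example.com/blog/", "https://example.com/about"])
while queue:
    _, url = heapq.heappop(queue)  # crawl high-yield index pages before static ones
```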
Todo