
feat: Add max_crawl_depth property #637

Open · Prathamesh010 wants to merge 1 commit into master

Conversation

@Prathamesh010 (Contributor) commented Oct 31, 2024

Description

  • Implements "max crawl depth"

Issues

Testing

  • Added tests

Checklist

  • CI passed

@janbuchar (Collaborator) left a comment

This is pretty great! I found just some minor issues.

@@ -120,6 +120,9 @@ class BasicCrawlerOptions(TypedDict, Generic[TCrawlingContext]):
configure_logging: NotRequired[bool]
"""If True, the crawler will set up logging infrastructure automatically."""

max_crawl_depth: NotRequired[int | None]
"""Maximum crawl depth. If set, the crawler will stop crawling after reaching this depth."""
Collaborator:

We should probably elaborate on the edge cases - for example, with max_crawl_depth = 3, do we process three or four "levels" of links? I'd assume that the start requests have crawl_depth = 0, and then we go all the way up to 3 and don't enqueue any further links, but it would be much better to have that stated explicitly in the docs.

Contributor (Author):

Good point! Here’s a proposed docstring to make this clear:
Limits crawl depth from 0 (initial requests) up to the specified `max_crawl_depth`. Requests at the maximum depth are processed, but no further links are enqueued.

This would mean that, with max_crawl_depth = 3, requests will start at a crawl_depth of 0 and go up to 3, at which point new links won’t be enqueued. Does this align with what you had in mind, or are there any additional edge cases you’re concerned about?

Collaborator:

Yes, what you propose is perfect.

Comment on lines +184 to +189
if self._max_crawl_depth is not None and context.request.crawl_depth + 1 > self._max_crawl_depth:
context.log.info(
f'Skipping enqueue_links for URL "{context.request.url}" due to the maximum crawl depth limit.'
)
return
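
To make the arithmetic of this check concrete, here is a short trace with max_crawl_depth = 3 (an illustration, not part of the diff):

# crawl_depth 0 (start request): 0 + 1 = 1, not > 3  -> links are enqueued at depth 1
# crawl_depth 2:                 2 + 1 = 3, not > 3  -> links are enqueued at depth 3
# crawl_depth 3:                 3 + 1 = 4, > 3      -> enqueue_links returns without enqueueing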

Collaborator:

  1. The repetition in all crawlers is not great
    • it is very easy to overlook this when implementing a new crawler
  2. It is also possible to add requests to the queue via context.add_requests
    • ideally we should fill in a default depth for the requests - it doesn't make much sense to exempt some requests from max_crawl_depth
    • if there is no ergonomic way to support max depth in add_requests, we should make it clear in the docs that the user has to handle crawl depth manually

Contributor (Author):

Thank you for the feedback! I’d like to understand a bit more about the approach you have in mind.

When you mention “fill in a default depth for the requests,” could you elaborate on your suggestion? Are you envisioning a default of 0, or is there another baseline that would better ensure all requests adhere to max_crawl_depth?
Currently, all requests have a default crawl_depth of 0 if it is not set.

Regarding handling max depth in add_requests and avoiding repetition, should the check that keeps requests within max_crawl_depth go directly into the add_requests function of BasicCrawler?

Collaborator:

Yes, I think that the check could go directly to add_requests in BasicCrawler. Is there any potential issue?
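
A rough sketch of that idea follows; this is a hypothetical helper, not code from this PR, and it only assumes that each request object exposes the crawl_depth attribute already used in the hunk above:

from __future__ import annotations

from collections.abc import Sequence
from typing import Any


def filter_by_max_crawl_depth(requests: Sequence[Any], max_crawl_depth: int | None) -> list[Any]:
    """Drop requests whose crawl_depth already exceeds the configured limit.

    Doing this once inside BasicCrawler.add_requests would cover both
    enqueue_links and direct context.add_requests calls, instead of repeating
    the depth check in every concrete crawler.
    """
    if max_crawl_depth is None:
        return list(requests)
    return [request for request in requests if request.crawl_depth <= max_crawl_depth]

Requests added without explicit depth information would then simply keep the default crawl_depth of 0 mentioned above.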

Comment on lines +200 to +201
data = {'crawlDepth': context.request.crawl_depth + 1}
link_user_data.setdefault('__crawlee', CrawleeRequestData(**data))
Collaborator:

Suggested change
- data = {'crawlDepth': context.request.crawl_depth + 1}
- link_user_data.setdefault('__crawlee', CrawleeRequestData(**data))
+ link_user_data.crawlee = link_user_data.crawlee or CrawleeRequestData()
+ link_user_data.crawlee.crawl_depth = context.request.crawl_depth + 1

I'd prefer not fiddling with field aliases if possible.

Successfully merging this pull request may close these issues: Implement max crawl depth