feat: Add max_crawl_depth property #637
base: master
Conversation
This is pretty great! I found just some minor issues.
@@ -120,6 +120,9 @@ class BasicCrawlerOptions(TypedDict, Generic[TCrawlingContext]):
     configure_logging: NotRequired[bool]
     """If True, the crawler will set up logging infrastructure automatically."""
+
+    max_crawl_depth: NotRequired[int | None]
+    """Maximum crawl depth. If set, the crawler will stop crawling after reaching this depth."""
We should probably elaborate on the edge cases - for example, with max_crawl_depth = 3, do we process three or four "levels" of links? I'd assume that the start requests have crawl_depth = 0, and then we go all the way up to 3 and don't enqueue any further links, but it would be much better to have that stated explicitly in the docs.
Good point! Here’s a proposed docstring to make this clear:
Limits crawl depth from 0 (initial requests) up to the specified `max_crawl_depth`. Requests at the maximum depth are processed, but no further links are enqueued.
This would mean that, with max_crawl_depth = 3, requests will start at a crawl_depth of 0 and go up to 3, at which point new links won’t be enqueued. Does this align with what you had in mind, or are there any additional edge cases you’re concerned about?
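To make the behaviour concrete, here is a minimal usage sketch of the proposed semantics. It assumes the option is forwarded to the concrete crawlers through BasicCrawlerOptions as in this PR's diff; the import path follows the crawlee version this PR targets and the start URL is just a placeholder.

import asyncio

from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler, BeautifulSoupCrawlingContext


async def main() -> None:
    # With max_crawl_depth=3, start requests have crawl_depth=0, their links get
    # depth 1, then 2, then 3. Depth-3 requests are still processed, but calling
    # enqueue_links on them no longer adds anything to the queue.
    crawler = BeautifulSoupCrawler(max_crawl_depth=3)

    @crawler.router.default_handler
    async def handler(context: BeautifulSoupCrawlingContext) -> None:
        context.log.info(f'Visiting {context.request.url} at depth {context.request.crawl_depth}')
        await context.enqueue_links()

    await crawler.run(['https://example.com'])


if __name__ == '__main__':
    asyncio.run(main())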
Yes, what you propose is perfect.
if self._max_crawl_depth is not None and context.request.crawl_depth + 1 > self._max_crawl_depth:
    context.log.info(
        f'Skipping enqueue_links for URL "{context.request.url}" due to the maximum crawl depth limit.'
    )
    return
- the repetition in all crawlers is not great
- it is very easy to overlook this when implementing a new crawler
- it is also possible to add requests to the queue via context.add_requests
  - ideally we should fill in default depth to the requests - it doesn't make much sense to exempt some requests from max_crawl_depth
  - if there is no ergonomic way to support max depth in add_requests, we should make it clear in the docs that the user has to handle crawl depth manually
Thank you for the feedback! I’d like to understand a bit more about the approach you have in mind.
When you mention “fill in default depth to the requests,” could you elaborate on what you have in mind? Are you envisioning a default of 0, or is there another baseline that would better ensure all requests adhere to max_crawl_depth?
Currently, all requests have a default crawl_depth of 0 if not set.
Regarding handling max depth in add_requests and avoiding repetition, would you like the check to ensure requests stay within max_crawl_depth to be included directly in the add_requests function of BasicCrawler?
Yes, I think that the check could go directly to add_requests in BasicCrawler. Is there any potential issue?
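For illustration only, a sketch of the kind of filtering add_requests could apply centrally. filter_by_max_crawl_depth is a hypothetical helper, not code from this PR; it assumes Request.crawl_depth defaults to 0, as discussed above.

from __future__ import annotations

from collections.abc import Sequence

from crawlee import Request


def filter_by_max_crawl_depth(
    requests: Sequence[str | Request],
    max_crawl_depth: int | None,
) -> list[str | Request]:
    """Drop requests whose crawl_depth exceeds the configured limit.

    Plain URL strings carry no depth information and are kept; they would receive
    the default crawl_depth of 0 once converted to Request objects.
    """
    if max_crawl_depth is None:
        return list(requests)
    return [
        request
        for request in requests
        if isinstance(request, str) or request.crawl_depth <= max_crawl_depth
    ]

A check like this inside BasicCrawler.add_requests would cover both enqueue_links and direct context.add_requests calls without repeating the logic in every crawler.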
data = {'crawlDepth': context.request.crawl_depth + 1}
link_user_data.setdefault('__crawlee', CrawleeRequestData(**data))
Suggested change:
-data = {'crawlDepth': context.request.crawl_depth + 1}
-link_user_data.setdefault('__crawlee', CrawleeRequestData(**data))
+link_user_data.crawlee = link_user_data.crawlee or CrawleeRequestData()
+link_user_data.crawlee.crawl_depth = context.request.crawl_depth + 1
I'd prefer not fiddling with field aliases if possible.
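To spell out the trade-off, here are the two variants side by side with comments; the identifiers come from the snippets above, and depth stands in for context.request.crawl_depth.

depth = context.request.crawl_depth

# Alias-based variant: builds the kwargs through the 'crawlDepth' alias and only
# attaches metadata when the '__crawlee' key is not present yet.
link_user_data.setdefault('__crawlee', CrawleeRequestData(**{'crawlDepth': depth + 1}))

# Attribute-based variant (preferred above): uses the Python field name directly,
# stays refactor-safe and type-checked, and records the depth even when Crawlee
# metadata already exists on the link.
link_user_data.crawlee = link_user_data.crawlee or CrawleeRequestData()
link_user_data.crawlee.crawl_depth = depth + 1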