
feat: Add max_crawl_depth property #637

Open · Prathamesh010 wants to merge 1 commit into master

Conversation

@Prathamesh010 (Contributor) commented Oct 31, 2024

Description

  • Implements "max crawl depth"

Issues

Testing

  • Added tests

Checklist

  • CI passed

@janbuchar (Collaborator) left a comment

This is pretty great! I found just some minor issues.

@@ -120,6 +120,9 @@ class BasicCrawlerOptions(TypedDict, Generic[TCrawlingContext]):
configure_logging: NotRequired[bool]
"""If True, the crawler will set up logging infrastructure automatically."""

max_crawl_depth: NotRequired[int | None]
"""Maximum crawl depth. If set, the crawler will stop crawling after reaching this depth."""
Collaborator:

We should probably elaborate on the edge cases - for example, with max_crawl_depth = 3, do we process three or four "levels" of links? I'd assume that the start requests have crawl_depth = 0, and then we go all the way up to 3 and don't enqueue any further links, but it would be much better to have that stated explicitly in the docs.

Contributor (Author):

Good point! Here’s a proposed docstring to make this clear:
Limits crawl depth from 0 (initial requests) up to the specified `max_crawl_depth`. Requests at the maximum depth are processed, but no further links are enqueued.

This would mean that, with max_crawl_depth = 3, requests will start at a crawl_depth of 0 and go up to 3, at which point new links won’t be enqueued. Does this align with what you had in mind, or are there any additional edge cases you’re concerned about?

Collaborator:

Yes, what you propose is perfect.

Comment on lines +184 to +189
if self._max_crawl_depth is not None and context.request.crawl_depth + 1 > self._max_crawl_depth:
context.log.info(
f'Skipping enqueue_links for URL "{context.request.url}" due to the maximum crawl depth limit.'
)
return
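
To make the arithmetic of this check concrete, here is a short trace with max_crawl_depth = 3 (an illustration, not part of the diff):

# crawl_depth 0 (start request): 0 + 1 = 1, not > 3  -> links are enqueued at depth 1
# crawl_depth 2:                 2 + 1 = 3, not > 3  -> links are enqueued at depth 3
# crawl_depth 3:                 3 + 1 = 4, > 3      -> enqueue_links returns without enqueueing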

Collaborator:

  1. The repetition in all crawlers is not great
    • it is very easy to overlook this when implementing a new crawler
  2. It is also possible to add requests to the queue via context.add_requests
    • ideally we should fill in a default depth for the requests - it doesn't make much sense to exempt some requests from max_crawl_depth
    • if there is no ergonomic way to support max depth in add_requests, we should make it clear in the docs that the user has to handle crawl depth manually

Contributor (Author):

Thank you for the feedback! I’d like to understand a bit more about the approach you have in mind.

When you mention “fill in a default depth for the requests,” could you elaborate on your suggestion? Are you envisioning a default of 0, or is there another baseline that would better ensure all requests adhere to max_crawl_depth?
Currently, all requests have a default crawl_depth of 0 if it is not set.

Regarding handling max depth in add_requests and avoiding repetition, should the check that keeps requests within max_crawl_depth go directly into the add_requests function of BasicCrawler?

Collaborator:

Yes, I think that the check could go directly to add_requests in BasicCrawler. Is there any potential issue?
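
A rough sketch of that idea follows; this is a hypothetical helper, not code from this PR, and it only assumes that each request object exposes the crawl_depth attribute already used in the hunk above:

from __future__ import annotations

from collections.abc import Sequence
from typing import Any


def filter_by_max_crawl_depth(requests: Sequence[Any], max_crawl_depth: int | None) -> list[Any]:
    """Drop requests whose crawl_depth already exceeds the configured limit.

    Doing this once inside BasicCrawler.add_requests would cover both
    enqueue_links and direct context.add_requests calls, instead of repeating
    the depth check in every concrete crawler.
    """
    if max_crawl_depth is None:
        return list(requests)
    return [request for request in requests if request.crawl_depth <= max_crawl_depth]

Requests added without explicit depth information would then simply keep the default crawl_depth of 0 mentioned above.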

Comment on lines +200 to +201
data = {'crawlDepth': context.request.crawl_depth + 1}
link_user_data.setdefault('__crawlee', CrawleeRequestData(**data))
Collaborator:

Suggested change
- data = {'crawlDepth': context.request.crawl_depth + 1}
- link_user_data.setdefault('__crawlee', CrawleeRequestData(**data))
+ link_user_data.crawlee = link_user_data.crawlee or CrawleeRequestData()
+ link_user_data.crawlee.crawl_depth = context.request.crawl_depth + 1

I'd prefer not fiddling with field aliases if possible.

Successfully merging this pull request may close these issues: Implement max crawl depth