feat: Add max_crawl_depth property #637

base: master
```diff
@@ -9,7 +9,7 @@
 from typing_extensions import Unpack

 from crawlee import EnqueueStrategy
-from crawlee._request import BaseRequestData
+from crawlee._request import BaseRequestData, CrawleeRequestData
 from crawlee._utils.blocked import RETRY_CSS_SELECTORS
 from crawlee._utils.urls import convert_to_absolute_url, is_url_absolute
 from crawlee.basic_crawler import BasicCrawler, BasicCrawlerOptions, ContextPipeline
```
```diff
@@ -181,6 +181,12 @@ async def enqueue_links(
         ) -> None:
             kwargs.setdefault('strategy', EnqueueStrategy.SAME_HOSTNAME)

+            if self._max_crawl_depth is not None and context.request.crawl_depth + 1 > self._max_crawl_depth:
+                context.log.info(
+                    f'Skipping enqueue_links for URL "{context.request.url}" due to the maximum crawl depth limit.'
+                )
+                return
+
```
Comment on lines +184 to +189

Thank you for the feedback! I'd like to understand a bit more about the approach you have in mind. When you mention "fill in default depth to the requests," could you elaborate on your suggestion? Are you envisioning a default of 0, or is there another baseline that would better ensure all requests adhere to `max_crawl_depth`? Regarding handling max depth in `add_requests` and avoiding repetition, would you like the check that keeps requests within `max_crawl_depth` to be included directly in the `add_requests` function of `BasicCrawler`?

Yes, I think that the check could go directly to `add_requests`.
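For context, here is a minimal, self-contained sketch of what centralizing the depth check in `add_requests` could look like. Everything here is an assumption drawn from the thread above, not the PR's final implementation: the reduced request shape, the `crawl_depth` default of 0 for start requests, and the helper name are all hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class SketchRequest:
    """Hypothetical stand-in for BaseRequestData, reduced to what the check needs."""

    url: str
    crawl_depth: int = 0  # start requests would default to depth 0, per the thread


def filter_by_max_depth(
    requests: list[SketchRequest], max_crawl_depth: int | None
) -> list[SketchRequest]:
    """Drop requests deeper than max_crawl_depth; None means no limit.

    Placing a check like this in add_requests would enforce the limit once,
    instead of repeating it in every enqueue_links implementation.
    """
    if max_crawl_depth is None:
        return requests
    return [request for request in requests if request.crawl_depth <= max_crawl_depth]


# With max_crawl_depth=3, a depth-3 request is still kept,
# but a depth-4 request is dropped.
kept = filter_by_max_depth(
    [SketchRequest('https://example.com/a', 3), SketchRequest('https://example.com/b', 4)],
    max_crawl_depth=3,
)
assert [request.url for request in kept] == ['https://example.com/a']
```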
```diff
             requests = list[BaseRequestData]()
             user_data = user_data or {}
```
```diff
@@ -191,6 +197,9 @@ async def enqueue_links(
                 if label is not None:
                     link_user_data.setdefault('label', label)

+                data = {'crawlDepth': context.request.crawl_depth + 1}
+                link_user_data.setdefault('__crawlee', CrawleeRequestData(**data))
```
Comment on lines +200 to +201

I'd prefer not fiddling with field aliases if possible.
```diff
                 if (url := link.attrs.get('href')) is not None:
                     url = url.strip()
```
We should probably elaborate on the edge cases - for example, with `max_crawl_depth = 3`, do we process three or four "levels" of links? I'd assume that the start requests have `crawl_depth = 0`, and then we go all the way up to 3 and don't enqueue any further links, but it would be much better to have that stated explicitly in the docs.

Good point! Here's a proposed docstring to make this clear:

> Limits crawl depth from 0 (initial requests) up to the specified `max_crawl_depth`. Requests at the maximum depth are processed, but no further links are enqueued.

This would mean that, with `max_crawl_depth = 3`, requests will start at a `crawl_depth` of 0 and go up to 3, at which point new links won't be enqueued. Does this align with what you had in mind, or are there any additional edge cases you're concerned about?

Yes, what you propose is perfect.
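As a closing illustration of the agreed semantics, here is a hedged usage sketch. It assumes `max_crawl_depth` is exposed as a crawler constructor option (as this PR proposes) and uses the module layout visible in the diff, so treat the exact API surface as provisional:

```python
import asyncio

from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler, BeautifulSoupCrawlingContext


async def main() -> None:
    # max_crawl_depth=3 is assumed to be accepted as a crawler option here.
    crawler = BeautifulSoupCrawler(max_crawl_depth=3)

    @crawler.router.default_handler
    async def handler(context: BeautifulSoupCrawlingContext) -> None:
        # Start requests have crawl_depth == 0; pages at depths 0 through 3
        # are all processed, but on a depth-3 page enqueue_links() logs a
        # message and returns without enqueueing, so depth 4 is never reached.
        await context.enqueue_links()

    await crawler.run(['https://crawlee.dev'])


asyncio.run(main())
```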