Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Add support for pre_navigation_hooks in PlaywrightCrawler #602

Closed

Conversation

Yashwanth-Chandrakumar
Copy link

Description

This implementation adds support for preNavigationHooks in the PlaywrightCrawler

Use the new pre_navigation_hooks:

`async def example_hook(context: PlaywrightCrawlingContext, goto_options: dict) -> None:
await context.page.evaluate("window.foo = 'bar';")
goto_options['timeout'] = 60000 # Set a custom timeout for navigation

crawler = PlaywrightCrawler(
# other options...
pre_navigation_hooks=[example_hook]
)`

Issues

#427

  • CI passed

@Yashwanth-Chandrakumar Yashwanth-Chandrakumar changed the title Added support for preNavigationHooks in Playwright feat: Added support for preNavigationHooks in Playwright Oct 18, 2024
@Yashwanth-Chandrakumar
Copy link
Author

@janbuchar Why is this not merged

@janbuchar
Copy link
Collaborator

@janbuchar Why is this not merged

  1. because it was submitted just before the weekend
  2. because we want to review the code first

We value your contribution, but please, keep your tone civil.

@B4nan B4nan changed the title feat: Added support for preNavigationHooks in Playwright feat: Add support for preNavigationHooks in Playwright Oct 21, 2024
@B4nan B4nan changed the title feat: Add support for preNavigationHooks in Playwright feat: Add support for pre_navigation_hooks in Playwright Oct 21, 2024
@B4nan B4nan changed the title feat: Add support for pre_navigation_hooks in Playwright feat: Add support for pre_navigation_hooks in PlaywrightCrawler Oct 21, 2024
@B4nan
Copy link
Member

B4nan commented Oct 21, 2024

And let me add a third reason, there are no tests nor documentation for this new feature.

Also, this option should be available for all crawler types (except BasicCrawler which is not making any requests on its own), not just playwright.

@@ -190,15 +201,13 @@ async def _handle_blocked_request(
if self._retry_on_blocked:
status_code = crawling_context.response.status

# Check if the session is blocked based on the HTTP status code.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please don't delete random comments.

from ._playwright_crawling_context import PlaywrightCrawlingContext
except ImportError as exc:
raise ImportError(
"To import anything from this subpackage, you need to install the 'playwright' extra."
"For example, if you use pip, run `pip install 'crawlee[playwright]'`.",
) from exc

__all__ = ['PlaywrightCrawler', 'PlaywrightCrawlingContext']
__all__ = ['PlaywrightCrawler', 'PlaywrightCrawlingContext', 'PlaywrightHook']
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the reason to export this type?

@@ -48,6 +49,7 @@ def __init__(
browser_pool: BrowserPool | None = None,
browser_type: BrowserType | None = None,
headless: bool | None = None,
pre_navigation_hooks: List[PlaywrightHook] | None = None,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a perfectly valid way of passing the hooks, but have you considered adding a decorator instead/as well? We did a similar change for request handlers because Python does not have anonymous functions (that can contain more than one statement).

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will work on adding a decorator also mate

Comment on lines +120 to +121
response=None,
enqueue_links=None, # We'll set this later
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a no-go. First, it won't pass the type checker. Second, nullable attributes lead to errors, mental overhead and hard to read code.

Instead, we should add one more parent class to PlaywrightCrawlingContext that only adds the page attribute. Then, we should split the _make_http_request into two stages - one that opens the page and executes prenavigation hooks, and second that does the navigation. The context pipeline (see line 78) should be responsible for chaining these.

Does that make sense?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes will work on that

@Yashwanth-Chandrakumar
Copy link
Author

@janbuchar Why is this not merged

  1. because it was submitted just before the weekend
  2. because we want to review the code first

We value your contribution, but please, keep your tone civil.

Hi Mate I am really sorry for what is said i really did not mean it that way.

@janbuchar
Copy link
Collaborator

Closing in favor of #631

@janbuchar janbuchar closed this Oct 31, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants