
Conversation


@Sjoeborg Sjoeborg commented Sep 15, 2025

Summary

Fixes #1481

List of files changed and why

browser_manager.py - Corrects the playwright_stealth import path and applies the stealth scripts to each page

How Has This Been Tested?

Tested locally on a CF-protected website and verified that the crawl succeeded
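For reference, a minimal sketch of that kind of local check, assuming enable_stealth is exposed on BrowserConfig (as the diff below suggests) and using a placeholder URL in place of the unnamed CF-protected site:

import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig

async def main():
    # enable_stealth exercises the playwright_stealth path fixed in this PR
    browser_config = BrowserConfig(
        browser_type="chromium",
        headless=True,
        enable_stealth=True,
    )
    async with AsyncWebCrawler(config=browser_config) as crawler:
        # Placeholder URL; the actual CF-protected site is not named in this PR
        result = await crawler.arun(url="https://example.com", config=CrawlerRunConfig())
        print(result.success)

if __name__ == "__main__":
    asyncio.run(main())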

Checklist:

  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have added/updated unit tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

Summary by CodeRabbit

  • New Features
    • Support connecting to managed/external browsers via CDP, with context reuse when configured.
    • Enhanced stealth protection applied at the browser context level for Chromium.
  • Refactor
    • Streamlined startup and launch argument handling for more consistent behavior across Chromium, Firefox, and WebKit.
    • Improved reliability and compatibility in headless and custom-flag scenarios.
    • No breaking changes; existing APIs remain unchanged.


coderabbitai bot commented Sep 15, 2025

Walkthrough

The startup path removes the Stealth wrapper, adds a CDP/managed-browser connection with context reuse, refactors browser argument construction, and moves stealth application to context initialization for Chromium. The flow now supports both CDP/managed and traditional launches, with docstring adjustments. No public APIs changed.
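As a rough illustration of that flow (a condensed sketch, not the actual implementation; attribute and method names such as use_managed_browser, cdp_url, _build_browser_args, and setup_context follow the review excerpts below):

from playwright.async_api import async_playwright

async def start(self):
    # Stealth wrapper removed: start Playwright directly
    self.playwright = await async_playwright().start()
    if self.config.use_managed_browser or self.config.cdp_url:
        # Connect to a managed/external browser over CDP
        self.browser = await self.playwright.chromium.connect_over_cdp(self.config.cdp_url)
        # Reuse the existing default context if present, otherwise create one
        if self.browser.contexts:
            self.default_context = self.browser.contexts[0]
        else:
            self.default_context = await self.browser.new_context()
    else:
        # Traditional launch: build args as a dict and unpack into launch()
        browser_args = self._build_browser_args()
        self.browser = await self.playwright.chromium.launch(**browser_args)
        self.default_context = await self.browser.new_context()
    # Stealth (Chromium only) is now applied during context setup, not at startup
    await self.setup_context(self.default_context)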

Changes

All changes are in crawl4ai/browser_manager.py:

  • Startup flow (Playwright init): Replaced the Stealth-wrapped async_playwright with a direct async_playwright().start().
  • CDP/Managed browser integration: Added a CDP connection path: start/retrieve the managed browser URL, connect via chromium.connect_over_cdp, reuse or create the default context, then run setup_context.
  • Browser args construction: Refactored _get_browser_args into _build_browser_args, which returns a dict (headless, args, channel, downloads_path, proxy), appends extra_args, and uses launch(**browser_args) across engines (see the sketch after this list).
  • Context-level stealth: Applied stealth at context setup for Chromium: import playwright_stealth, build Properties for Chrome, generate headers, and inject the stealth script via context.add_init_script; coexists with navigator_overrider.
  • Docs and flow comments: Updated the docstring to list stealth application as a step and clarified control-flow comments.
  • Public API: No changes to exported function/class signatures.
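To make the "Browser args construction" item above concrete, a minimal sketch of the dict shape (field names taken from the summary; the flag and config attributes are illustrative, and downloads_path is omitted because a review comment below flags it as a context-level concern):

def _build_browser_args(self) -> dict:
    # Base flags plus any user-supplied extra_args appended at the end
    args = ["--no-first-run"]
    args.extend(self.config.extra_args or [])
    browser_args = {
        "headless": self.config.headless,
        "args": args,
    }
    if getattr(self.config, "channel", None):
        browser_args["channel"] = self.config.channel  # e.g. "chrome" or "msedge"
    if getattr(self.config, "proxy_config", None):
        browser_args["proxy"] = self.config.proxy_config
    return browser_args

# Used the same way for every engine, e.g.:
#   self.browser = await self.playwright.chromium.launch(**browser_args)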

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant Client
  participant BrowserManager
  participant Playwright
  participant Chromium
  participant Context

  Client->>BrowserManager: start()
  alt use_managed_browser or cdp_url
    BrowserManager->>Playwright: async_playwright().start()
    BrowserManager->>Chromium: connect_over_cdp(cdp_url)
    BrowserManager->>Chromium: get or create default context
    Chromium-->>BrowserManager: context
  else local launch
    BrowserManager->>Playwright: async_playwright().start()
    BrowserManager->>Playwright: build browser_args
    BrowserManager->>Chromium: launch(**browser_args) / firefox / webkit
    BrowserManager->>Chromium: new_context()
    Chromium-->>BrowserManager: context
  end

  BrowserManager->>Context: setup_context()
  opt Chromium + enable_stealth
    BrowserManager->>Context: add_init_script(stealth_script)
    BrowserManager->>Context: set extra stealth headers
  end
  BrowserManager-->>Client: ready

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

Hoppity hop, I tweak and hide,
Scripts in the shadows, headers applied.
Through CDP tunnels I softly glide,
New args packed for a smoother ride.
A whisker’s wink—no errors now,
Stealthy paws, I take a bow. 🐇✨

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
  • Out of Scope Changes Check (⚠️ Warning): The raw summary shows additional modifications beyond the stealth/import fix — notably CDP/managed-browser integration, a browser-argument refactor, and startup flow changes — which are not referenced in the linked issue or PR description and therefore appear out of scope for the narrow bugfix requested in #1481. Resolution: either split the unrelated browser-management and argument-refactor changes into separate PR(s) or explicitly document and justify those broader changes in this PR and add targeted tests and CI verification; if those edits were unintentional, revert them so this PR only contains the stealth/import fix.
✅ Passed checks (4 passed)
  • Title Check (✅ Passed): The title "fix: playwright_stealth import and stealth application" succinctly and accurately summarizes the primary change in browser_manager.py described in the PR (correcting the playwright_stealth import and applying stealth scripts to contexts/pages) and is concise and specific enough for a quick scan of the history.
  • Linked Issues Check (✅ Passed): The changes described in the PR objectives and raw summary address the core coding requirements of issue #1481: the import path for playwright_stealth was corrected, and stealth scripts/headers are applied at the context/page level, which resolves the reported ImportError and implements the expected stealth behavior.
  • Description Check (✅ Passed): The PR description follows the repository template, providing a Summary (Fixes #1481), a list of files changed with rationale, a brief "How Has This Been Tested?" note, and a Checklist; however, the testing section is minimal (local manual verification) and several checklist items (docs, unit tests, style) remain unchecked, so the description is mostly complete but could use more test detail and completed checklist entries.
  • Docstring Coverage (✅ Passed): Docstring coverage is 100.00%, which is sufficient; the required threshold is 80.00%.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (5)
crawl4ai/browser_manager.py (5)

675-676: default_context is set to a Browser, not a BrowserContext (runtime breakage).

self.default_context = self.browser is incorrect. Hooks (e.g., on_browser_created) and downstream code expect a BrowserContext. Create one and apply setup.

-            self.default_context = self.browser
+            # Create a proper default BrowserContext and apply setup
+            self.default_context = await self.create_browser_context()
+            await self.setup_context(self.default_context, is_default=True)

729-734: downloads_path is not a valid Browser.launch arg; set it on the context instead.

Playwright expects download handling on the context (or persistent context), not on launch. Passing it here can raise “unexpected keyword” errors and currently pairs with a private _impl_obj hack later.

-        if self.config.accept_downloads:
-            browser_args["downloads_path"] = self.config.downloads_path or os.path.join(
-                os.getcwd(), "downloads"
-            )
-            os.makedirs(browser_args["downloads_path"], exist_ok=True)
+        # downloads_path belongs to BrowserContext (handled during context creation)

Follow-up in create_browser_context below.


796-803: Avoid private Playwright internals (_impl_obj._options); set download opts at creation.

Mutating _impl_obj is brittle and can break with Playwright updates. Move accept_downloads and downloads_path into new_context(...) options.

-        if self.config.accept_downloads:
-            context.set_default_timeout(DOWNLOAD_PAGE_TIMEOUT)
-            context.set_default_navigation_timeout(DOWNLOAD_PAGE_TIMEOUT)
-            if self.config.downloads_path:
-                context._impl_obj._options["accept_downloads"] = True
-                context._impl_obj._options[
-                    "downloads_path"
-                ] = self.config.downloads_path
+        if self.config.accept_downloads:
+            context.set_default_timeout(DOWNLOAD_PAGE_TIMEOUT)
+            context.set_default_navigation_timeout(DOWNLOAD_PAGE_TIMEOUT)

Add these during context creation (near Lines 915-924):

         context_settings = {
             "user_agent": user_agent,
             "viewport": viewport_settings,
             "proxy": proxy_settings,
-            "accept_downloads": self.config.accept_downloads,
+            "accept_downloads": self.config.accept_downloads,
             "storage_state": self.config.storage_state,
             "ignore_https_errors": self.config.ignore_https_errors,
             "device_scale_factor": 1.0,
             "java_script_enabled": self.config.java_script_enabled,
         }
+        if self.config.accept_downloads:
+            downloads_dir = self.config.downloads_path or os.path.join(os.getcwd(), "downloads")
+            os.makedirs(downloads_dir, exist_ok=True)
+            context_settings["downloads_path"] = downloads_dir

851-858: Potential NPE: headers may be None when deriving the User-Agent.

Accessing .get on None will crash.

-        user_agent = self.config.headers.get("User-Agent", self.config.user_agent) 
+        user_agent = (self.config.headers or {}).get("User-Agent", self.config.user_agent)

533-537: Incorrect localStorage iteration; potential data loss and page leak.

for k, v in kvs: is wrong; items are dicts {"name":..., "value":...}. Also, when dst.pages is empty, you create a page but never close it.

-        page = dst.pages[0] if dst.pages else await dst.new_page()
+        tmp_page_created = False
+        page = dst.pages[0] if dst.pages else None
+        if not page:
+            page = await dst.new_page()
+            tmp_page_created = True
         await page.goto(url, wait_until="domcontentloaded")
-        for k, v in kvs:
-            await page.evaluate("(k,v)=>localStorage.setItem(k,v)", k, v)
+        for item in kvs:
+            await page.evaluate("(k,v)=>localStorage.setItem(k,v)", item["name"], item["value"])
+        if tmp_page_created:
+            await page.close()
🧹 Nitpick comments (4)
crawl4ai/browser_manager.py (4)

652-652: Centralize Playwright startup (avoid duplication and lifecycle drift).

You're directly calling async_playwright().start() here while a classmethod get_playwright already encapsulates this logic (including undetected). Prefer using that to keep startup/teardown consistent.

-        self.playwright = await async_playwright().start()
+        self.playwright = await self.get_playwright(self.use_undetected)

804-812: Guard against None headers when merging.

self.config.headers can be None; combined_headers.update(self.config.headers) would raise.

-            combined_headers.update(self.config.headers)
+            if self.config.headers:
+                combined_headers.update(self.config.headers)

506-512: Return type mismatch (function returns a context but annotated as None).

clone_runtime_state returns dst, yet signature is -> None. Fix the annotation.

-async def clone_runtime_state(
+async def clone_runtime_state(
     src: BrowserContext,
     dst: BrowserContext,
     crawlerRunConfig: CrawlerRunConfig | None = None,
     browserConfig: BrowserConfig | None = None,
-) -> None:
+) -> BrowserContext:

1103-1151: Playwright lifecycle when using CDP: verify cleanup.

close() returns early when cdp_url is set, which skips self.playwright.stop(). If start() previously created a Playwright instance, this can leak. Ensure that when the manager owns the Playwright instance, it is stopped even under CDP.

A simple approach: track ownership via a flag (e.g., _owns_playwright) and stop it regardless of CDP, while still avoiding shutting down the external browser.
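One possible shape for that ownership flag (the attribute name _owns_playwright comes from the suggestion above; the surrounding structure is illustrative, not the project's actual code):

from playwright.async_api import async_playwright

async def start(self):
    if self.playwright is None:
        self.playwright = await async_playwright().start()
        # Remember that this manager created the instance and must stop it
        self._owns_playwright = True

async def close(self):
    if not self.config.cdp_url:
        # Only close browsers we launched ourselves; leave external/managed browsers running
        await self.browser.close()
    # Stop Playwright whenever we own it, even on the CDP path
    if getattr(self, "_owns_playwright", False) and self.playwright:
        await self.playwright.stop()
        self.playwright = None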

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e651e04 and 7baedaa.

📒 Files selected for processing (1)
  • crawl4ai/browser_manager.py (3 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
crawl4ai/browser_manager.py (2)
crawl4ai/async_crawler_strategy.py (2)
  • start (125-134)
  • start (2235-2247)
crawl4ai/js_snippet/__init__.py (1)
  • load_js_script (5-18)
🔇 Additional comments (2)
crawl4ai/browser_manager.py (2)

654-664: CDP managed path looks good.

Connecting over CDP, reusing the persistent context when present, and calling setup_context is the right flow. Nice.


833-834: Good: inject navigator_overrider early.

Using add_init_script ensures the script runs before any page scripts. This is the correct place.

Comment on lines +835 to 842
# Apply stealth headers and script to the context
if self.config.enable_stealth and not self.use_undetected and self.config.browser_type == "chromium":
from playwright_stealth import stealth
prop = stealth.Properties(browser_type=stealth.BrowserType.CHROME)
stealth_script = stealth.combine_scripts(prop, None)
await stealth.generate_stealth_headers_async(prop, context)
context.add_init_script(stealth_script)

Fix: await context.add_init_script and reload open pages in persistent contexts

  • Await add_init_script in the async flow — without await the init script may not be registered before pages are created.
  • add_init_script only runs for new/navigated documents; reload existing pages in persistent/connected contexts if you need stealth applied immediately.

File: crawl4ai/browser_manager.py (around lines 835–842)

         if self.config.enable_stealth and not self.use_undetected and self.config.browser_type == "chromium":
             from playwright_stealth import stealth
             prop = stealth.Properties(browser_type=stealth.BrowserType.CHROME)       
             stealth_script = stealth.combine_scripts(prop, None)
             await stealth.generate_stealth_headers_async(prop, context)
-            context.add_init_script(stealth_script)    
+            await context.add_init_script(stealth_script)

Optional: reload existing pages after add_init_script when using persistent/connected contexts:

# after add_init_script
for p in context.pages:
    try:
        await p.reload(wait_until="domcontentloaded")
    except Exception:
        pass

Confirm whether stealth must apply to already-open pages in persistent sessions or if "new pages only" is acceptable.

🤖 Prompt for AI Agents
In crawl4ai/browser_manager.py around lines 835-842, the init script is added
without awaiting context.add_init_script and existing pages in
persistent/connected contexts are not reloaded, so stealth may not be applied to
already-open pages; change to await context.add_init_script(...) to ensure the
script is registered before pages are created, and if using persistent/connected
contexts iterate context.pages and reload each page (catch and ignore
exceptions) so the init script runs on existing documents — optionally gate the
reloads behind a check that the context is persistent/connected if stealth on
already-open pages is not required.

@ntohidi ntohidi changed the base branch from main to develop September 18, 2025 05:23

Development

Successfully merging this pull request may close these issues.

[Bug]: ImportError when using enable_stealth=True
