-
-
Notifications
You must be signed in to change notification settings - Fork 5.5k
fix: playwright_stealth import and stealth application #1482
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: develop
Are you sure you want to change the base?
Conversation
WalkthroughThe startup path removes the Stealth wrapper, adds CDP/managed browser connection with context reuse, refactors browser argument construction, and moves stealth application to context initialization for Chromium. It updates flow to support both CDP/managed and traditional launches, with docstring adjustments. No public APIs changed. Changes
Sequence Diagram(s)sequenceDiagram
autonumber
participant Client
participant BrowserManager
participant Playwright
participant Chromium
participant Context
Client->>BrowserManager: start()
alt use_managed_browser or cdp_url
BrowserManager->>Playwright: async_playwright().start()
BrowserManager->>Chromium: connect_over_cdp(cdp_url)
BrowserManager->>Chromium: get or create default context
Chromium-->>BrowserManager: context
else local launch
BrowserManager->>Playwright: async_playwright().start()
BrowserManager->>Playwright: build browser_args
BrowserManager->>Chromium: launch(**browser_args) / firefox / webkit
BrowserManager->>Chromium: new_context()
Chromium-->>BrowserManager: context
end
BrowserManager->>Context: setup_context()
opt Chromium + enable_stealth
BrowserManager->>Context: add_init_script(stealth_script)
BrowserManager->>Context: set extra stealth headers
end
BrowserManager-->>Client: ready
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Possibly related issues
Poem
Pre-merge checks and finishing touches❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✨ Finishing touches
🧪 Generate unit tests
Tip 👮 Agentic pre-merge checks are now available in preview!Pro plan users can now enable pre-merge checks in their settings to enforce checklists before merging PRs.
Please see the documentation for more information. Example: reviews:
pre_merge_checks:
custom_checks:
- name: "Undocumented Breaking Changes"
mode: "warning"
instructions: |
Pass/fail criteria: All breaking changes to public APIs, CLI flags, environment variables, configuration keys, database schemas, or HTTP/GraphQL endpoints must be documented in the "Breaking Change" section of the PR description and in CHANGELOG.md. Exclude purely internal or private changes (e.g., code not exported from package entry points or explicitly marked as internal). Please share your feedback with us on this Discord post. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (5)
crawl4ai/browser_manager.py (5)
675-676
: default_context is set to a Browser, not a BrowserContext (runtime breakage).
self.default_context = self.browser
is incorrect. Hooks (e.g.,on_browser_created
) and downstream code expect a BrowserContext. Create one and apply setup.- self.default_context = self.browser + # Create a proper default BrowserContext and apply setup + self.default_context = await self.create_browser_context() + await self.setup_context(self.default_context, is_default=True)
729-734
:downloads_path
is not a valid Browser.launch arg; set it on the context instead.Playwright expects download handling on the context (or persistent context), not on
launch
. Passing it here can raise “unexpected keyword” errors and currently pairs with a private_impl_obj
hack later.- if self.config.accept_downloads: - browser_args["downloads_path"] = self.config.downloads_path or os.path.join( - os.getcwd(), "downloads" - ) - os.makedirs(browser_args["downloads_path"], exist_ok=True) + # downloads_path belongs to BrowserContext (handled during context creation)Follow-up in
create_browser_context
below.
796-803
: Avoid private Playwright internals (_impl_obj._options
); set download opts at creation.Mutating
_impl_obj
is brittle and can break with Playwright updates. Moveaccept_downloads
anddownloads_path
intonew_context(...)
options.- if self.config.accept_downloads: - context.set_default_timeout(DOWNLOAD_PAGE_TIMEOUT) - context.set_default_navigation_timeout(DOWNLOAD_PAGE_TIMEOUT) - if self.config.downloads_path: - context._impl_obj._options["accept_downloads"] = True - context._impl_obj._options[ - "downloads_path" - ] = self.config.downloads_path + if self.config.accept_downloads: + context.set_default_timeout(DOWNLOAD_PAGE_TIMEOUT) + context.set_default_navigation_timeout(DOWNLOAD_PAGE_TIMEOUT)Add these during context creation (near Lines 915-924):
context_settings = { "user_agent": user_agent, "viewport": viewport_settings, "proxy": proxy_settings, - "accept_downloads": self.config.accept_downloads, + "accept_downloads": self.config.accept_downloads, "storage_state": self.config.storage_state, "ignore_https_errors": self.config.ignore_https_errors, "device_scale_factor": 1.0, "java_script_enabled": self.config.java_script_enabled, } + if self.config.accept_downloads: + downloads_dir = self.config.downloads_path or os.path.join(os.getcwd(), "downloads") + os.makedirs(downloads_dir, exist_ok=True) + context_settings["downloads_path"] = downloads_dir
851-858
: Potential NPE: headers may be None when deriving the User-Agent.Accessing
.get
onNone
will crash.- user_agent = self.config.headers.get("User-Agent", self.config.user_agent) + user_agent = (self.config.headers or {}).get("User-Agent", self.config.user_agent)
533-537
: Incorrect localStorage iteration; potential data loss and page leak.
for k, v in kvs:
is wrong; items are dicts{"name":..., "value":...}
. Also, whendst.pages
is empty, you create a page but never close it.- page = dst.pages[0] if dst.pages else await dst.new_page() + tmp_page_created = False + page = dst.pages[0] if dst.pages else None + if not page: + page = await dst.new_page() + tmp_page_created = True await page.goto(url, wait_until="domcontentloaded") - for k, v in kvs: - await page.evaluate("(k,v)=>localStorage.setItem(k,v)", k, v) + for item in kvs: + await page.evaluate("(k,v)=>localStorage.setItem(k,v)", item["name"], item["value"]) + if tmp_page_created: + await page.close()
🧹 Nitpick comments (4)
crawl4ai/browser_manager.py (4)
652-652
: Centralize Playwright startup (avoid duplication and lifecycle drift).You're directly calling
async_playwright().start()
here while a classmethodget_playwright
already encapsulates this logic (including undetected). Prefer using that to keep startup/teardown consistent.- self.playwright = await async_playwright().start() + self.playwright = await self.get_playwright(self.use_undetected)
804-812
: Guard against None headers when merging.
self.config.headers
can be None;combined_headers.update(self.config.headers)
would raise.- combined_headers.update(self.config.headers) + if self.config.headers: + combined_headers.update(self.config.headers)
506-512
: Return type mismatch (function returns a context but annotated as None).
clone_runtime_state
returnsdst
, yet signature is-> None
. Fix the annotation.-async def clone_runtime_state( +async def clone_runtime_state( src: BrowserContext, dst: BrowserContext, crawlerRunConfig: CrawlerRunConfig | None = None, browserConfig: BrowserConfig | None = None, -) -> None: +) -> BrowserContext:
1103-1151
: Playwright lifecycle when using CDP: verify cleanup.
close()
returns early whencdp_url
is set, which skipsself.playwright.stop()
. Ifstart()
previously created a Playwright instance, this can leak. Ensure that when the manager owns the Playwright instance, it is stopped even under CDP.A simple approach: track ownership via a flag (e.g.,
_owns_playwright
) and stop it regardless of CDP, while still avoiding shutting down the external browser.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
crawl4ai/browser_manager.py
(3 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
crawl4ai/browser_manager.py (2)
crawl4ai/async_crawler_strategy.py (2)
start
(125-134)start
(2235-2247)crawl4ai/js_snippet/__init__.py (1)
load_js_script
(5-18)
🔇 Additional comments (2)
crawl4ai/browser_manager.py (2)
654-664
: CDP managed path looks good.Connecting over CDP, reusing the persistent context when present, and calling
setup_context
is the right flow. Nice.
833-834
: Good: injectnavigator_overrider
early.Using
add_init_script
ensures the script runs before any page scripts. This is the correct place.
# Apply stealth headers and script to the context | ||
if self.config.enable_stealth and not self.use_undetected and self.config.browser_type == "chromium": | ||
from playwright_stealth import stealth | ||
prop = stealth.Properties(browser_type=stealth.BrowserType.CHROME) | ||
stealth_script = stealth.combine_scripts(prop, None) | ||
await stealth.generate_stealth_headers_async(prop, context) | ||
context.add_init_script(stealth_script) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
💡 Verification agent
❓ Verification inconclusive
Two fixes for stealth application: await and coverage of existing pages.
- Missing
await
onadd_init_script
(async API); without it, stealth might not register before page creation. add_init_script
affects new documents only. If the connected persistent context already has open pages, consider reloading them or documenting that callers should open new pages.
if self.config.enable_stealth and not self.use_undetected and self.config.browser_type == "chromium":
from playwright_stealth import stealth
prop = stealth.Properties(browser_type=stealth.BrowserType.CHROME)
stealth_script = stealth.combine_scripts(prop, None)
await stealth.generate_stealth_headers_async(prop, context)
- context.add_init_script(stealth_script)
+ await context.add_init_script(stealth_script)
Optional: reload existing pages after applying stealth to persistent contexts:
# after add_init_script
for p in context.pages:
try:
await p.reload(wait_until="domcontentloaded")
except Exception:
pass
Please confirm whether stealth must apply to already-open pages in managed/persistent sessions or if “new pages only” is acceptable for your use case.
Fix: await context.add_init_script and reload open pages in persistent contexts
- Await add_init_script in the async flow — without await the init script may not be registered before pages are created.
- add_init_script only runs for new/navigated documents; reload existing pages in persistent/connected contexts if you need stealth applied immediately.
File: crawl4ai/browser_manager.py (around lines 835–842)
if self.config.enable_stealth and not self.use_undetected and self.config.browser_type == "chromium":
from playwright_stealth import stealth
prop = stealth.Properties(browser_type=stealth.BrowserType.CHROME)
stealth_script = stealth.combine_scripts(prop, None)
await stealth.generate_stealth_headers_async(prop, context)
- context.add_init_script(stealth_script)
+ await context.add_init_script(stealth_script)
Optional: reload existing pages after add_init_script when using persistent/connected contexts:
# after add_init_script
for p in context.pages:
try:
await p.reload(wait_until="domcontentloaded")
except Exception:
pass
Confirm whether stealth must apply to already-open pages in persistent sessions or if "new pages only" is acceptable.
🤖 Prompt for AI Agents
In crawl4ai/browser_manager.py around lines 835-842, the init script is added
without awaiting context.add_init_script and existing pages in
persistent/connected contexts are not reloaded, so stealth may not be applied to
already-open pages; change to await context.add_init_script(...) to ensure the
script is registered before pages are created, and if using persistent/connected
contexts iterate context.pages and reload each page (catch and ignore
exceptions) so the init script runs on existing documents — optionally gate the
reloads behind a check that the context is persistent/connected if stealth on
already-open pages is not required.
Summary
Fixes #1481
List of files changed and why
browser_manager.py - To correct the import path and apply the stealth scripts on each page
How Has This Been Tested?
Tested locally on a CF-protected website and verified that the crawl succeeded
Checklist:
Summary by CodeRabbit