
browser-scraping-benchmark

A very simple benchmark to find the fastest browser for local scraping. It is just the tip of the iceberg and doesn't cover a lot (production evaluation, more complex cases, ...).

Please keep the following in mind:

  • This project was created for learning purposes.

  • All tests were conducted on a MacBook Pro (M1 Pro, 32GB RAM).

  • The benchmarks cover only a few specific scenarios.

  • These results should not be used as a definitive reference for serious performance analysis.

  • The main focus is on Playwright (Python), with a little twist: pydoll.

  • The code isn't perfect (it never is). If you spot an issue or have an idea for improvement (especially for performance), please feel free to open an Issue or submit a Pull Request.

Main takeaways

results

How the benchmark works

The benchmark runs the complete set of tests --retries times. For each individual run, it measures the time taken for each step. Finally, the results are aggregated into mean values for each browser/headless combination.
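The aggregation step can be sketched like this (an illustrative sketch, not the project's actual code; the step names and timing values are made up):

```python
from statistics import mean

# Two hypothetical runs with per-step timings in seconds.
runs = [
    {"browser_launch": 0.25, "page_navigation": 1.5},
    {"browser_launch": 0.75, "page_navigation": 2.5},
]

# Reduce the runs to one mean value per step name.
means = {step: mean(r[step] for r in runs) for step in runs[0]}
print(means)  # {'browser_launch': 0.5, 'page_navigation': 2.0}
```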

The benchmark runs a simple, repeatable process to measure performance.

  1. Load Configs: When you start a run, the tool first reads test-sites.yaml (test samples). This file tells it which URLs to visit, what XPath selector to use for finding an element, and what the "correct" text should be. It also checks config.yaml to find the location of any custom browsers you're testing (you should adapt this file for your machine).

  2. Run Loops: The script loops through each site for the number of iterations you specified with --retries.

  3. Measure: The most important part is that every test cycle is broken down into five distinct, timed steps:

    • Browser Launch: Time to open the browser process.
    • New Page: Time to open a new blank tab.
    • Page Navigation: Time to load the target URL ( await page.wait_for_load_state("load") ).
    • Extract Content: Time to find the element, wait for it to become visible, and read its text ( await element_locator.wait_for(state="visible", timeout=15000) ).
    • Browser Close: Time to shut down the browser process.
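The shape of the per-run measurement can be sketched with a small timing harness (illustrative only, not the project's actual code; with Playwright the callables would be e.g. `lambda: p.chromium.launch(headless=True)`, `browser.new_page`, and so on, stubbed here with no-ops):

```python
import time

def time_steps(steps):
    """Run each named step and record its wall-clock duration in seconds."""
    timings = {}
    for name, fn in steps:
        start = time.perf_counter()
        fn()
        timings[name] = time.perf_counter() - start
    return timings

# The five steps of one benchmark cycle, stubbed with no-ops
# to show the shape of the result.
timings = time_steps([
    ("browser_launch", lambda: None),
    ("new_page", lambda: None),
    ("page_navigation", lambda: None),
    ("extract_content", lambda: None),
    ("browser_close", lambda: None),
])
print(timings)
```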

These browsers are tested in both headless and non-headless mode:

  • Chromium (playwright)
  • Firefox (playwright)
  • WebKit (playwright)
  • Installed (app) Helium
  • Installed (app) Brave
  • Installed (app) Arc
  • Chromium (pydoll)

Usage

git clone https://github.com/williambrach/browser-scraping-benchmark.git
cd browser-scraping-benchmark
uv venv venv
source venv/bin/activate
uv pip install -e .

Running the Benchmark

You can now run the benchmark using the browser-scraping-benchmark command.

browser-scraping-benchmark run [BROWSER] [OPTIONS]

[BROWSER] (Required)

The name of the browser to test.

| Browser Name | Description |
| --- | --- |
| pydoll | Runs the test using the pydoll library. |
| chromium | Runs the test using Playwright's bundled Chromium. |
| firefox | Runs the test using Playwright's bundled Firefox. |
| webkit | Runs the test using Playwright's bundled WebKit. |

Note: You can also use custom browser names (e.g., chrome-beta, arc) if you have defined them in your config.yaml.
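The exact schema of config.yaml is not documented here; a plausible sketch, assuming each entry maps a custom browser name to its executable path (the key names are illustrative and the paths are typical macOS locations you would adjust for your machine):

```yaml
# Hypothetical config.yaml sketch -- key names are illustrative.
browsers:
  brave:
    executable_path: "/Applications/Brave Browser.app/Contents/MacOS/Brave Browser"
  arc:
    executable_path: "/Applications/Arc.app/Contents/MacOS/Arc"
```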


Options

| Option | Description | Default |
| --- | --- | --- |
| --output <DIRECTORY> | (Required) Specifies the folder where the JSON result files will be saved. | N/A |
| --headless / --no-headless | (Optional) Toggles headless mode. | --headless |
| --retries <NUMBER> | (Optional) The number of times to run the test suite for each browser. The results will be aggregated. | 1 |
| --test-sites <FILE> | (Optional) Path to your test sites configuration. | test-sites.yaml |
| --config <FILE> | (Optional) Path to your browser executable configuration. | config.yaml |

Headless Mode Details

  • --headless (Default): Runs the browser without a visible UI window.
  • --no-headless: Runs the browser with a visible UI, which can be useful for debugging.

Example Usage

browser-scraping-benchmark run chromium --headless --retries 10 --output data/
