
browser-scraping-benchmark

A very simple benchmark to find the fastest browser for local scraping. It is just the tip of the iceberg and doesn't cover a lot (production evaluation, more complex cases, ...).

Please keep the following in mind:

  • This project was created for learning purposes.

  • All tests were conducted on a MacBook Pro (M1 Pro, 32GB RAM).

  • The benchmarks cover only a few specific scenarios.

  • These results should not be used as a definitive reference for serious performance analysis.

  • The main focus is on Playwright (Python), with a little twist: pydoll.

  • The code isn't perfect (it never is). If you spot an issue or have an idea for improvement (especially for performance), please feel free to open an Issue or submit a Pull Request.

Main takeaways

results

How the benchmark works

The benchmark runs the complete set of tests --retries times. For each individual run, it measures the time taken for each step. Finally, the results are aggregated into mean values for each browser/headless combination.
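The aggregation step can be sketched like this (an illustrative sketch, not the project's actual code; the step names and timing values are made up):

```python
from statistics import mean

# Two hypothetical runs with per-step timings in seconds.
runs = [
    {"browser_launch": 0.25, "page_navigation": 1.5},
    {"browser_launch": 0.75, "page_navigation": 2.5},
]

# Reduce the runs to one mean value per step name.
means = {step: mean(r[step] for r in runs) for step in runs[0]}
print(means)  # {'browser_launch': 0.5, 'page_navigation': 2.0}
```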

The benchmark runs a simple, repeatable process to measure performance.

  1. Load Configs: When you start a run, the tool first reads test-sites.yaml (test samples). This file tells it which URLs to visit, what XPath selector to use for finding an element, and what the "correct" text should be. It also checks config.yaml to find the location of any custom browsers you're testing (you should adapt this file for your machine).

  2. Run Loops: The script loops through each site for the number of iterations you specified with --retries.

  3. Measure: The most important part is that every test cycle is broken down into five distinct, timed steps:

    • Browser Launch: Time to open the browser process.
    • New Page: Time to open a new blank tab.
    • Page Navigation: Time to load the target URL ( await page.wait_for_load_state("load") ).
    • Extract Content: Time to find the element, wait for it to become visible, and read its text ( await element_locator.wait_for(state="visible", timeout=15000) ).
    • Browser Close: Time to shut down the browser process.
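The shape of the per-run measurement can be sketched with a small timing harness (illustrative only, not the project's actual code; with Playwright the callables would be e.g. `lambda: p.chromium.launch(headless=True)`, `browser.new_page`, and so on, stubbed here with no-ops):

```python
import time

def time_steps(steps):
    """Run each named step and record its wall-clock duration in seconds."""
    timings = {}
    for name, fn in steps:
        start = time.perf_counter()
        fn()
        timings[name] = time.perf_counter() - start
    return timings

# The five steps of one benchmark cycle, stubbed with no-ops
# to show the shape of the result.
timings = time_steps([
    ("browser_launch", lambda: None),
    ("new_page", lambda: None),
    ("page_navigation", lambda: None),
    ("extract_content", lambda: None),
    ("browser_close", lambda: None),
])
print(timings)
```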

These browsers are tested in both headless and non-headless mode:

  • Chromium (playwright)
  • Firefox (playwright)
  • WebKit (playwright)
  • Installed (app) Helium
  • Installed (app) Brave
  • Installed (app) Arc
  • Chromium (pydoll)

Usage

git clone https://github.com/williambrach/browser-scraping-benchmark.git
cd browser-scraping-benchmark
uv venv venv
source venv/bin/activate
uv pip install -e .

Running the Benchmark

You can now run the benchmark using the browser-scraping-benchmark command.

browser-scraping-benchmark run [BROWSER] [OPTIONS]

[BROWSER] (Required)

The name of the browser to test.

| Browser Name | Description |
| --- | --- |
| pydoll | Runs the test using the pydoll library. |
| chromium | Runs the test using Playwright's bundled Chromium. |
| firefox | Runs the test using Playwright's bundled Firefox. |
| webkit | Runs the test using Playwright's bundled WebKit. |

Note: You can also use custom browser names (e.g., chrome-beta, arc) if you have defined them in your config.yaml.
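The exact schema of config.yaml is not documented here; a plausible sketch, assuming each entry maps a custom browser name to its executable path (the key names are illustrative and the paths are typical macOS locations you would adjust for your machine):

```yaml
# Hypothetical config.yaml sketch -- key names are illustrative.
browsers:
  brave:
    executable_path: "/Applications/Brave Browser.app/Contents/MacOS/Brave Browser"
  arc:
    executable_path: "/Applications/Arc.app/Contents/MacOS/Arc"
```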


Options

| Option | Description | Default |
| --- | --- | --- |
| --output <DIRECTORY> | (Required) Specifies the folder where the JSON result files will be saved. | N/A |
| --headless / --no-headless | (Optional) Toggles headless mode. | --headless |
| --retries <NUMBER> | (Optional) The number of times to run the test suite for each browser. The results will be aggregated. | 1 |
| --test-sites <FILE> | (Optional) Path to your test sites configuration. | test-sites.yaml |
| --config <FILE> | (Optional) Path to your browser executable configuration. | config.yaml |

Headless Mode Details

  • --headless (Default): Runs the browser without a visible UI window.
  • --no-headless: Runs the browser with a visible UI, which can be useful for debugging.

Example Usage

browser-scraping-benchmark run chromium --headless --retries 10 --output data/
