This file provides guidance to programming agents when working with code in this repository.
All commands use uv (package manager) and poe (task runner):
# Install all dependencies (dev + extras + pre-commit + playwright)
uv run poe install-dev
# Run full check suite (lint + type-check + unit tests)
uv run poe check-code
# Linting (ruff format check + ruff check)
uv run poe lint
# Auto-fix formatting
uv run poe format
# Type checking (ty)
uv run poe type-check
# Run all unit tests
uv run poe unit-tests
# Run a single test file
uv run pytest tests/unit/path/to/test_file.py
# Run a single test by name
uv run pytest tests/unit/path/to/test_file.py::test_name -v
# Run tests with coverage XML report
uv run poe unit-tests-cov
# Build package
uv run poe build
# Clean build artifacts
uv run poe clean
Note: uv run poe unit-tests first runs tests marked @pytest.mark.run_alone in isolation, then runs the rest with -x (fail-fast) and parallelism via pytest-xdist.
- Linter/formatter: Ruff with select = ["ALL"] and specific ignores
- Line length: 120 characters
- Quotes: Single quotes (double for docstrings)
- Docstrings: Google format (enforced by Ruff)
- Type checker: ty (Astral's type checker), target Python 3.10
- Async mode: pytest-asyncio in auto mode (no need for @pytest.mark.asyncio)
- Commit format: Conventional Commits (feat:, fix:, docs:, refactor:, test:, etc.)
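As a small illustration of the auto mode mentioned above, a coroutine test can be written without any decorator. The sketch below uses a hypothetical fetch_title helper that is not part of the repository; it only shows the shape of such a test, using single quotes and Google-style docstrings.

```python
import asyncio


async def fetch_title() -> str:
    """Return a page title.

    Hypothetical helper used only for illustration; it is not part of the repository.
    """
    await asyncio.sleep(0)  # Stand-in for real asynchronous work.
    return 'Example Domain'


# In auto mode, pytest-asyncio collects plain `async def` tests,
# so no `@pytest.mark.asyncio` decorator is needed here.
async def test_fetch_title() -> None:
    """Check that the helper returns the expected title."""
    assert await fetch_title() == 'Example Domain'
```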
BasicCrawler[TCrawlingContext, TStatisticsState]
├── AbstractHttpCrawler → HttpCrawler, BeautifulSoupCrawler, ParselCrawler
├── PlaywrightCrawler
└── AdaptivePlaywrightCrawler (extends PlaywrightCrawler)
- BasicCrawler (src/crawlee/crawlers/_basic/): Core request lifecycle, autoscaling pool, retries, session management, router dispatch. Generic over TCrawlingContext.
- AbstractHttpCrawler (src/crawlee/crawlers/_abstract_http/): Adds HTTP client integration, response parsing, pre-navigation hooks. Generic over parser result type.
- PlaywrightCrawler (src/crawlee/crawlers/_playwright/): Browser-based crawling with Playwright.
Contexts are progressively enhanced through ContextPipeline middleware:
BasicCrawlingContext → HttpCrawlingContext → ParsedHttpCrawlingContext → BeautifulSoupCrawlingContext
Each middleware is an async generator that wraps the next handler, enabling setup/teardown around request processing.
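The async-generator middleware idea can be sketched with the standard library alone. This is a simplified illustration, not the actual ContextPipeline implementation: contexts are plain dicts, and the names run_pipeline, add_http_response, and add_parsed_title are hypothetical.

```python
import asyncio
import contextlib
from collections.abc import AsyncGenerator, Awaitable, Callable
from typing import Any

Context = dict[str, Any]
Middleware = Callable[[Context], AsyncGenerator[Context, None]]
Handler = Callable[[Context], Awaitable[None]]

events: list[str] = []  # Records the order in which setup, handler, and teardown run.


async def add_http_response(context: Context) -> AsyncGenerator[Context, None]:
    """Enhance the context with a response body (hard-coded for the sketch)."""
    events.append('http: setup')
    yield {**context, 'body': '<title>Hi</title>'}
    events.append('http: teardown')


async def add_parsed_title(context: Context) -> AsyncGenerator[Context, None]:
    """Enhance the context with a title parsed from the body."""
    events.append('parse: setup')
    title = context['body'].removeprefix('<title>').removesuffix('</title>')
    yield {**context, 'title': title}
    events.append('parse: teardown')


async def run_pipeline(middlewares: list[Middleware], context: Context, handler: Handler) -> None:
    """Run each middleware around the remaining middlewares and the final handler."""
    if not middlewares:
        await handler(context)
        return
    generator = middlewares[0](context)
    enhanced = await generator.__anext__()  # Runs the setup code up to the yield.
    try:
        await run_pipeline(middlewares[1:], enhanced, handler)
    finally:
        # Resume past the yield so the teardown code runs, even if the handler failed.
        with contextlib.suppress(StopAsyncIteration):
            await generator.__anext__()


async def handler(context: Context) -> None:
    """Final request handler; it sees the fully enhanced context."""
    events.append(f"handler: {context['title']}")


async def main() -> list[str]:
    """Run the two-stage pipeline once and return the recorded event order."""
    events.clear()
    await run_pipeline([add_http_response, add_parsed_title], {'url': 'https://example.com'}, handler)
    return list(events)
```

Calling asyncio.run(main()) shows that setup runs outermost-first and teardown runs in reverse, which is what lets each stage clean up after the handler completes.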
Three-tier design:
- High-level: Dataset, KeyValueStore, RequestQueue in src/crawlee/storages/
- Storage clients (src/crawlee/storage_clients/): FileSystemStorageClient (default), MemoryStorageClient, SqlStorageClient, RedisStorageClient
- Instance caching: StorageInstanceManager is a global singleton that caches storage instances by ID/name
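The instance-caching tier can be pictured as a lookup table in front of storage creation. The sketch below is an assumption-laden simplification: DatasetSketch and InstanceCacheSketch are hypothetical names, only caching by name is shown, and the real StorageInstanceManager may key and create instances differently.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class DatasetSketch:
    """Hypothetical stand-in for a high-level storage object."""

    name: str
    items: list[dict[str, Any]] = field(default_factory=list)


class InstanceCacheSketch:
    """Cache storage instances by name so repeated opens return the same object."""

    def __init__(self) -> None:
        self._by_name: dict[str, DatasetSketch] = {}

    def open(self, name: str) -> DatasetSketch:
        """Return the cached instance for `name`, creating it on first use."""
        if name not in self._by_name:
            self._by_name[name] = DatasetSketch(name=name)
        return self._by_name[name]
```

With this shape, two callers that open the same name share one object, so data pushed by one is visible to the other.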
src/crawlee/_service_locator.py is a global singleton managing Configuration, EventManager, StorageClient, and StorageInstanceManager. Prevents double-initialization with ServiceConflictError.
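A minimal sketch of the double-initialization guard follows. It assumes a simplified rule (a service cannot be replaced by a different object once it has been set or retrieved) and uses a plain dict in place of Configuration; ServiceLocatorSketch and this local ServiceConflictError class are illustrative stand-ins, not the library's code.

```python
class ServiceConflictError(Exception):
    """Raised when a different service is set after one is already in use (simplified stand-in)."""


class ServiceLocatorSketch:
    """Hold one configuration object and refuse conflicting replacements."""

    def __init__(self) -> None:
        self._configuration = None  # Nothing has been set or retrieved yet.

    def set_configuration(self, configuration: dict[str, str]) -> None:
        """Set the configuration, rejecting a second, different object."""
        if self._configuration is not None and self._configuration is not configuration:
            raise ServiceConflictError('Configuration is already in use.')
        self._configuration = configuration

    def get_configuration(self) -> dict[str, str]:
        """Return the configuration, creating an empty default on first access."""
        if self._configuration is None:
            self._configuration = {}
        return self._configuration
```

Setting the same object twice is harmless in this sketch; only a different object triggers the error.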
Pluggable via HttpClient interface in src/crawlee/http_clients/:
- ImpitHttpClient (default), HttpxHttpClient, CurlImpersonateHttpClient
- Each provides crawl() (for crawler pipeline) and send_request() (for in-handler use)
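To show what pluggable means in practice, here is a rough sketch of a two-method client interface with an in-memory implementation. The signatures are deliberately simplified assumptions (the real crawl() and send_request() take richer arguments and return response objects), and HttpClientSketch, CannedHttpClient, and fetch_body are hypothetical names.

```python
import asyncio
from abc import ABC, abstractmethod


class HttpClientSketch(ABC):
    """Simplified two-method interface; the real signatures are richer."""

    @abstractmethod
    async def crawl(self, url: str) -> str:
        """Fetch a page on behalf of the crawler pipeline."""

    @abstractmethod
    async def send_request(self, url: str, method: str = 'GET') -> str:
        """Send an ad-hoc request from inside a request handler."""


class CannedHttpClient(HttpClientSketch):
    """In-memory implementation that serves predefined bodies."""

    def __init__(self, pages: dict[str, str]) -> None:
        self._pages = pages

    async def crawl(self, url: str) -> str:
        return self._pages[url]

    async def send_request(self, url: str, method: str = 'GET') -> str:
        body = self._pages.get(url, '')
        return f'{method} {url}: {body}'


async def fetch_body(client: HttpClientSketch, url: str) -> str:
    """Depend only on the interface, so any implementation can be swapped in."""
    return await client.crawl(url)
```

Because fetch_body accepts any HttpClientSketch, swapping the backend means passing a different object, with no change to the calling code.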
Request (src/crawlee/_request.py) uses unique_key for deduplication. Lifecycle states: UNPROCESSED → DONE. Crawlee-specific metadata is stored in user_data['__crawlee'].
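Deduplication by unique_key can be sketched as a set of seen keys in front of a pending list. RequestSketch and DedupQueueSketch are hypothetical names, and the sketch makes no claim about how the real unique_key is derived from a URL.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RequestSketch:
    """Hypothetical stand-in for Request with only the fields needed here."""

    url: str
    unique_key: str


class DedupQueueSketch:
    """Accept a request only if its unique_key has not been seen before."""

    def __init__(self) -> None:
        self._seen: set[str] = set()
        self.pending: list[RequestSketch] = []

    def add(self, request: RequestSketch) -> bool:
        """Enqueue the request and return True, or return False for a duplicate."""
        if request.unique_key in self._seen:
            return False
        self._seen.add(request.unique_key)
        self.pending.append(request)
        return True
```

Two requests with different URLs but the same unique_key count as one in this sketch, which is the point of keying on unique_key rather than on the raw URL.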
@crawler.router.default_handler
async def handler(context: BeautifulSoupCrawlingContext): ...
@crawler.router.handler(label='detail')
async def detail(context: BeautifulSoupCrawlingContext): ...
Requests are routed by their label field; unmatched requests go to the default handler.
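The label-based dispatch described above can be approximated with a small decorator registry. This is a sketch under stated assumptions, not the library's Router: RouterSketch is a hypothetical name, requests are plain dicts, and handlers return a string only so the routing outcome is easy to observe.

```python
import asyncio
from collections.abc import Awaitable, Callable
from typing import Any, Optional

Handler = Callable[[dict[str, Any]], Awaitable[str]]


class RouterSketch:
    """Register handlers by label and fall back to a default handler."""

    def __init__(self) -> None:
        self._default: Optional[Handler] = None
        self._by_label: dict[str, Handler] = {}

    def default_handler(self, func: Handler) -> Handler:
        """Decorator that registers the fallback handler."""
        self._default = func
        return func

    def handler(self, label: str) -> Callable[[Handler], Handler]:
        """Decorator factory that registers a handler for one label."""

        def register(func: Handler) -> Handler:
            self._by_label[label] = func
            return func

        return register

    async def dispatch(self, request: dict[str, Any]) -> str:
        """Call the handler matching the request label, else the default handler."""
        func = self._by_label.get(request.get('label'), self._default)
        if func is None:
            raise RuntimeError('No default handler registered.')
        return await func(request)


router = RouterSketch()


@router.default_handler
async def default(request: dict[str, Any]) -> str:
    return f"default: {request['url']}"


@router.handler('detail')
async def detail(request: dict[str, Any]) -> str:
    return f"detail: {request['url']}"
```

A request with label 'detail' reaches detail; a request with an unknown label or no label at all falls through to default.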
- src/crawlee/crawlers/ - All crawler implementations
- src/crawlee/storages/ - Dataset, KVS, RequestQueue
- src/crawlee/storage_clients/ - Backend implementations
- src/crawlee/http_clients/ - HTTP client implementations
- src/crawlee/browsers/ - Playwright browser pool and plugins
- src/crawlee/sessions/ - Session management with cookie persistence
- src/crawlee/events/ - Event system (persist state, progress, aborting)
- src/crawlee/_autoscaling/ - Autoscaled pool for concurrency control
- src/crawlee/fingerprint_suite/ - Anti-bot fingerprint generation
- src/crawlee/project_template/ - CLI scaffolding template (excluded from linting)
- tests/unit/ - Unit tests
- tests/e2e/ - End-to-end tests (require apify-cli + API token)