A Scrapy-based web crawler designed to extract vehicle specifications
from the official BMW UK Used Cars portal.
The project features asynchronous data processing, automated cleaning
pipelines, and SQLite storage.
.
├── LICENSE
├── README.md # Project documentation
├── investigate.ipynb # Data analysis & research notebook
├── poetry.lock # Locked dependencies
├── pyproject.toml # Project configuration & dependencies
├── run.py # Main entry point for the spider
├── scrapy.cfg # Scrapy deployment configuration
└── usedcars/ # Main package directory
    ├── items.py # Data models
    ├── middlewares.py # Proxy and Header rotation logic
    ├── pipelines.py # Data validation, cleaning, and SQL insertion
    ├── settings.py # Scraper configurations
    ├── spiders/ # Spider implementations
    │   └── usedcars_bmw.py
    ├── sql/ # Database initialization scripts
    │   └── schema.sql
    └── utils.py # Helper functions
- Custom Item Pipeline: Validates required fields, cleans mileage data, and normalizes fuel types.
- Asynchronous Database Insertion: Uses twisted.enterprise.adbapi for non-blocking SQLite operations.
- Anti-Bot Measures: Implements random User-Agent rotation (proxy support is available but optional).
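
The cleaning steps named in the first feature can be illustrated with a minimal sketch. It is not the code in pipelines.py: the field names in REQUIRED_FIELDS, the alias table, and the function names are all hypothetical, and mileage is assumed to be a whole number of miles.

```python
import re
from typing import Optional

# Hypothetical field names; the real item schema lives in usedcars/items.py.
REQUIRED_FIELDS = ("model", "price", "mileage", "fuel_type")

# Hypothetical alias table mapping raw labels to one canonical spelling.
FUEL_ALIASES = {
    "petrol": "Petrol",
    "gasoline": "Petrol",
    "diesel": "Diesel",
    "electric": "Electric",
    "hybrid": "Hybrid",
    "plug-in hybrid": "Hybrid",
}


def clean_mileage(raw) -> Optional[int]:
    """Turn values such as '12,345 miles' into the integer 12345."""
    if raw is None:
        return None
    # Keep digits only; assumes mileage has no decimal part.
    digits = re.sub(r"\D", "", str(raw))
    return int(digits) if digits else None


def normalize_fuel(raw) -> str:
    """Map a raw fuel label to a canonical name, or 'Unknown'."""
    key = str(raw or "").strip().lower()
    return FUEL_ALIASES.get(key, "Unknown")


def clean_item(item: dict) -> dict:
    """Validate required fields, then return a cleaned copy of the item."""
    missing = [f for f in REQUIRED_FIELDS if item.get(f) in (None, "")]
    if missing:
        raise ValueError(f"missing required fields: {missing}")
    cleaned = dict(item)
    cleaned["mileage"] = clean_mileage(item["mileage"])
    cleaned["fuel_type"] = normalize_fuel(item["fuel_type"])
    return cleaned
```

Inside a Scrapy pipeline, logic like this would typically run in process_item, with scrapy.exceptions.DropItem raised in place of ValueError so that invalid items are discarded rather than stopping the crawl.
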
Ensure you have Poetry installed.
- Clone the repository:
git clone <repository-url>
cd usedcars_scraper
- Initialize the environment and install dependencies:
poetry install
You can launch the scraper using the provided entry point script or
the Scrapy CLI.
Using the run script (Recommended):
poetry run python run.py
Using Scrapy CLI:
poetry run scrapy crawl bmv_api
The project uses a .env file for sensitive data and runtime
configuration.
Run the following commands in your terminal to create and initialize
the .env file:
# Create the file
touch .env
# Add default configuration
cat <<EOF > .env
# Pagination depth
MAX_PAGE=5
# Database settings
SQLITE_DB=bmw_cars.db
# Proxies settings (comma-separated list)
# PROXY_LIST="http://user:pass@host:port,http://user:pass@host2:port"
PROXY_LIST=""
# Logging settings
LOG_LEVEL=INFO
LOG_STDOUT=0
EOF
*PROXY_LIST can be left empty for local startup.
| Variable | Description | Default |
|---|---|---|
| MAX_PAGE | Total number of pages to crawl from the API. | 5 |
| SQLITE_DB | Name of the SQLite database file created in the project root. | bmw_cars.db |
| PROXY_LIST | A comma-separated string of proxy URLs for rotation. | "" |
| LOG_LEVEL | Verbosity of logs (DEBUG, INFO, WARNING, ERROR). | INFO |
| LOG_STDOUT | If set to 1, redirects logs to the standard output. | 0 |
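
As a rough illustration of how these variables and their defaults could be read at startup, the sketch below parses them from the process environment. The function name load_settings is hypothetical and this is not the code in settings.py. It also assumes the .env values have already been exported into the environment, for example by a loader such as python-dotenv, which may or may not be what the project uses.

```python
import os
from typing import Mapping, Optional


def load_settings(env: Optional[Mapping[str, str]] = None) -> dict:
    """Read the documented variables, falling back to the documented defaults."""
    env = os.environ if env is None else env

    # PROXY_LIST is a comma-separated string; an empty value means no proxies.
    raw_proxies = env.get("PROXY_LIST", "").strip().strip('"')
    proxies = [p.strip() for p in raw_proxies.split(",") if p.strip()]

    return {
        "MAX_PAGE": int(env.get("MAX_PAGE", "5")),
        "SQLITE_DB": env.get("SQLITE_DB", "bmw_cars.db"),
        "PROXY_LIST": proxies,
        "LOG_LEVEL": env.get("LOG_LEVEL", "INFO").upper(),
        # Only the literal value "1" enables stdout logging.
        "LOG_STDOUT": env.get("LOG_STDOUT", "0") == "1",
    }


if __name__ == "__main__":
    print(load_settings())
```

Passing a plain dict instead of relying on os.environ keeps the parsing easy to check in isolation, for instance load_settings({"MAX_PAGE": "2"}).
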
This project is licensed under the MIT License - see the LICENSE
file for details.