Usedcars Scraper

A Scrapy-based web crawler designed to extract vehicle specifications
from the official BMW UK Used Cars portal.
The project features asynchronous data processing, automated cleaning
pipelines, and SQLite storage.

Project Structure

.
├── LICENSE
├── README.md                   # Project documentation
├── investigate.ipynb           # Data analysis & research notebook
├── poetry.lock                 # Locked dependencies
├── pyproject.toml              # Project configuration & dependencies
├── run.py                      # Main entry point for the spider
├── scrapy.cfg                  # Scrapy deployment configuration
└── usedcars/                   # Main package directory
    ├── items.py                # Data models
    ├── middlewares.py          # Proxy and Header rotation logic
    ├── pipelines.py            # Data validation, cleaning, and SQL insertion
    ├── settings.py             # Scraper configurations
    ├── spiders/                # Spider implementations
    │   └── usedcars_bmw.py
    ├── sql/                    # Database initialization scripts
    │   └── schema.sql
    └── utils.py                # Helper functions

Key Features

Custom Item Pipeline: Validates required fields, cleans mileage
data, and normalizes fuel types.
Asynchronous Database Insertion: Uses twisted.enterprise.adbapi
for non-blocking SQLite operations.
Anti-Bot Measures: Rotates the User-Agent header at random.
Proxy rotation is supported but optional.
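
The cleaning steps in the item pipeline can be pictured with a small standalone sketch. It is not the code in pipelines.py: the field names, the fuel-type aliases, and the helper names (clean_mileage, normalize_fuel, clean_item) are assumptions made for illustration only.

```python
import re

# Hypothetical required fields; the real list lives in pipelines.py / items.py.
REQUIRED_FIELDS = ("model", "price", "mileage", "fuel_type")

# Hypothetical alias table mapping raw labels to a normalized fuel type.
FUEL_ALIASES = {
    "petrol": "petrol",
    "gasoline": "petrol",
    "diesel": "diesel",
    "electric": "electric",
    "hybrid": "hybrid",
    "plug-in hybrid": "hybrid",
}


def clean_mileage(raw):
    """Convert values such as '12,345 miles' to the integer 12345.

    Assumes mileage is a whole number; separators and units are dropped.
    """
    if isinstance(raw, (int, float)):
        return int(raw)
    digits = re.sub(r"\D", "", str(raw))
    if not digits:
        raise ValueError(f"unparseable mileage: {raw!r}")
    return int(digits)


def normalize_fuel(raw):
    """Map a raw fuel label to a known type, falling back to 'other'."""
    return FUEL_ALIASES.get(str(raw).strip().lower(), "other")


def clean_item(item):
    """Validate required fields, then return a cleaned copy of the item."""
    missing = [f for f in REQUIRED_FIELDS if item.get(f) in (None, "")]
    if missing:
        raise ValueError(f"missing required fields: {missing}")
    cleaned = dict(item)
    cleaned["mileage"] = clean_mileage(item["mileage"])
    cleaned["fuel_type"] = normalize_fuel(item["fuel_type"])
    return cleaned
```

Inside a Scrapy pipeline, a function like clean_item would typically be called from process_item, with the ValueError replaced by scrapy.exceptions.DropItem so that invalid records are discarded rather than stored.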

Installation & Setup

Ensure you have Poetry installed.

Clone the repository:

git clone <repository-url>
cd usedcars_scraper

Initialize the environment and install dependencies:

poetry install

Usage

You can launch the scraper using the provided entry point script or
the Scrapy CLI.

Using the run script (Recommended):

poetry run python run.py

Using Scrapy CLI:

poetry run scrapy crawl bmv_api

Configuration

The project uses a .env file for sensitive data and runtime
configuration.

1. Create the environment file

Run the following commands in your terminal to create and initialize
the .env file:

# Create the file
touch .env

# Add default configuration
cat <<EOF > .env
# Pagination depth
MAX_PAGE=5

# Database settings
SQLITE_DB=bmw_cars.db

# Proxies settings (comma-separated list)
# PROXY_LIST="http://user:pass@host:port,http://user:pass@host2:port"
PROXY_LIST=""

# Logging settings
LOG_LEVEL=INFO
LOG_STDOUT=0
EOF

Note: PROXY_LIST can be left empty when running locally.
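
As a rough illustration of why an empty PROXY_LIST is safe, the sketch below shows a downloader-middleware-shaped class that always rotates the User-Agent but only sets a proxy when one is configured. The class name, the User-Agent strings, and the constructor arguments are assumptions; the actual logic in middlewares.py may differ. The only Scrapy conventions relied on are the process_request hook and the request.meta["proxy"] key.

```python
import random

# Hypothetical User-Agent pool; the project may use a different list.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0",
]


class RotatingIdentityMiddleware:
    """Illustrative middleware: random User-Agent, optional random proxy."""

    def __init__(self, proxies=(), user_agents=USER_AGENTS, rng=None):
        self.proxies = list(proxies)
        self.user_agents = list(user_agents)
        self.rng = rng or random.Random()

    def process_request(self, request, spider=None):
        # Always pick a fresh User-Agent for the outgoing request.
        request.headers["User-Agent"] = self.rng.choice(self.user_agents)
        # With an empty proxy list the request goes out directly.
        if self.proxies:
            request.meta["proxy"] = self.rng.choice(self.proxies)
        # Returning None lets Scrapy continue processing the request.
        return None
```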

2. Configuration Parameters

| Variable | Description | Default |
|----------|-------------|---------|
| `MAX_PAGE` | Total number of pages to crawl from the API. | `5` |
| `SQLITE_DB` | Name of the SQLite database file created in the project root. | `bmw_cars.db` |
| `PROXY_LIST` | A comma-separated string of proxy URLs for rotation. | `""` |
| `LOG_LEVEL` | Verbosity of logs (`DEBUG`, `INFO`, `WARNING`, `ERROR`). | `INFO` |
| `LOG_STDOUT` | If set to `1`, redirects logs to the standard output. | `0` |
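
The sketch below shows one way these variables might be read and converted to Python types, using the defaults listed above. It is an assumption about how settings.py could do this, not a copy of it: load_config is a hypothetical helper, and it expects the .env values to already be present in the process environment (how the project loads the .env file is not covered here).

```python
import os


def load_config(env=None):
    """Read the scraper settings from a mapping (defaults to os.environ)."""
    env = os.environ if env is None else env
    raw_proxies = env.get("PROXY_LIST", "")
    return {
        # Pagination depth as an integer.
        "MAX_PAGE": int(env.get("MAX_PAGE", "5")),
        # SQLite file name, relative to the project root.
        "SQLITE_DB": env.get("SQLITE_DB", "bmw_cars.db"),
        # Split the comma-separated string; an empty value yields [].
        "PROXY_LIST": [p.strip() for p in raw_proxies.split(",") if p.strip()],
        "LOG_LEVEL": env.get("LOG_LEVEL", "INFO").upper(),
        # Only the literal "1" enables logging to standard output.
        "LOG_STDOUT": env.get("LOG_STDOUT", "0") == "1",
    }
```

Passing a plain dictionary instead of relying on os.environ makes the parsing easy to check in isolation, for example load_config({"MAX_PAGE": "3"}).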

License

This project is licensed under the MIT License - see the LICENSE
file for details.
