Usedcars Scraper

A Scrapy-based web crawler designed to extract vehicle specifications
from the official BMW UK Used Cars portal.
The project features asynchronous data processing, automated cleaning
pipelines, and SQLite storage.

Project Structure

.
├── LICENSE
├── README.md                   # Project documentation
├── investigate.ipynb           # Data analysis & research notebook
├── poetry.lock                 # Locked dependencies
├── pyproject.toml              # Project configuration & dependencies
├── run.py                      # Main entry point for the spider
├── scrapy.cfg                  # Scrapy deployment configuration
└── usedcars/                   # Main package directory
    ├── items.py                # Data models
    ├── middlewares.py          # Proxy and Header rotation logic
    ├── pipelines.py            # Data validation, cleaning, and SQL insertion
    ├── settings.py             # Scraper configurations
    ├── spiders/                # Spider implementations
    │   └── usedcars_bmw.py
    ├── sql/                    # Database initialization scripts
    │   └── schema.sql
    └── utils.py                # Helper functions

Key Features

  • Custom Item Pipeline: Validates required fields, cleans mileage
    data, and normalizes fuel types.
  • Asynchronous Database Insertion: Uses twisted.enterprise.adbapi
    for non-blocking SQLite operations.
  • Anti-Bot Measures: Rotates the User-Agent header at random;
    proxy rotation is supported but optional.
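The validation and cleaning steps can be sketched roughly like this. The helper names and the fuel vocabulary below are illustrative assumptions, not a copy of the project's actual pipelines.py:

```python
import re

# Hypothetical fuel-type vocabulary; the real pipeline may use a
# different mapping.
FUEL_ALIASES = {
    "petrol": "Petrol",
    "gasoline": "Petrol",
    "diesel": "Diesel",
    "electric": "Electric",
    "hybrid": "Hybrid",
    "phev": "Hybrid",
}

def clean_mileage(raw):
    """Turn strings like '12,345 miles' into the integer 12345."""
    digits = re.sub(r"[^\d]", "", str(raw))
    if not digits:
        raise ValueError(f"no mileage found in {raw!r}")
    return int(digits)

def normalize_fuel(raw):
    """Map free-form fuel labels onto a fixed vocabulary."""
    key = str(raw).strip().lower()
    return FUEL_ALIASES.get(key, "Unknown")
```

In a Scrapy pipeline these helpers would typically run inside `process_item`, with a `DropItem` raised when a required field fails validation.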

Installation & Setup

Ensure you have Poetry installed.

  1. Clone the repository:

     git clone <repository-url>
     cd usedcars_scraper

  2. Initialize the environment and install dependencies:

     poetry install

Usage

You can launch the scraper using the provided entry point script or
the Scrapy CLI.

Using the run script (Recommended):

poetry run python run.py

Using Scrapy CLI:

poetry run scrapy crawl bmv_api

Configuration

The project uses a .env file for sensitive data and runtime
configuration.

1. Create the environment file

Run the following commands in your terminal to create and initialize
the .env file:

# Create the file
touch .env

# Add default configuration
cat <<EOF > .env
# Pagination depth
MAX_PAGE=5

# Database settings
SQLITE_DB=bmw_cars.db

# Proxies settings (comma-separated list)
# PROXY_LIST="http://user:pass@host:port,http://user:pass@host2:port"
PROXY_LIST=""

# Logging settings
LOG_LEVEL=INFO
LOG_STDOUT=0
EOF

Note: PROXY_LIST can be left empty for local runs.

2. Configuration Parameters

Variable   | Description                                                    | Default
-----------|----------------------------------------------------------------|------------
MAX_PAGE   | Total number of pages to crawl from the API.                   | 5
SQLITE_DB  | Name of the SQLite database file created in the project root.  | bmw_cars.db
PROXY_LIST | A comma-separated string of proxy URLs for rotation.           | ""
LOG_LEVEL  | Verbosity of logs (DEBUG, INFO, WARNING, ERROR).               | INFO
LOG_STDOUT | If set to 1, redirects logs to the standard output.            | 0
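A settings.py could consume these variables along the following lines. The variable names match the table above, but the parsing details (and the use of plain os.getenv rather than, say, python-dotenv) are assumptions, not the project's actual settings:

```python
import os

# Read runtime configuration from the environment, falling back to the
# documented defaults when a variable is unset.
MAX_PAGE = int(os.getenv("MAX_PAGE", "5"))
SQLITE_DB = os.getenv("SQLITE_DB", "bmw_cars.db")

# An empty PROXY_LIST yields an empty list, which disables proxy rotation.
PROXY_LIST = [p.strip() for p in os.getenv("PROXY_LIST", "").split(",") if p.strip()]

LOG_LEVEL = os.getenv("LOG_LEVEL", "INFO")
LOG_STDOUT = os.getenv("LOG_STDOUT", "0") == "1"
```

Loading the .env file itself into the environment would additionally require something like python-dotenv's `load_dotenv()` before these lines run.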

License

This project is licensed under the MIT License - see the LICENSE
file for details.
