UFC Stats Scraper

A robust and efficient web scraper built with Scrapy to collect comprehensive UFC (Ultimate Fighting Championship) statistics from UFC Stats. This project provides an automated way to gather detailed fight data, fighter statistics, and event information.

Features

  • Incremental Updates: Only scrapes new events that haven't been processed before
  • Robust Error Handling: Gracefully handles network issues and malformed data
  • Automatic Database Backups: Creates timestamped backups before processing new data
  • Rate Limiting: Respects the website's resources with configurable delays and concurrency
  • Comprehensive Data Collection: Captures detailed statistics for:
    • Events (date, location, name)
    • Fights (matchups, results, weight classes)
    • Fighters (stats, records, physical attributes)
    • Round-by-round statistics
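The timestamped-backup feature above could be sketched roughly as follows. This is a minimal illustration, not the project's actual code; the `ufc.db` filename and `backups/` directory are assumptions:

```python
import shutil
from datetime import datetime
from pathlib import Path

def backup_database(db_path="ufc.db", backup_dir="backups"):
    """Copy the SQLite file to a timestamped backup before scraping."""
    src = Path(db_path)
    if not src.exists():
        return None  # nothing to back up on the first run
    dest_dir = Path(backup_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    dest = dest_dir / f"{src.stem}_{stamp}{src.suffix}"
    shutil.copy2(src, dest)  # copy2 also preserves file metadata
    return dest
```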

Database Schema

The data is stored in SQLite with the following structure:

Tables

  • event: Event details (id, name, date, location)
  • fight: Fight details and results
  • fighter: Fighter profiles and statistics
  • round: Round-by-round statistics

Relationships

fight
|-- event (via id_event)
|-- fighter (via id_red / id_blue)
round
|-- fight (via id_fight)
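The four tables and the foreign keys shown above could be declared roughly like this. Only the columns and key names listed in this README are taken from the project; the remaining column names are illustrative assumptions:

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS event (
    id       TEXT PRIMARY KEY,
    name     TEXT,
    date     TEXT,
    location TEXT
);
CREATE TABLE IF NOT EXISTS fighter (
    id   TEXT PRIMARY KEY,
    name TEXT
);
CREATE TABLE IF NOT EXISTS fight (
    id           TEXT PRIMARY KEY,
    id_event     TEXT REFERENCES event(id),
    id_red       TEXT REFERENCES fighter(id),
    id_blue      TEXT REFERENCES fighter(id),
    weight_class TEXT,
    result       TEXT
);
CREATE TABLE IF NOT EXISTS round (
    id_fight     TEXT REFERENCES fight(id),
    round_number INTEGER,
    PRIMARY KEY (id_fight, round_number)
);
"""

def init_db(path=":memory:"):
    """Create (or open) the SQLite database with the schema above."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn
```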

Installation

  1. Clone the repository:
git clone git@github.com:Sylas4/ufc_scrapy.git
cd ufc_scrapy
  2. Install dependencies:
pip install -r requirements.txt

Usage

To start scraping:

scrapy crawl spider

The scraper will:

  1. Check for the most recent event in the database
  2. Fetch only new events that haven't been processed
  3. Create a backup of existing data
  4. Process and store new fight data
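Step 1 above, finding the most recent event already in the database, can be sketched with a single query. The column names follow the schema described earlier and are assumptions, not the project's actual code:

```python
import sqlite3

def latest_event_date(conn):
    """Return the date of the most recent stored event, or None if
    the database is empty, so the spider can skip older event pages."""
    row = conn.execute("SELECT MAX(date) FROM event").fetchone()
    return row[0]
```

With ISO-formatted date strings (`YYYY-MM-DD`), `MAX()` on the text column gives the chronologically latest event, and any event page dated on or before it can be skipped.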

Configuration

Key settings in settings.py:

  • DOWNLOAD_DELAY = 1: Delay between requests
  • CONCURRENT_REQUESTS = 8: Maximum concurrent requests
  • AUTOTHROTTLE_ENABLED = True: Automatic request rate control

Data Science Highlights

This project demonstrates several key data science and software engineering skills:

  • ETL Pipeline Development: Built a robust data extraction, transformation, and loading pipeline
  • Data Engineering: Designed and implemented a normalized SQLite database schema
  • Data Quality: Implemented comprehensive error handling and data validation
  • Incremental Processing: Optimized performance by only processing new or updated data
  • Version Control: Proper Git workflow with .gitignore and structured commits
  • Documentation: Clear code documentation and comprehensive README
  • Python Best Practices: Modular code design, PEP 8 compliance, and efficient data structures

Contact & Professional Info

I'm currently open to Data Scientist positions where I can leverage my skills in:

  • Data pipeline development
  • Statistical analysis
  • Data analysis and visualisation
  • Machine Learning
  • Python development

License

This project is licensed under the MIT License - see the LICENSE file for details.