A robust and efficient web scraper built with Scrapy to collect comprehensive UFC (Ultimate Fighting Championship) statistics from UFC Stats. This project provides an automated way to gather detailed fight data, fighter statistics, and event information.
- Incremental Updates: Only scrapes new events that haven't been processed before
- Robust Error Handling: Gracefully handles network issues and malformed data
- Automatic Database Backups: Creates timestamped backups before processing new data
- Rate Limiting: Respects the website's resources with configurable delays and concurrency
- Comprehensive Data Collection: Captures detailed statistics for:
- Events (date, location, name)
- Fights (matchups, results, weight classes)
- Fighters (stats, records, physical attributes)
- Round-by-round statistics
The data is stored in SQLite with the following structure:
- event: Event details (id, name, date, location)
- fight: Fight details and results
- fighter: Fighter profiles and statistics
- round: Round-by-round statistics
```
fight
|-- event (via id_event)
|-- fighter (via id_red / id_blue)

round
|-- fight (via id_fight)
```
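Based on the tables and keys above, the schema could be created roughly like this. The key columns (`id_event`, `id_red`, `id_blue`, `id_fight`) come from the relationships shown; all other column names are illustrative assumptions.

```python
import sqlite3

# Sketch of the four-table schema; non-key columns are assumptions.
SCHEMA = """
CREATE TABLE IF NOT EXISTS event (
    id_event TEXT PRIMARY KEY,
    name     TEXT NOT NULL,
    date     TEXT,
    location TEXT
);
CREATE TABLE IF NOT EXISTS fighter (
    id_fighter TEXT PRIMARY KEY,
    name       TEXT NOT NULL
);
CREATE TABLE IF NOT EXISTS fight (
    id_fight     TEXT PRIMARY KEY,
    id_event     TEXT NOT NULL REFERENCES event(id_event),
    id_red       TEXT REFERENCES fighter(id_fighter),
    id_blue      TEXT REFERENCES fighter(id_fighter),
    weight_class TEXT,
    result       TEXT
);
CREATE TABLE IF NOT EXISTS round (
    id_round     INTEGER PRIMARY KEY AUTOINCREMENT,
    id_fight     TEXT NOT NULL REFERENCES fight(id_fight),
    round_number INTEGER
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
```

Foreign keys mirror the tree above: each `fight` row points at one `event` and two `fighter` rows, and each `round` row points at one `fight`.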
- Clone the repository:

  ```bash
  git clone git@github.com:Sylas4/ufc_scrapy.git
  cd ufc_scrapy
  ```

- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```

To start scraping:

```bash
scrapy crawl spider
```

The scraper will:
- Check for the most recent event in the database
- Fetch only new events that haven't been processed
- Create a backup of existing data
- Process and store new fight data
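The incremental-update step (check the most recent stored event, then keep only newer ones) could look something like this. It is a sketch, not the project's actual code: the function names and the assumption that event dates are stored as ISO strings are illustrative.

```python
import sqlite3
from datetime import date

def latest_event_date(conn: sqlite3.Connection):
    """Return the most recent event date in the DB, or None if empty."""
    row = conn.execute("SELECT MAX(date) FROM event").fetchone()
    return date.fromisoformat(row[0]) if row and row[0] else None

def filter_new_events(conn: sqlite3.Connection, scraped_events):
    """Keep only scraped events strictly newer than anything already stored."""
    cutoff = latest_event_date(conn)
    if cutoff is None:
        return list(scraped_events)  # empty DB: everything is new
    return [e for e in scraped_events if date.fromisoformat(e["date"]) > cutoff]
```

Skipping already-stored events keeps re-runs cheap and avoids duplicate rows without needing to diff individual fights.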
Key settings in settings.py:
- `DOWNLOAD_DELAY = 1`: Delay between requests
- `CONCURRENT_REQUESTS = 8`: Maximum concurrent requests
- `AUTOTHROTTLE_ENABLED = True`: Automatic request rate control
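In `settings.py` these appear as plain module-level assignments. The three values below are the ones listed above; the two extra AutoThrottle settings are standard Scrapy options shown for context, and their values here are illustrative, not necessarily what this project uses.

```python
# settings.py (excerpt)
DOWNLOAD_DELAY = 1           # seconds to wait between requests to the same domain
CONCURRENT_REQUESTS = 8      # cap on simultaneous in-flight requests
AUTOTHROTTLE_ENABLED = True  # let Scrapy adapt the delay to server response times

# Illustrative AutoThrottle tuning (values are assumptions):
AUTOTHROTTLE_START_DELAY = 1.0         # initial download delay
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0  # average requests to send in parallel
```

With AutoThrottle enabled, `DOWNLOAD_DELAY` acts as a floor while Scrapy raises or lowers the effective delay based on observed latency.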
This project demonstrates several key data science and software engineering skills:
- ETL Pipeline Development: Built a robust data extraction, transformation, and loading pipeline
- Data Engineering: Designed and implemented a normalized SQLite database schema
- Data Quality: Implemented comprehensive error handling and data validation
- Incremental Processing: Optimized performance by only processing new or updated data
- Version Control: Proper Git workflow with .gitignore and structured commits
- Documentation: Clear code documentation and comprehensive README
- Python Best Practices: Modular code design, PEP 8 compliance, and efficient data structures
- 📧 Email: martin.lilian4@gmail.com
- 💼 LinkedIn: https://www.linkedin.com/in/lilianmartin4/
- 🌐 Portfolio: https://lilian-martin.streamlit.app/
I'm currently open to Data Scientist positions where I can leverage my skills in:
- Data pipeline development
- Statistical analysis
- Data analysis and visualisation
- Machine Learning
- Python development
This project is licensed under the MIT License - see the LICENSE file for details.