A robust and efficient web scraper built with Scrapy to collect comprehensive UFC (Ultimate Fighting Championship) statistics from UFC Stats. This project provides an automated way to gather detailed fight data, fighter statistics, and event information.
- Incremental Updates: Only scrapes new events that haven't been processed before
- Robust Error Handling: Gracefully handles network issues and malformed data
- Automatic Database Backups: Creates timestamped backups before processing new data
- Rate Limiting: Respects the website's resources with configurable delays and concurrency
- Comprehensive Data Collection: Captures detailed statistics for:
- Events (date, location, name)
- Fights (matchups, results, weight classes)
- Fighters (stats, records, physical attributes)
- Round-by-round statistics
The data is stored in SQLite with the following structure:
- event: Event details (id, name, date, location)
- fight: Fight details and results
- fighter: Fighter profiles and statistics
- round: Round-by-round statistics
```
fight
|-- event (via id_event)
|-- fighter (via id_red / id_blue)

round
|-- fight (via id_fight)
```
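Based on the tables and keys above, the schema could be created roughly like this. The key columns (`id_event`, `id_red`, `id_blue`, `id_fight`) come from the relationships shown; all other column names are illustrative assumptions.

```python
import sqlite3

# Sketch of the four-table schema; non-key columns are assumptions.
SCHEMA = """
CREATE TABLE IF NOT EXISTS event (
    id_event TEXT PRIMARY KEY,
    name     TEXT NOT NULL,
    date     TEXT,
    location TEXT
);
CREATE TABLE IF NOT EXISTS fighter (
    id_fighter TEXT PRIMARY KEY,
    name       TEXT NOT NULL
);
CREATE TABLE IF NOT EXISTS fight (
    id_fight     TEXT PRIMARY KEY,
    id_event     TEXT NOT NULL REFERENCES event(id_event),
    id_red       TEXT REFERENCES fighter(id_fighter),
    id_blue      TEXT REFERENCES fighter(id_fighter),
    weight_class TEXT,
    result       TEXT
);
CREATE TABLE IF NOT EXISTS round (
    id_round     INTEGER PRIMARY KEY AUTOINCREMENT,
    id_fight     TEXT NOT NULL REFERENCES fight(id_fight),
    round_number INTEGER
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
```

Foreign keys mirror the tree above: each `fight` row points at one `event` and two `fighter` rows, and each `round` row points at one `fight`.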
- Clone the repository:

  ```bash
  git clone git@github.com:Sylas4/ufc_scrapy.git
  cd ufc_scrapy
  ```

- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```

To start scraping:

```bash
scrapy crawl spider
```

The scraper will:
- Check for the most recent event in the database
- Fetch only new events that haven't been processed
- Create a backup of existing data
- Process and store new fight data
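The incremental-update step (check the most recent stored event, then keep only newer ones) could look something like this. It is a sketch, not the project's actual code: the function names and the assumption that event dates are stored as ISO strings are illustrative.

```python
import sqlite3
from datetime import date

def latest_event_date(conn: sqlite3.Connection):
    """Return the most recent event date in the DB, or None if empty."""
    row = conn.execute("SELECT MAX(date) FROM event").fetchone()
    return date.fromisoformat(row[0]) if row and row[0] else None

def filter_new_events(conn: sqlite3.Connection, scraped_events):
    """Keep only scraped events strictly newer than anything already stored."""
    cutoff = latest_event_date(conn)
    if cutoff is None:
        return list(scraped_events)  # empty DB: everything is new
    return [e for e in scraped_events if date.fromisoformat(e["date"]) > cutoff]
```

Skipping already-stored events keeps re-runs cheap and avoids duplicate rows without needing to diff individual fights.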
Key settings in settings.py:
- `DOWNLOAD_DELAY = 1`: Delay between requests
- `CONCURRENT_REQUESTS = 8`: Maximum concurrent requests
- `AUTOTHROTTLE_ENABLED = True`: Automatic request rate control
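In `settings.py` these appear as plain module-level assignments. The three values below are the ones listed above; the two extra AutoThrottle settings are standard Scrapy options shown for context, and their values here are illustrative, not necessarily what this project uses.

```python
# settings.py (excerpt)
DOWNLOAD_DELAY = 1           # seconds to wait between requests to the same domain
CONCURRENT_REQUESTS = 8      # cap on simultaneous in-flight requests
AUTOTHROTTLE_ENABLED = True  # let Scrapy adapt the delay to server response times

# Illustrative AutoThrottle tuning (values are assumptions):
AUTOTHROTTLE_START_DELAY = 1.0         # initial download delay
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0  # average requests to send in parallel
```

With AutoThrottle enabled, `DOWNLOAD_DELAY` acts as a floor while Scrapy raises or lowers the effective delay based on observed latency.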
This project demonstrates several key data science and software engineering skills:
- ETL Pipeline Development: Built a robust data extraction, transformation, and loading pipeline
- Data Engineering: Designed and implemented a normalized SQLite database schema
- Data Quality: Implemented comprehensive error handling and data validation
- Incremental Processing: Optimized performance by only processing new or updated data
- Version Control: Proper Git workflow with .gitignore and structured commits
- Documentation: Clear code documentation and comprehensive README
- Python Best Practices: Modular code design, PEP 8 compliance, and efficient data structures
- 📧 Email: martin.lilian4@gmail.com
- 💼 LinkedIn: https://www.linkedin.com/in/lilianmartin4/
- 🌐 Portfolio: https://lilian-martin.streamlit.app/
I'm currently open to Data Scientist positions where I can leverage my skills in:
- Data pipeline development
- Statistical analysis
- Data analysis and visualisation
- Machine Learning
- Python development
This project is licensed under the MIT License - see the LICENSE file for details.