A Python API for baseball data that works with three sources: MLBAM Gameday, Baseball Savant, and Retrosheet. It stores data using SQLAlchemy and returns it in pandas DataFrames.
Data use is always subject to the MLB Advanced Media license, the Retrosheet license, and this project's license.
The project currently works only in part. Expect continued updates and changes to the database structure and data models. If you're curious about using MLBAM Gameday data, the best source is pitchRx. It is written in R, so you'll need to learn some R, or find another option if Python is your thing.
The project has two simple main goals:
- Provide database storage for baseball data using SQLAlchemy.
- Serve this data back for analysis in pandas DataFrames.
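As a rough illustration of these two goals, the sketch below stores a few rows through SQLAlchemy and reads them back into a pandas DataFrame. The `pitches` table, its columns, the `query_to_frame` helper, and the sample values are all hypothetical and are not the project's actual schema or API. It assumes SQLAlchemy 1.4 or newer and uses an in-memory SQLite database as a stand-in.

```python
import pandas as pd
from sqlalchemy import (
    Column, Float, Integer, MetaData, String, Table, create_engine, select,
)

# Assumption: an in-memory SQLite database stands in for the real one.
engine = create_engine("sqlite://")
metadata = MetaData()

# Hypothetical table; the project's real data model may look quite different.
pitches = Table(
    "pitches",
    metadata,
    Column("id", Integer, primary_key=True),
    Column("pitcher_id", Integer, nullable=False),
    Column("pitch_type", String(4)),
    Column("start_speed", Float),
)
metadata.create_all(engine)

# Goal 1: store data through SQLAlchemy (made-up sample rows).
sample_rows = [
    {"pitcher_id": 1, "pitch_type": "FF", "start_speed": 94.1},
    {"pitcher_id": 1, "pitch_type": "SL", "start_speed": 86.0},
    {"pitcher_id": 2, "pitch_type": "FF", "start_speed": 95.3},
]
with engine.begin() as conn:
    conn.execute(pitches.insert(), sample_rows)


def query_to_frame(statement):
    """Run a SELECT statement and return the result as a DataFrame."""
    with engine.connect() as conn:
        result = conn.execute(statement)
        columns = list(result.keys())
        rows = [tuple(row) for row in result]
    return pd.DataFrame(rows, columns=columns)


# Goal 2: serve the stored data back for analysis as a DataFrame.
fastballs = query_to_frame(select(pitches).where(pitches.c.pitch_type == "FF"))
print(fastballs)
```

Building the DataFrame from fetched rows, rather than handing the engine to pandas directly, is one way to stay independent of which pandas and SQLAlchemy versions are installed together.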
Right now, I see three main paths to bring this project online: reading and storing downloaded data, creating the database structure, and serving queries in dataframes. The work breaks down as follows:
- Using Gameday XML data
  - Store XML files
  - Parse XML files
  - Update XML files
  - Format XML files for database insertion
  - Option to delete files after inserting into a database
- Using Baseball Savant data
  - Store and parse CSV files
  - Delete files after inserting into a database
  - Insert into database
  - Update Baseball Savant trajectory data
- Using Retrosheet data
  - Download event files
  - Parse event files using chadwick
    - Windows: include chadwick executables and call to parse
    - Mac: require installation via Homebrew: `brew install chadwick`
    - Linux: provide installation instructions
  - Store data in database
  - Update database with new data
  - Delete files after insertion
- Create and maintain database
  - Create database structure
  - Create database relationships
  - Create database from fresh install
  - Update database
  - Join different databases (MLBGameDay, Baseball Savant, Retrosheet)
- pandas integration
  - Serve initial queries into dataframes
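The Gameday steps listed above (parse the XML, then format it for database insertion) might look roughly like the following. This is only a sketch: the sample document is made up, the element and attribute names (`inning`, `atbat`, `pitch`, `start_speed`, and so on) are modeled on Gameday inning files but should be treated as assumptions, and `parse_pitches` is a hypothetical helper rather than part of this project.

```python
import xml.etree.ElementTree as ET

# Made-up sample shaped like a Gameday innings file (names are assumptions).
SAMPLE_XML = """
<game>
  <inning num="1">
    <top>
      <atbat num="1" batter="100001" pitcher="200001">
        <pitch id="3" type="B" pitch_type="FF" start_speed="93.4"/>
        <pitch id="4" type="S" pitch_type="SL" start_speed="85.0"/>
      </atbat>
    </top>
    <bottom>
      <atbat num="2" batter="100002" pitcher="200002">
        <pitch id="9" type="X" pitch_type="CH"/>
      </atbat>
    </bottom>
  </inning>
</game>
"""


def parse_pitches(xml_text):
    """Flatten the nested XML into one dict per pitch, ready for insertion."""
    root = ET.fromstring(xml_text.strip())
    rows = []
    for inning in root.iter("inning"):
        for half in inning:  # the <top> and <bottom> half-innings
            for atbat in half.iter("atbat"):
                for pitch in atbat.iter("pitch"):
                    speed = pitch.get("start_speed")
                    rows.append({
                        "inning": int(inning.get("num")),
                        "half": half.tag,
                        "atbat_num": int(atbat.get("num")),
                        "batter_id": int(atbat.get("batter")),
                        "pitcher_id": int(atbat.get("pitcher")),
                        "pitch_type": pitch.get("pitch_type"),
                        # Missing attributes become None (NULL in the database).
                        "start_speed": float(speed) if speed else None,
                    })
    return rows


pitch_rows = parse_pitches(SAMPLE_XML)
for row in pitch_rows:
    print(row)
```

A list of plain dictionaries is a convenient intermediate form here, since it can be passed to a bulk insert or to a DataFrame constructor without further reshaping.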
Being a Pythonista, I'm slightly jealous of the regularly updated pitchRx CRAN package. This project will hopefully provide an alternative for use in Python development.
While building out the initial functionality, I also hope to add support for:
- different database types (MySQL, PostgreSQL, etc.)
- external data such as travel distances, weather information, etc.
- OpenWAR, cFIP / DRA, or other advanced metrics
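Support for different database types is largely a matter of the connection URL in SQLAlchemy, so one possible approach is to keep the backend choice in configuration. The sketch below only parses URLs and does not connect to anything. The URLs, credentials, driver choices (`pymysql`, `psycopg2`), and the `describe_backend` helper are all hypothetical placeholders.

```python
from sqlalchemy.engine.url import make_url

# Hypothetical connection URLs; credentials and database names are placeholders.
DATABASE_URLS = {
    "sqlite": "sqlite:///baseball.db",
    "mysql": "mysql+pymysql://user:password@localhost/baseball",
    "postgresql": "postgresql+psycopg2://user:password@localhost/baseball",
}


def describe_backend(url_string):
    """Parse a SQLAlchemy URL and report which backend it points at."""
    url = make_url(url_string)
    return {
        "backend": url.get_backend_name(),
        "host": url.host,
        "database": url.database,
    }


for name, url_string in DATABASE_URLS.items():
    print(name, describe_backend(url_string))
```

Passing one of these URLs to `create_engine` would then select the backend, provided the corresponding database driver is installed.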
Thanks to MLB Advanced Media for making Gameday and PITCHf/x data public.
Thank you, Daren Willman, for creating Baseball Savant.
Many thanks to all those who support and add to Retrosheet!
The information used here was obtained free of charge from and is copyrighted by Retrosheet. Interested parties may contact Retrosheet at "www.retrosheet.org".