pybbda
is a package for
Python Baseball Data and Analysis.
pybbda
aims to provide a uniform framework for
accessing baseball data from various sources.
The data are exposed as pandas
DataFrames
The data sources it currently supports are:
-
Lahman data
-
Baseball Reference WAR
-
Fangraphs leaderboards and park factors
-
Retrosheet event data
-
Statcast pitch-by-pitch data
pybbda
also provides analysis tools.
It currently supports:
-
Marcel projections
-
Batted ball trajectories
-
Run expectancy via Markov chains
The following are planned for a future release:
-
Simulations
-
and more...!
This package is available on PyPI, so you can install it with
pip
,
$ pip install -U pybbda
Or you can install the latest master branch
directly from the github repo using
pip
,
$ pip install git+https://github.com/bdilday/pybbda.git
or download the source,
$ git clone git@github.com:bdilday/pybbda.git
$ cd pybbda
$ pip install .
This package explicitly
supports Python 3.6
andPython 3.7
. It aims
to support Python 3.8
but this is not guaranteed.
It explicitly does not support any versions
prior to Python 3.6
, includingPython 2.7
.
This package ships without any data. Instead it provides tools to fetch and store data from a variety of sources.
To install data you can use the update
tool in the pybbda.data.tools
sub-module.
Example,
$ python -m pybbda.data.tools.update -h
usage: update.py [-h] [--data-root DATA_ROOT] --data-source
{Lahman,BaseballReference,Fangraphs,retrosheet,all} [--make-dirs]
[--overwrite] [--min-year MIN_YEAR] [--max-year MAX_YEAR]
[--num-threads NUM_THREADS]
optional arguments:
-h, --help show this help message and exit
--data-root DATA_ROOT
Root directory for data storage
--data-source {Lahman,BaseballReference,Fangraphs,retrosheet,all}
Update source
--make-dirs Make root dir if does not exist
--overwrite Overwrite files if they exist
--min-year MIN_YEAR Min year to download
--max-year MAX_YEAR Max year to download
--num-threads NUM_THREADS
Number of threads to use for downloads
The data will be downloaded to --data-root
, which defaults to the
PYBBDA_DATA_ROOT
Detailed instructions are provided in the documentation
After installing some or all of the data, you can start using the package.
Following is an example of accessing Lahman data. More examples are included in the documentation
>>> from pybbda.data import LahmanData
>>> lahman_data = LahmanData()
>>> batting_df= lahman_data.batting
INFO:pybbda.data.sources.lahman.data:data:searching for file /home/bdilday/.pybbda/data/Lahman/Batting.csv
>>> batting_df.head()
playerID yearID stint teamID lgID G AB R H 2B 3B HR RBI SB CS BB SO IBB HBP SH SF GIDP
0 abercda01 1871 1 TRO NaN 1 4 0 0 0 0 0 0.0 0.0 0.0 0 0.0 NaN NaN NaN NaN 0.0
1 addybo01 1871 1 RC1 NaN 25 118 30 32 6 0 0 13.0 8.0 1.0 4 0.0 NaN NaN NaN NaN 0.0
2 allisar01 1871 1 CL1 NaN 29 137 28 40 4 5 0 19.0 3.0 1.0 2 5.0 NaN NaN NaN NaN 1.0
3 allisdo01 1871 1 WS3 NaN 27 133 28 44 10 2 2 27.0 1.0 1.0 0 2.0 NaN NaN NaN NaN 0.0
4 ansonca01 1871 1 RC1 NaN 25 120 29 39 11 3 0 16.0 6.0 2.0 2 1.0 NaN NaN NaN NaN 0.0
>>> batting_df.groupby("playerID").HR.sum().sort_values(ascending=False)
playerID
bondsba01 762
aaronha01 755
ruthba01 714
rodrial01 696
mayswi01 660
...
mcconra01 0
mccolal01 0
mccluse01 0
mcclula01 0
aardsda01 0
Name: HR, Length: 19689, dtype: int64
There is a cli tool for computing run expectancies from Markov chains.
$ python -m pybbda.analysis.run_expectancy.markov.cli --help
This Markov chain uses a lineup of 9 batters instead of assuming each batter has the same characteristics. You can also assign running probabilities, although they apply to all batters equally.
You can assign batting-event probabilities using a sequence of
probabilities, or by referencing a player-season with the
format {playerID}_{season}
, where playerID is the
Lahman ID and season is a 4-digit year. For example, to
refer to Rickey Henderson's 1982 season, use henderi01_1982
.
The lineup is assigned by giving the lineup slot followed by either 5 probabilities, or a player-season id. The lineup-slot 0 is a code to assign all nine batters to this value. Any other specific slots will be filled in as noted.
The number of outs to model is 3 by default. It can be changed by setting the
environment variable PYBBDA_MAX_OUTS
.
Example: Use a default set of probabilities for all 9 slots with no taking extra bases
$ python -m pybbda.analysis.run_expectancy.markov.cli -b 0 0.08 0.15 0.05 0.005 0.03 --running-probs 0 0 0 0
mean score per 27 outs = 3.5227
std. score per 27 outs = 2.8009
Example: Use a default set of probabilities for all 9 slots with default probabilities for taking extra bases
$ python -m pybbda.analysis.run_expectancy.markov.cli -b 0 0.08 0.15 0.05 0.005 0.03
mean score per 27 outs = 4.2242
std. score per 27 outs = 3.0161
Example: Use a default set of probabilities for all 9 slots but let Rickey Henderson 1982 bat leadoff (using 27 outs, instead of 3)
$ PYBBDA_MAX_OUTS=27 python -m pybbda.analysis.run_expectancy.markov.cli -b 0 0.08 0.15 0.05 0.005 0.03 -i 1 henderi01_1982
WARNING:pybbda:__init__:Environment variable PYBBDA_DATA_ROOT is not set, defaulting to /home/bdilday/github/pybbda/pybbda/data/assets
INFO:pybbda.data.sources.lahman.data:data:searching for file /home/bdilday/github/pybbda/pybbda/data/assets/Lahman/Batting.csv
mean score per 27 outs = 4.3628
std. score per 27 outs = 3.0999
Example: Use a default set of probabilities for all 9 slots but let Rickey Henderson 1982 bat leadoff and Babe Ruth 1927 bat clean-up (using 27 outs, instead of 3)
$ PYBBDA_MAX_OUTS=27 python -m pybbda.analysis.run_expectancy.markov.cli -b 0 0.08 0.15 0.05 0.005 0.03 -i 1 henderi01_1982 -i 4 ruthba01_1927
WARNING:pybbda:__init__:Environment variable PYBBDA_DATA_ROOT is not set, defaulting to /home/bdilday/github/pybbda/pybbda/data/assets
INFO:pybbda.data.sources.lahman.data:data:searching for file /home/bdilday/github/pybbda/pybbda/data/assets/Lahman/Batting.csv
mean score per 27 outs = 5.1420
std. score per 27 outs = 3.3996
Contributions from the community are welcome. See the contributing guide.