Updates requirements (#30)
* updates requirements and retrosheet functionality and tests

* bumps version to 0.1.2

* updates CI
bdilday authored Jun 13, 2020
1 parent 06afc34 commit e1428e6
Showing 11 changed files with 197 additions and 22 deletions.
3 changes: 3 additions & 0 deletions .circleci/config.yml
@@ -22,6 +22,9 @@ jobs:
- run:
command: python -m pybaseballdatana.data.tools.update --data-source Fangraphs --make-dirs --min-year 2019 --max-year 2019 --num-threads 2
name: Update Fangraphs
- run:
command: python -m pybaseballdatana.data.tools.update --min-year 1982 --max-year 1982 --data-source retrosheet
name: Update Retrosheet Events
- run:
command: make lint
name: Lint
67 changes: 61 additions & 6 deletions README.md
@@ -17,11 +17,12 @@ The data sources it currently supports are:

* Fangraphs leaderboards

* Retrosheet event data

The following are planned for a future release:

* Statcast play-by-play data

* Retrosheet event data

### analysis

@@ -43,21 +44,27 @@ The following are planned for a future release:

## Installation

This package is available on github.
This package is available on PyPI, so you can install it with
`pip`,

```bash
$ pip install -U pybbda
```

You can install directly from the repo using
Or you can install the latest master branch
directly from the GitHub repo using
`pip`,

```python
pip install git+https://github.com/bdilday/pybbda.git
```bash
$ pip install git+https://github.com/bdilday/pybbda.git
```

or download the source,

```bash
$ git clone git@github.com:bdilday/pybbda.git
$ cd pybbda
$ pip install -e .
$ pip install .
```

### Requirements
@@ -199,6 +206,10 @@ INFO:pybaseballdatana.data.sources.fangraphs._update:_update:saving file to /hom
INFO:pybaseballdatana.data.sources.fangraphs._update:_update:saving file to /home/bdilday/.pybbda/data/Fangraphs/fg_pit_2019.csv.gz
```
Example for Retrosheet event data. This downloads the event data;
passing `--create-event-database` also stores it in a `sqlite` database, located in `--data-root`
Example to download all sources,
```bash
@@ -335,6 +346,50 @@ INFO:pybaseballdatana.data.sources.baseball_reference.data:data:searching for fi
[6 rows x 323 columns]
```
### Retrosheet events
Load a data frame from a team ID
```python
>>> from pybaseballdatana.data import RetrosheetData
>>> retrosheet_data = RetrosheetData()
>>> retrosheet_data.df_from_team_id("1982OAK")
GAME_ID AWAY_TEAM_ID INN_CT BAT_HOME_ID OUTS_CT BALLS_CT ... ASS7_FLD_CD ASS8_FLD_CD ASS9_FLD_CD ASS10_FLD_CD UNKNOWN_OUT_EXC_FL UNCERTAIN_PLAY_EXC_FL
0 OAK198204060 CAL 1 0 0 0 ... 0 0 0 0 F F
1 OAK198204060 CAL 1 0 1 0 ... 0 0 0 0 F F
2 OAK198204060 CAL 1 0 2 0 ... 0 0 0 0 F F
3 OAK198204060 CAL 1 1 0 0 ... 0 0 0 0 F F
4 OAK198204060 CAL 1 1 0 0 ... 0 0 0 0 F F
... ... ... ... ... ... ... ... ... ... ... ... ... ...
6204 OAK198209260 KCA 8 1 2 0 ... 0 0 0 0 F F
6205 OAK198209260 KCA 8 1 2 0 ... 0 0 0 0 F F
6206 OAK198209260 KCA 9 0 0 0 ... 0 0 0 0 F F
6207 OAK198209260 KCA 9 0 1 0 ... 0 0 0 0 F F
6208 OAK198209260 KCA 9 0 2 0 ... 0 0 0 0 F F
[6209 rows x 159 columns]
```
Load a data frame from a URL
```python
>>> retrosheet_data.df_from_file("https://raw.githubusercontent.com/chadwickbureau/retrosheet/master/event/regular/1982OAK.EVA")
GAME_ID AWAY_TEAM_ID INN_CT BAT_HOME_ID OUTS_CT BALLS_CT ... ASS7_FLD_CD ASS8_FLD_CD ASS9_FLD_CD ASS10_FLD_CD UNKNOWN_OUT_EXC_FL UNCERTAIN_PLAY_EXC_FL
0 OAK198204060 CAL 1 0 0 0 ... 0 0 0 0 F F
1 OAK198204060 CAL 1 0 1 0 ... 0 0 0 0 F F
2 OAK198204060 CAL 1 0 2 0 ... 0 0 0 0 F F
3 OAK198204060 CAL 1 1 0 0 ... 0 0 0 0 F F
4 OAK198204060 CAL 1 1 0 0 ... 0 0 0 0 F F
... ... ... ... ... ... ... ... ... ... ... ... ... ...
6204 OAK198209260 KCA 8 1 2 0 ... 0 0 0 0 F F
6205 OAK198209260 KCA 8 1 2 0 ... 0 0 0 0 F F
6206 OAK198209260 KCA 9 0 0 0 ... 0 0 0 0 F F
6207 OAK198209260 KCA 9 0 1 0 ... 0 0 0 0 F F
6208 OAK198209260 KCA 9 0 2 0 ... 0 0 0 0 F F
[6209 rows x 159 columns]
```
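For orientation, the `GAME_ID` values in the output above follow Retrosheet's convention: a three-letter home-team code, the date as `YYYYMMDD`, and a one-digit game number (`0` for a single game, `1` or `2` for doubleheader games). A minimal sketch of splitting one apart; `GameId` and `parse_game_id` are hypothetical helpers for illustration, not part of pybbda:

```python
from datetime import date
from typing import NamedTuple


class GameId(NamedTuple):
    home_team: str
    game_date: date
    game_number: int


def parse_game_id(game_id: str) -> GameId:
    """Split a Retrosheet game ID such as 'OAK198204060' into its parts."""
    if len(game_id) != 12:
        raise ValueError(f"expected 12 characters, got {game_id!r}")
    home_team = game_id[:3]
    year, month, day = int(game_id[3:7]), int(game_id[7:9]), int(game_id[9:11])
    return GameId(home_team, date(year, month, day), int(game_id[11]))


first_game = parse_game_id("OAK198204060")
print(first_game)
```

The same split applies to the `GAME_ID` column of the returned data frame, for example to derive a date column before grouping.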
### Marcel projections
```python
2 changes: 1 addition & 1 deletion pybaseballdatana/__init__.py
@@ -5,7 +5,7 @@
logging.basicConfig(format="%(levelname)s:%(name)s:%(module)s:%(message)s")
logger = logging.getLogger("pybaseballdatana")

_version = "0.1.1"
_version = "0.1.2"

PYBBDA_LOG_LEVEL_NAME = os.environ.get("PYBBDA_LOG_LEVEL", "")
_PYBBDA_LOG_LEVEL_MAP = {
8 changes: 4 additions & 4 deletions pybaseballdatana/data/__init__.py
@@ -6,10 +6,10 @@

from pandas import Int32Dtype

from .sources.lahman.data import LahmanData
from .sources.baseball_reference.data import BaseballReferenceData
from .sources.retrosheet.data import RetrosheetData
from .sources.fangraphs.data import FangraphsData
from pybaseballdatana.data.sources.lahman.data import LahmanData
from pybaseballdatana.data.sources.baseball_reference.data import BaseballReferenceData
from pybaseballdatana.data.sources.retrosheet.data import RetrosheetData
from pybaseballdatana.data.sources.fangraphs.data import FangraphsData

nullable_int = Int32Dtype()

20 changes: 14 additions & 6 deletions pybaseballdatana/data/sources/retrosheet/_update.py
@@ -6,6 +6,7 @@
import pathlib
import re

from tqdm import tqdm
import logging
from pybaseballdatana.data.sources.retrosheet.data import RetrosheetData

@@ -41,13 +42,22 @@ def _filter_event_files(event_files, min_year, max_year):
for event_file in event_files:
year = int(re.search("^([0-9]{4})", os.path.basename(event_file)).group(1))
if min_year <= year <= max_year:
logger.info("including file %s", event_file)
logger.debug("including file %s", event_file)
result.append(event_file)

return result


def _update(output_root=None, min_year=1871, max_year=2019, overwrite=False):
def _create_database(retrosheet_data, event_files):
    logger.info(f"creating database with {len(event_files)} files")

    retrosheet_data.create_database()
    retrosheet_data.initialize_table(retrosheet_data.df_from_file(event_files[0]))
    for event_file in tqdm(event_files[1:]):
        retrosheet_data.update_table(retrosheet_data.df_from_file(event_file))


def _update(output_root=None, min_year=1871, max_year=2019, create_database=False):
output_root = output_root or pathlib.Path(__file__).parent.parent / "assets"
_validate_path(output_root)
retrosheet_data = RetrosheetData(output_root)
@@ -56,7 +66,5 @@ def _update(output_root=None, min_year=1871, max_year=2019, overwrite=False):
glob.glob(os.path.join(target, "event", "regular", "*EV*")), min_year, max_year
)

retrosheet_data.create_database()
retrosheet_data.initialize_table(retrosheet_data.df_from_file(event_files[0]))
for event_file in event_files[1:]:
retrosheet_data.update_table(retrosheet_data.df_from_file(event_file))
    if create_database:
        _create_database(retrosheet_data, event_files)
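The `_create_database` helper in this file follows a replace-then-append pattern: the first event file creates the `event` table and every later file is appended to it. A minimal sketch of that pattern using only the standard-library `sqlite3` module, assuming a toy three-column schema and hand-written rows in place of parsed event files. The real code goes through pandas `to_sql`; the `initialize_table` and `update_table` functions here are stand-ins, not the package's methods:

```python
import sqlite3

# Toy stand-ins for parsed event files: one list of rows per file.
# Column names come from the README output; the rows are illustrative only.
parsed_files = [
    [("OAK198204060", "CAL", 1), ("OAK198204060", "CAL", 1)],
    [("OAK198209260", "KCA", 8)],
    [("OAK198209260", "KCA", 9)],
]


def update_table(conn, rows):
    """Append rows, comparable to to_sql(..., if_exists='append')."""
    conn.executemany("INSERT INTO event VALUES (?, ?, ?)", rows)


def initialize_table(conn, rows):
    """Drop and recreate the table, comparable to to_sql(..., if_exists='replace')."""
    conn.execute("DROP TABLE IF EXISTS event")
    conn.execute(
        "CREATE TABLE event (GAME_ID TEXT, AWAY_TEAM_ID TEXT, INN_CT INTEGER)"
    )
    update_table(conn, rows)


conn = sqlite3.connect(":memory:")
initialize_table(conn, parsed_files[0])  # first file creates the table
for rows in parsed_files[1:]:  # remaining files are appended
    update_table(conn, rows)
conn.commit()

row_count = conn.execute("SELECT COUNT(*) FROM event").fetchone()[0]
print(row_count)
```

One consequence of indexing `event_files[0]` directly: an empty file list (for example, a year range that matches nothing) would raise `IndexError`, so a guard before calling `_create_database` may be worth considering.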
53 changes: 52 additions & 1 deletion pybaseballdatana/data/sources/retrosheet/data.py
@@ -2,6 +2,7 @@
import psycopg2
import pandas as pd
import logging
import glob

from sqlalchemy import create_engine
from pychadwick.chadwick import Chadwick
@@ -32,6 +33,21 @@ def create_database(self):
os.makedirs(self.db_dir, exist_ok=True)
self._engine = create_engine(f"sqlite:///{self.db_path}", echo=False)

    @property
    def event_files(self):
        return sorted(
            glob.glob(
                os.path.join(
                    self.data_root,
                    "retrosheet",
                    "retrosheet-master",
                    "event",
                    "regular",
                    "*EV*",
                )
            )
        )

@property
def engine(self):
if not self._engine:
@@ -43,13 +59,48 @@ def initialize_table(self, df, conn=None):
df.to_sql("event", conn, index=False, if_exists="replace")

def update_table(self, df, conn=None):
logger.info("updating table with %s", df.GAME_ID.iloc[0])
logger.debug("updating table with %s", df.GAME_ID.iloc[0])
conn = conn or self.engine
df.to_sql("event", conn, index=False, if_exists="append")

def query(self, query):
return pd.read_sql_query(query, self.engine)

    def df_from_team_id(self, team_id):
        for suffix in ["EVA", "EVN"]:
            event_file = os.path.join(
                self.db_dir,
                "retrosheet-master",
                "event",
                "regular",
                f"{team_id}.{suffix}",
            )
            if os.path.exists(event_file):
                logger.debug("loading events from %s", event_file)
                return self.df_from_file(event_file)

        found_remote_file = False
        for suffix in ["EVA", "EVN"]:
            event_url = os.path.join(
                "https://raw.githubusercontent.com/"
                "chadwickbureau/"
                "retrosheet/"
                "master/"
                "event/"
                "regular",
                f"{team_id}.{suffix}",
            )
            logger.debug("loading event from URL %s", event_url)
            try:
                remote_df = self.df_from_file(event_url)
                found_remote_file = True
                return remote_df
            except ValueError:
                logger.debug(f"cannot find remote file {event_url}")

        if not found_remote_file:
            raise FileNotFoundError(f"cannot locate team_id {team_id}")

    def df_from_file(self, file_path):
        games = self.chadwick.games(file_path)
        return self.chadwick.games_to_dataframe(games)
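A possible review note on `df_from_team_id`: it builds the remote URL with `os.path.join`, which uses the platform's path separator and would therefore insert a backslash on Windows. A hedged alternative sketch that joins the URL parts with `/` explicitly; `BASE_URL` and `candidate_event_urls` are hypothetical names, not part of pybbda:

```python
# Base location of the regular-season event files, as used in the diff above.
BASE_URL = (
    "https://raw.githubusercontent.com/"
    "chadwickbureau/retrosheet/master/event/regular"
)


def candidate_event_urls(team_id, suffixes=("EVA", "EVN")):
    """Return the URLs to try, in order, for a year+team ID such as '1982OAK'."""
    return [f"{BASE_URL}/{team_id}.{suffix}" for suffix in suffixes]


urls = candidate_event_urls("1982OAK")
for url in urls:
    print(url)
```

Iterating over this list and returning on the first successful parse would also make the `found_remote_file` flag unnecessary, since the function could simply raise `FileNotFoundError` after the loop.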
16 changes: 14 additions & 2 deletions pybaseballdatana/data/tools/update.py
@@ -46,6 +46,12 @@ def _parse_args():
action="store_true",
help="Overwrite files if they exist",
)
parser.add_argument(
"--create-event-database",
required=False,
action="store_true",
help="Create a sqlite database for retrosheet event files",
)
parser.add_argument(
"--min-year",
required=False,
@@ -71,7 +77,9 @@ def _parse_args():
return parser.parse_args(sys.argv[1:])


def update_source(data_root, data_source, min_year, max_year, num_threads, overwrite):
def update_source(
data_root, data_source, min_year, max_year, num_threads, overwrite, create_database
):
if data_source == "Lahman":
update_lahman(data_root)
elif data_source == "BaseballReference":
@@ -86,7 +94,10 @@ def update_source(data_root, data_source, min_year, max_year, num_threads, overw
)
elif data_source == "retrosheet":
update_retrosheet(
data_root, min_year=min_year, max_year=max_year, overwrite=overwrite
data_root,
min_year=min_year,
max_year=max_year,
create_database=create_database,
)
else:
raise ValueError(data_source)
@@ -121,6 +132,7 @@ def main():
args.max_year,
args.num_threads,
args.overwrite,
args.create_event_database,
)
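A `store_true` flag such as `--create-event-database` defaults to `False` and is exposed on the parsed namespace with dashes converted to underscores, which is why `main` reads `args.create_event_database`. A small self-contained sketch with a stripped-down parser; only two of the tool's options are reproduced, and the `--data-source` default shown is an assumption:

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(description="Illustrative subset of the update tool")
    # The default value here is an assumption; the diff does not show it.
    parser.add_argument("--data-source", required=False, default="all")
    parser.add_argument(
        "--create-event-database",
        required=False,
        action="store_true",
        help="Create a sqlite database for retrosheet event files",
    )
    return parser


parser = build_parser()
default_args = parser.parse_args(["--data-source", "retrosheet"])
flag_args = parser.parse_args(["--data-source", "retrosheet", "--create-event-database"])
print(default_args.create_event_database, flag_args.create_event_database)
```

Because the flag is off by default, the CI step added in `.circleci/config.yml` does not build the database.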


1 change: 1 addition & 0 deletions requirements.txt
@@ -8,3 +8,4 @@ psycopg2==2.8.4
scipy==1.4.1
pychadwick==0.2.2
sqlalchemy==1.3.13
tqdm==4.46.1
2 changes: 1 addition & 1 deletion setup.cfg
@@ -1,5 +1,5 @@
[bumpversion]
current_version = 0.1.1
current_version = 0.1.2

[flake8]
max-line-length = 90
10 changes: 9 additions & 1 deletion setup.py
@@ -2,9 +2,16 @@

with open("README.md", "r") as fh:
    long_description = fh.read()

def process_line(line):
    return line.strip().split("=")[0]

with open("requirements.txt", "r") as fh:
    install_requires = [process_line(line) for line in fh.readlines() if len(line)>1]

setup(
name="pybbda",
version="0.1.1",
version="0.1.2",
author="Ben Dilday",
author_email="ben.dilday.phd@gmail.com",
description="Baseball data and analysis in Python",
@@ -19,4 +26,5 @@
],
package_data={"pybaseballdatana": ["*.csv"]},
include_package_data=True,
install_requires=install_requires,
)
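A possible review note on `process_line`: it keeps everything before the first `=`, which works for the `name==version` pins in this `requirements.txt` but would mangle other specifiers (for example, `scipy>=1.4` becomes `scipy>`) and would pass comment lines through unchanged. A more defensive sketch, offered as one possible alternative rather than the author's approach; `requirement_name` and `read_requirements` are hypothetical helpers:

```python
import re

# A leading run of name characters; stops at the first specifier, extra, or space.
_NAME_RE = re.compile(r"^[A-Za-z0-9][A-Za-z0-9._-]*")


def requirement_name(line):
    """Return the bare package name from a requirements line, or None to skip it."""
    line = line.split("#", 1)[0].strip()  # drop comments and surrounding whitespace
    if not line or line.startswith("-"):  # skip blanks and pip options such as -r or -e
        return None
    match = _NAME_RE.match(line)
    return match.group(0) if match else None


def read_requirements(lines):
    """Collect package names, dropping lines that yield no name."""
    names = (requirement_name(line) for line in lines)
    return [name for name in names if name]


sample = ["scipy==1.4.1\n", "tqdm==4.46.1\n", "\n", "# a comment\n", "pandas>=1.0\n"]
install_requires = read_requirements(sample)
print(install_requires)
```

Note that either version discards the version pins, so `install_requires` ends up unpinned while `requirements.txt` stays pinned.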
37 changes: 37 additions & 0 deletions tests/data/test_retrosheet/test_retrosheet.py
@@ -0,0 +1,37 @@
import glob
import pytest

from pybaseballdatana.data import RetrosheetData


@pytest.fixture
def retrosheet_data():
    return RetrosheetData()


def test_retrosheet_data(retrosheet_data):
    event_files = retrosheet_data.event_files
    assert event_files

    event_file = event_files[-1]
    df = retrosheet_data.df_from_file(event_file)
    nrow, ncol = df.shape
    assert nrow > 1000
    assert ncol == 159


def test_retrosheet_data_from_url(retrosheet_data):
    _ = retrosheet_data.df_from_file(
        "https://raw.githubusercontent.com/"
        "chadwickbureau/retrosheet/master/event/regular/"
        "1982OAK.EVA"
    )


def test_retrosheet_data_from_team_id(retrosheet_data):
    _ = retrosheet_data.df_from_team_id("1982OAK")


def test_retrosheet_data_from_team_id_missing(retrosheet_data):
    with pytest.raises(FileNotFoundError):
        _ = retrosheet_data.df_from_team_id("1870OAK")
