Quality of life improvements #6

Merged: 3 commits, Jul 26, 2024
69 changes: 38 additions & 31 deletions .github/workflows/python-package.yml
@@ -1,6 +1,3 @@
-# This workflow will install Python dependencies, run tests and lint with a variety of Python versions
-# For more information see: https://docs.github.com/en/actions/automating-builds-and-tests/building-and-testing-python
-
 name: Build pyball Python package

 on:
@@ -11,38 +8,48 @@ on:

 jobs:
   build:
-
     runs-on: ubuntu-latest
     strategy:
       fail-fast: false
       matrix:
         python-version: [ "3.12" ]

     steps:
-    - uses: actions/checkout@v4
-    - name: Set up Python ${{ matrix.python-version }}
-      uses: actions/setup-python@v5
-      with:
-        python-version: ${{ matrix.python-version }}
-    - name: Install dependencies
-      run: |
-        python -m pip install --upgrade pip
-        python -m pip install flake8 pytest build poetry
-        poetry install
-        poetry run post-install
-    - name: Lint with flake8
-      run: |
-        # stop the build if there are Python syntax errors or undefined names
-        flake8 . --count --select=E9,F63,F7,F82 --show-source --statistics
-        # exit-zero treats all errors as warnings. The GitHub editor is 127 chars wide
-        flake8 . --count --exit-zero --max-complexity=10 --max-line-length=127 --statistics
-    - name: Test with pytest
-      run: |
-        if [ -d tests ]; then poetry run pytest; fi
-    - name: Build package
-      run: python -m build
-    - name: Upload build artifacts
-      uses: actions/upload-artifact@v4
-      with:
-        name: pyball-package-${{ matrix.python-version }}-${{ github.run_number }}
-        path: dist/*
+      - uses: actions/checkout@v3
+      - name: Set up Python ${{ matrix.python-version }}
+        uses: actions/setup-python@v4
+        with:
+          python-version: ${{ matrix.python-version }}
+      - uses: actions/cache@v3
+        with:
+          path: |
+            ~/.cache/pip
+            ~/.cache/pypoetry
+          key: ${{ runner.os }}-poetry-${{ hashFiles('**/poetry.lock') }}
+          restore-keys: |
+            ${{ runner.os }}-poetry-
+      - name: Install Poetry
+        uses: snok/install-poetry@v1
+      - name: Install dependencies
+        run: |
+          python -m pip install --upgrade pip
+          python -m pip install flake8 pytest build
+          poetry install
+      - name: Install Playwright
+        run: poetry run playwright install
+      - name: Lint with flake8
+        run: |
+          # stop the build if there are Python syntax errors or undefined names
+          flake8 . --count --select=E9,F63,F7,F82 --show-source --statistics
+          # exit-zero treats all errors as warnings. The GitHub editor is 127 chars wide
+          flake8 . --count --exit-zero --max-complexity=10 --max-line-length=127 --statistics
+      - name: Test with pytest
+        run: |
+          if [ -d tests ]; then poetry run pytest; fi
+      - name: Build package
+        run: python -m build
+      - name: Upload build artifacts
+        uses: actions/upload-artifact@v4
+        with:
+          name: pyball-package-${{ matrix.python-version }}-${{ github.run_number }}
+          path: dist/*
11 changes: 4 additions & 7 deletions README.md
@@ -6,9 +6,6 @@ Library for grabbing baseball statistics in Python, designed for use in Jupyter

 Another Python library for getting baseball statistics already exists ([pybaseball](https://github.com/jldbc/pybaseball)); however, pyball just provides barebones functions for retrieving stats from Baseball-Reference and Baseball Savant.

-## Requirements
-- Python 3.10.12
-
 ## Install/Build From Source
 ```
 git clone https://github.com/gdifiore/pyball.git
@@ -21,7 +18,7 @@ python -m venv .venv

 poetry install

-poetry run post-install
+playwright install
 ```

 ## Docs
@@ -34,6 +31,6 @@ Leave any comments or suggestions in [an issue](https://github.com/SummitCode/py

 `pyball` is licensed under the [MIT license](https://github.com/SummitCode/pyball/blob/master/LICENSE)

-## TODO
-- update documentation
-- refactor into classes
+## To-do
+- I think the cache is broken? Or the lookup is slow, investigate.
+- Would like to make a base class of shared functions (_get_soup(), _find_table(), ...) but I kinda hate how python classes work.
9 changes: 0 additions & 9 deletions pyball/__init__.py
@@ -1,12 +1,3 @@
 """
 This is the pyball module.
 """
-
-import subprocess
-
-
-def post_install():
-    """
-    Run the playwright install command.
-    """
-    subprocess.run(["playwright", "install"], check=True)
24 changes: 20 additions & 4 deletions pyball/baseball_reference_player.py
@@ -59,6 +59,8 @@ def __init__(self, url: str):
             raise ValueError(f"Invalid player URL: {url}")
         self.url = url
         self.soup = self._get_soup()
+        if self.soup is None:
+            logger.warning("Failed to retrieve content from URL: %s", self.url)

     def _get_soup(self) -> Optional[BeautifulSoup]:
         """
@@ -90,6 +92,17 @@ def _find_table(self, table_id: str) -> Optional[BeautifulSoup]:
         """
         return self.soup.find("table", id=self.TABLE_IDS[table_id])

+    def _parse_table(self, table: BeautifulSoup):
+        rows = []
+        for row in table.find_all('tr'):
+            # Check if the row has the 'hidden' class
+            if 'hidden' not in row.get('class', []):
+                # Process the row only if it's not hidden
+                cells = row.find_all(['th', 'td'])
+                rows.append([cell.text.strip() for cell in cells])
+
+        return rows
+
     def _get_dataframe(self, table_id: str) -> Optional[pd.DataFrame]:
         """
         Parses the HTML table and returns it as a pandas DataFrame.
@@ -110,11 +123,14 @@ def _get_dataframe(self, table_id: str) -> Optional[pd.DataFrame]:
             return None

         try:
-            df = pd.read_html(str(table))[0]
+            rows = self._parse_table(table)
+            if not rows:
+                logger.warning("No visible rows found in %s stats table (not an MLB player?)", table_id)
+                return None
+
+            # Create DataFrame directly from the parsed rows
+            df = pd.DataFrame(rows[1:], columns=rows[0])
             return df.dropna(how="all")
-        except ValueError as e:
-            logger.error("Error parsing %s stats table (no tables found): %s", table_id, str(e))
-            return None
         except Exception as e:
             logger.error("Error parsing %s stats table: %s", table_id, str(e))
             return None
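
The new `_parse_table` keeps only rows whose `class` list does not contain `hidden`, then treats the first surviving row as the header. The same filtering idea can be sketched without BeautifulSoup, using the standard library's `html.parser` so that it runs with no extra dependencies. This is an illustration under assumptions, not pyball's implementation: `VisibleRowParser` and the sample HTML are hypothetical names and data.

```python
from html.parser import HTMLParser


class VisibleRowParser(HTMLParser):
    """Hypothetical helper: collect cell text from <tr> rows lacking class 'hidden'."""

    def __init__(self):
        super().__init__()
        self.rows = []     # one list of cell strings per visible row
        self._row = None   # cells of the row being read; None when skipping or outside a row
        self._cell = None  # text fragments of the cell being read; None outside a cell

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            classes = (dict(attrs).get("class") or "").split()
            # A hidden row is skipped by leaving self._row as None.
            self._row = None if "hidden" in classes else []
        elif tag in ("th", "td") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("th", "td") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr":
            if self._row is not None:
                self.rows.append(self._row)
            self._row = None


# Made-up sample table: the middle row is hidden and should be dropped.
sample = (
    "<table>"
    "<tr><th>Year</th><th>HR</th></tr>"
    '<tr class="hidden"><td>2022</td><td>0</td></tr>'
    "<tr><td>2023</td><td> 44 </td></tr>"
    "</table>"
)
parser = VisibleRowParser()
parser.feed(sample)
header, *body = parser.rows
print(header, body)
```

As in the diff, the first visible row becomes the header and the rest become data rows, which is the shape `pd.DataFrame(rows[1:], columns=rows[0])` expects.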
3 changes: 3 additions & 0 deletions pyball/baseball_reference_team.py
@@ -59,6 +59,8 @@ def __init__(self, url: str):
             raise ValueError(f"Invalid team URL: {url}")
         self.url = url
         self.soup = self._get_soup()
+        if self.soup is None:
+            logger.warning("Failed to retrieve content from URL: %s", self.url)

     def _get_soup(self) -> Optional[BeautifulSoup]:
         """
@@ -114,6 +116,7 @@ def _get_dataframe(self, table_id: str) -> Optional[pd.DataFrame]:

         try:
             df = pd.read_html(str(table))[0]
+            df = df.iloc[:-1]
             return df.dropna(how="all")
         except ValueError as e:
             logger.error("Error parsing %s stats table (no tables found): %s", table_id, str(e))
39 changes: 23 additions & 16 deletions pyball/savant.py
@@ -9,7 +9,7 @@
 import pandas as pd
 from bs4 import BeautifulSoup

-from pyball.utils import read_url
+from pyball.utils import read_url, is_savant_url

 logging.basicConfig(level=logging.INFO)
 logger = logging.getLogger(__name__)
@@ -43,11 +43,11 @@ class SavantScraper:
     """

     TABLE_IDS = {
-        'percentile': 'percentileRankings',
-        'pitching': 'statcast_stats_pitching',
-        'batting': 'statcast_glance_batter',
-        'batted_ball': 'playeDiscipline',
-        'pitch_tracking': 'detailedPitches'
+        "percentile": "percentileRankings",
+        "pitching": "statcast_stats_pitching",
+        "batting": "statcast_glance_batter",
+        "batted_ball": "playeDiscipline",
+        "pitch_tracking": "detailedPitches",
     }

     def __init__(self, url: str):
@@ -59,6 +59,8 @@ def __init__(self, url: str):
         url : str
             The URL of the Baseball Savant page to scrape.
         """
+        if not is_savant_url(url):
+            raise ValueError(f"Invalid Savant URL: {url}")
         self.url = url
         self.soup = self._get_soup()
         if self.soup is None:
@@ -71,7 +73,8 @@ def _get_soup(self) -> Optional[BeautifulSoup]:
         Returns:
         --------
         BeautifulSoup or None
-            The BeautifulSoup object representing the HTML content of the URL, or None if retrieval failed.
+            The BeautifulSoup object representing the HTML content of the URL,
+            or None if retrieval failed.
         """
         soup = read_url(self.url)
         if soup is None:
@@ -99,7 +102,11 @@ def _find_table(self, table_id: str) -> Optional[BeautifulSoup]:
         if div is not None:
             table = div.find("table")
             if table is None:
-                logger.warning("Table with id '%s' not found for URL: %s", self.TABLE_IDS[table_id], self.url)
+                logger.warning(
+                    "Table with id '%s' not found for URL: %s. Is the player the right position?",
+                    self.TABLE_IDS[table_id],
+                    self.url,
+                )
             return table

     def _get_dataframe(self, table_id: str) -> Optional[pd.DataFrame]:
@@ -122,11 +129,11 @@ def _get_dataframe(self, table_id: str) -> Optional[pd.DataFrame]:
         try:
             df = pd.read_html(str(table))[0]
             df = df.dropna(how="all")
-            if table_id in ['pitching', 'batting'] and not df.empty:
-                df = df.drop(df.index[-1]) # drop last row of MLB average
             return df
         except ValueError as e:
-            logger.error("Error parsing %s table (no tables found): %s", table_id, str(e))
+            logger.error(
+                "Error parsing %s table (no tables found): %s", table_id, str(e)
+            )
             return None
         except Exception as e:
             logger.error("Unexpected error parsing %s table: %s", table_id, str(e))
@@ -141,7 +148,7 @@ def get_percentile_stats(self) -> Optional[pd.DataFrame]:
         pandas.DataFrame or None
             Contains the percentile stats for the player, or None if not found.
         """
-        return self._get_dataframe('percentile')
+        return self._get_dataframe("percentile")

     def get_pitching_stats(self) -> Optional[pd.DataFrame]:
         """
@@ -152,7 +159,7 @@ def get_pitching_stats(self) -> Optional[pd.DataFrame]:
         pandas.DataFrame or None
             Contains the savant pitching stats for the player, or None if not found.
         """
-        return self._get_dataframe('pitching')
+        return self._get_dataframe("pitching")

     def get_batting_stats(self) -> Optional[pd.DataFrame]:
         """
@@ -163,7 +170,7 @@ def get_batting_stats(self) -> Optional[pd.DataFrame]:
         pandas.DataFrame or None
             Contains the savant batting stats for the player, or None if not found.
         """
-        return self._get_dataframe('batting')
+        return self._get_dataframe("batting")

     def get_batted_ball_profile(self) -> Optional[pd.DataFrame]:
         """
@@ -174,7 +181,7 @@ def get_batted_ball_profile(self) -> Optional[pd.DataFrame]:
         pandas.DataFrame or None
             Contains the batted ball profile for the player, or None if not found.
         """
-        return self._get_dataframe('batted_ball')
+        return self._get_dataframe("batted_ball")

     def get_pitch_tracking(self) -> Optional[pd.DataFrame]:
         """
@@ -185,4 +192,4 @@ def get_pitch_tracking(self) -> Optional[pd.DataFrame]:
         pandas.DataFrame or None
             Contains the pitch-specific results for the player, or None if not found.
         """
-        return self._get_dataframe('pitch_tracking')
+        return self._get_dataframe("pitch_tracking")
16 changes: 14 additions & 2 deletions pyball/utils.py
@@ -108,7 +108,7 @@ def is_bbref_player_url(url):
         url (str): The URL to check.

     Returns:
-        bool: True if the URL contains 'players', False otherwise.
+        bool: True if the URL contains 'players' and 'baseball-reference', False otherwise.
     """
     return "players" in url and "baseball-reference" in url

@@ -141,7 +141,7 @@ def is_bbref_team_url(url):
         url (str): The URL to check.

     Returns:
-        bool: True if the URL contains 'teams', False otherwise.
+        bool: True if the URL contains 'teams' and 'baseball-reference', False otherwise.
     """
     return "teams" in url and "baseball-reference" in url

@@ -168,3 +168,15 @@ def make_savant_player_url(last, first, key_mlbam):
     url = base_url + first + "-" + last + "-" + key_mlbam

     return url
+
+def is_savant_url(url):
+    """
+    Checks if the given string is a valid Baseball Savant url.
+
+    Args:
+        url (str): The URL to check.
+
+    Returns:
+        bool: True if the URL contains 'baseballsavant', False otherwise.
+    """
+    return "baseballsavant" in url
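
`is_savant_url` is a substring test, so any string that mentions `baseballsavant` anywhere (a query parameter, for instance) passes it. If a tighter check is ever wanted, one option is to compare the parsed hostname instead. The sketch below is only a suggestion: `is_savant_url_strict` is a hypothetical name that does not exist in pyball, and the hostname `baseballsavant.mlb.com` is an assumption about where the site is served.

```python
from urllib.parse import urlparse

# Assumed canonical host for Baseball Savant pages (not taken from the PR).
SAVANT_HOST = "baseballsavant.mlb.com"


def is_savant_url_strict(url: str) -> bool:
    """Hypothetical stricter check: accept only URLs whose hostname is the Savant host."""
    host = urlparse(url).hostname or ""  # hostname is None when the URL has no scheme
    return host == SAVANT_HOST or host.endswith("." + SAVANT_HOST)


lookalike = "https://example.com/?q=baseballsavant"
print("baseballsavant" in lookalike)      # the substring test accepts this
print(is_savant_url_strict(lookalike))    # the hostname test rejects it
```

One trade-off: a URL written without a scheme has no parsed hostname, so this version rejects it, whereas the substring test would accept it.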
5 changes: 1 addition & 4 deletions pyproject.toml
@@ -4,7 +4,7 @@ build-backend = "poetry.core.masonry.api"

 [tool.poetry]
 name = "pyball"
-version = "1.4.0"
+version = "1.4.1"
 description = "Python3 library for obtaining baseball information"
 authors = ["gdifiore"]
 readme = "README.md"
@@ -19,9 +19,6 @@ requests = "^2.26.0"
 playwright = "^1.45.0"
 lxml = "^5.2.2"

-[tool.poetry.scripts]
-post-install = "pyball:post_install"
-
 [tool.poetry.group.dev.dependencies]
 pytest = "^8.3.1"
 mock = "^5.1.0"