GitHub | Kaggle | LinkedIn
Author: Declan Costello
Welcome to my analysis of the 2011 TOUR Championship at East Lake Golf Club. The primary objective of this project is to:
Develop an expected strokes model to identify player performance
I hope to contribute meaningful insights to the golf community through this project. Although the 2011 TOUR Championship took place over a decade ago and the tournament's rules have since changed, its extensive shot-level dataset remains a valuable resource. If you happen to come across another complete shot-level dataset, I would greatly appreciate it if you could share it with me! I encourage you to check out the JavaScript visuals on NBViewer and the Streamlit Dashboard!
This repo is organized as follows:
TOUR-Championship-Strokes-Gained-Analysis
│
├── Data/
├── CITATION
├── README.md
├── CODE_OF_CONDUCT.md
│
├── EDA/
│   ├── EDA.ipynb
│   ├── SGperHole.ipynb
│   ├── SGperRound.ipynb
│   ├── SGperDrive.ipynb
│   ├── FeatureEngineering.ipynb
│   └── EDAUtils/
│
├── Creating Model/
│   ├── LazyPredict.ipynb
│   ├── PuttingModel.ipynb
│   ├── ApproachModel.ipynb
│   └── OptimizingUtils/
│
├── Applying Model/
│   ├── SGCreation.ipynb
│   └── SGAnalysis.ipynb
│
└── Streamlit Dashboard/
This project uses a security linter, a code formatter, a type checker, and a code linter to keep the code robust. Together they help identify and mitigate security vulnerabilities, maintain a consistent coding style, enforce type safety, and catch potential errors early in development, which makes the project more reliable and maintainable.
Security Linter | Code Formatting | Type Checking | Code Linting |
---|---|---|---|
bandit | ruff-format | mypy | ruff |
This dataset consists of shot-level data from the PGA TOUR Championship. The TOUR Championship differs from other tournaments in that only the top 30 golfers compete and there is no cut after the second round, which ensures consistent data from highly skilled golfers across all 4 rounds. Note that the dataset lacks data from the playoff that occurred, which is crucial for understanding the tournament's conclusion. Also, landing in the rough at East Lake doesn't necessarily disadvantage a player. Despite the challenge it presents, the ball could still have a favorable lie, which the golfer might have chosen strategically.
I analyze the data, focusing on feature engineering to understand, clean, and refine the dataset. This process guides model selection and validates assumptions, while also uncovering insights through visualization. By addressing data quality and recognizing patterns early on, I establish a solid foundation for the project. For instance, exploring Strokes Gained (SG) at the round, hole, and drive levels helps us make assumptions for building a model to examine SG on a shot-level basis later.
I analyze the Strokes Gained distribution for each round of the Championship, revealing player performance trends during the tournament. This examination on a round-by-round basis helps uncover patterns in golfers' strategies and identifies challenges posed by difficult pin locations on the course.
- All rounds have a promising mean of 0
- Round 3 seemed to be the most chaotic, as there was a significant variance in player performance throughout the day
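As a minimal sketch of the round-level calculation, assume Strokes Gained per round is the field's average score for that round minus the player's score. Under that assumption the mean of 0 follows by construction, so the spread is the more informative number. The player names and scores below are made up for illustration and are not taken from the dataset.

```python
from statistics import fmean, pstdev

# Hypothetical round scores (player -> strokes), not taken from the dataset.
scores_by_round = {
    1: {"Player A": 68, "Player B": 70, "Player C": 72},
    3: {"Player A": 64, "Player B": 71, "Player C": 75},
}


def strokes_gained_per_round(scores):
    """Return the field average minus each player's score for one round."""
    field_average = fmean(list(scores.values()))
    return {player: field_average - strokes for player, strokes in scores.items()}


sg_by_round = {
    rnd: strokes_gained_per_round(scores) for rnd, scores in scores_by_round.items()
}

for rnd, sg in sg_by_round.items():
    values = list(sg.values())
    print(f"Round {rnd}: mean SG = {fmean(values):.2f}, std = {pstdev(values):.2f}")
```

A larger standard deviation for a round is what "chaotic" means in the bullets above: the same field average, but players spread further from it.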
In this analysis, I investigate the distribution of Strokes Gained for each hole of every round of the Championship. Notably, Mahan ties Haas in Strokes Gained on the 72nd hole, a significant moment in the tournament. However, Haas ultimately secured victory in the playoff!
- Players appear to keep performing in line with their round 1 results
- Poorly performing players seem to give up come the back 9 of round 3
Here I explore the distribution of Strokes Gained vs Driving Distance Gained and Driving Accuracy Gained for each drive of the Championship. Both Driving Distance and Driving Accuracy are normalized per hole before totalling. Happy to say my analysis aligns with Data Golf's Course Fit Tool.
- Driving Accuracy has a strong correlation to Strokes Gained per Hole
- Driving Distance has only a slight correlation to Strokes Gained per Hole
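The per-hole normalization and the correlation check can be sketched as follows. This assumes "normalized per hole" means subtracting the field's average drive on that hole before totalling per player, which may differ from what the notebooks actually do. The drives and Strokes Gained totals are hypothetical.

```python
from collections import defaultdict
from math import sqrt
from statistics import fmean

# Hypothetical drives: (player, hole, distance in yards). Not real tournament data.
drives = [
    ("Player A", 1, 300.0), ("Player B", 1, 290.0), ("Player C", 1, 280.0),
    ("Player A", 2, 310.0), ("Player B", 2, 315.0), ("Player C", 2, 290.0),
]
# Hypothetical total Strokes Gained per player over the same holes.
strokes_gained = {"Player A": 1.5, "Player B": 0.5, "Player C": -2.0}


def distance_gained(drives):
    """Subtract each hole's field average from every drive, then total per player."""
    by_hole = defaultdict(list)
    for _, hole, yards in drives:
        by_hole[hole].append(yards)
    hole_average = {hole: fmean(yards) for hole, yards in by_hole.items()}
    totals = defaultdict(float)
    for player, hole, yards in drives:
        totals[player] += yards - hole_average[hole]
    return dict(totals)


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x, mean_y = fmean(xs), fmean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


gained = distance_gained(drives)
players = sorted(strokes_gained)
r = pearson([gained[p] for p in players], [strokes_gained[p] for p in players])
print(gained, round(r, 3))
```

Normalizing per hole first keeps long holes from dominating the totals, so a player is compared only with what the field did from the same tee.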
The Stacked Expected Strokes Model leverages the power of ensemble learning by combining predictions from multiple base models to enhance accuracy and robustness. Notably, I've developed separate models for putting and approach scenarios, utilizing different input features tailored to each situation. This approach allows for more precise predictions by optimizing the model's focus on specific aspects of gameplay, ultimately leading to improved performance and insights in golf analytics. Furthermore, this model will eventually enable a granular analysis of shot-by-shot Strokes Gained, a significant departure from previous hole-by-hole and round-by-round evaluations. By harnessing the Stacked Expected Strokes Model's predictive capabilities, I'll unlock the ability to evaluate each shot's impact on overall performance, offering unprecedented insights into golfer performance. Additionally, I'm unconcerned about data leakage since I'll be predicting continuous variables while training on discrete data, ensuring the model's integrity and effectiveness in real-world applications.
Because the training target is discrete but the predictions are continuous, I needed to choose among regression models. As with all my models, I made sure to stratify the training and testing data before predicting. Initially, I used LazyPredict to compare a wide range of model options.
- The GradientBoostingRegressor and HistGradientBoostingRegressor models performed the best
- If I had to retrain the model constantly, I would avoid the MLPRegressor, as it was the slowest model in the table below
Model | Adjusted R-Squared | R-Squared | RMSE | Time Taken |
---|---|---|---|---|
GradientBoostingRegressor | 0.85 | 0.85 | 0.46 | 0.93 |
HistGradientBoostingRegressor | 0.85 | 0.85 | 0.46 | 0.60 |
LGBMRegressor | 0.85 | 0.85 | 0.47 | 0.14 |
MLPRegressor | 0.84 | 0.84 | 0.48 | 5.23 |
KNeighborsRegressor | 0.82 | 0.83 | 0.50 | 0.16 |
AdaBoostRegressor | 0.82 | 0.83 | 0.50 | 0.49 |
RandomForestRegressor | 0.82 | 0.82 | 0.50 | 3.46 |
XGBRegressor | 0.82 | 0.82 | 0.50 | 0.24 |
BaggingRegressor | 0.81 | 0.81 | 0.52 | 0.37 |
NuSVR | 0.81 | 0.81 | 0.52 | 3.58 |
ExtraTreesRegressor | 0.80 | 0.80 | 0.53 | 2.02 |
SVR | 0.80 | 0.80 | 0.53 | 3.35 |
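For readers unfamiliar with the columns above, here is a small sketch of how R-Squared, Adjusted R-Squared, and RMSE are computed from predictions. It uses the standard adjusted R-squared formula with `n` samples and `p` features. The `actual` and `predicted` values and the feature count are hypothetical, not output from the models in the table.

```python
from math import sqrt
from statistics import fmean


def regression_metrics(actual, predicted, n_features):
    """Compute R-squared, adjusted R-squared, and RMSE for one set of predictions."""
    n = len(actual)
    mean_actual = fmean(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    r2 = 1 - ss_res / ss_tot
    # The adjustment penalizes R-squared for the number of features used.
    adjusted_r2 = 1 - (1 - r2) * (n - 1) / (n - n_features - 1)
    rmse = sqrt(ss_res / n)
    return {"r2": r2, "adjusted_r2": adjusted_r2, "rmse": rmse}


# Hypothetical strokes-to-hole-out values (discrete) and continuous predictions.
actual = [1, 2, 2, 3, 3, 4]
predicted = [1.2, 1.8, 2.3, 2.7, 3.4, 3.6]
metrics = regression_metrics(actual, predicted, n_features=2)
print({name: round(value, 3) for name, value in metrics.items()})
```

With many samples and few features the adjustment is tiny, which is consistent with the two R-squared columns in the table being nearly identical.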
After finding the top-performing models, I ensembled the best of them into a stack. I used Optuna's CMA-ES sampler to find not only the best parameters for each model in the stack, minimizing MAE, but also the best preprocessing scalers, encoders, imputation methods, and feature selection methods. All trials are fed with the appropriate offline training data from a Feast feature store. I used an MLflow model registry to track all Optuna trials, and Databricks to store production-ready models. Finally, I wrapped the whole tuning process in a Poetry wheel file called 'OptimizingUtils' for reproducibility.
I attempted to prevent bias by stratifying my training data and by using nested stratified cross-validation to prune biased trials. I plan to go a step further by bootstrapping, implementing imbalanced-learning libraries, and exploring Optuna's terminator, distribution, auto sampler, and multiObjectiveStudy features. I evaluate any remaining model bias with SHAP and LIME, which enriches our understanding of the model's predictive behavior. Below, you'll find a SHAP chart for the putting model's LGBMRegressor.
- Super surprised to see "Distance to Edge" matters more than "Distance to Pin" for putting, curious if this would be the case if I had a larger dataset
- "Downhill Slope" and "Elevation Below Ball" are distinct features. Despite their seemingly similar names, they are not the same, which I confirmed with a pairwise correlation check
This chart helps evaluate the model by showing how predicted values compare to actual ones and revealing patterns in prediction errors. The histogram below assesses if errors follow a normal distribution, crucial for reliable predictions.
- Excited to see the residuals have a low standard deviation with a mean hovering around 0
Now that we have a stacked SG machine learning model that works on a shot-by-shot basis, we can use it to gain valuable insights into golfer performance. Using the model after training enables golf analysts, coaches, and players to extract actionable insights, optimize strategies, and refine skills. Ultimately, the model helps stakeholders make informed decisions and improve performance on the golf course.
Now that we have a reliable model, we can use it to identify a player's strengths and weaknesses by subtracting Expected Strokes (xS) from the result of each shot to give us true Strokes Gained (SG). The plots below display Woodland's Total SG and SG Percentile by shot type, providing a clear visualization of his performance across different lies and distances.
- Woodland was very successful gaining strokes on the green
- Woodland's SG Percentile shows that he truly underperformed from 200+ yards out, as opposed to having one or two shots damage his 200+ SG Total
- Woodland only had six shots within 50-100 yards of the pin; perhaps this was by design, to avoid putting himself in a position where he consistently underperforms
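The shot-level calculation can be sketched as follows, assuming the common definition of Strokes Gained for a single shot: expected strokes before the shot, minus expected strokes after it, minus the one stroke taken. The xS values, shot categories, and field totals are hypothetical, and `percentile_rank` is an illustrative helper rather than the project's implementation.

```python
from collections import defaultdict

# Hypothetical shots on one hole: (category, xS before shot, xS after shot).
# xS after a holed shot is 0. Values are illustrative, not model output.
shots = [
    ("200+ yards", 3.3, 2.8),
    ("50-100 yards", 2.8, 1.7),
    ("Green", 1.7, 0.0),
]


def strokes_gained(xs_before, xs_after):
    """Expected strokes removed by the shot, minus the one stroke it cost."""
    return xs_before - xs_after - 1


def percentile_rank(value, field_values):
    """Percentage of field values at or below the given value."""
    return 100 * sum(v <= value for v in field_values) / len(field_values)


# Total SG by shot category for this player.
sg_by_category = defaultdict(float)
for category, xs_before, xs_after in shots:
    sg_by_category[category] += strokes_gained(xs_before, xs_after)

# Hypothetical SG totals on the green for the other players in the field.
field_green_sg = [-0.4, 0.1, 0.3, 0.9, 1.2]
green_percentile = percentile_rank(sg_by_category["Green"], field_green_sg)

for category, total in sg_by_category.items():
    print(f"{category}: {total:+.2f}")
print(f"Green SG percentile: {green_percentile:.0f}")
```

As a sanity check, the shot-level values sum to the starting xS minus the strokes actually taken (3.3 - 3 = 0.3 here). The percentile puts a category total in context against the field, which is how a poor 200+ total can be separated from one or two bad shots.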
Looking back, I wish I had known about Strokes Gained during my time as a caddy. I've come to understand that Strokes Gained provides a more accurate reflection of performance on the hole, while SG Percentiles based on shot location offer deeper insights into a golfer's true abilities. I'm excited to explore more golf-related projects in the future.
- Model Refinement
- External Data
- Player Course History
- Career Earnings
- Equipment
- Biometrics
- Weather
- SVGs
- HCP
- Bayesian Integration