Static Badge      License: MIT      python

Introduction

Data scientists and analysts have developed several metrics for determining a player's value to their team's success. Prominent examples include Value Over Replacement Player (VORP), Box Plus/Minus (BPM), and FiveThirtyEight's Robust Algorithm (using) Player Tracking (and) On/Off Ratings (RAPTOR). We aim to model and extract feature importance scores for such parameters based on how well they predict MVP rankings, then test our findings against unseen data for the most recent five seasons to see if we can correctly predict the MVP rankings. We will experiment with various machine learning approaches and compare our best result to other methods developed by reputable analyst sources.

Click on the Report dropdown menu below to learn about the data, experimental design, results, testing, and conclusions.

Report

Data

We obtained the dataset from JK-Future, who originally scraped the data from Basketball-Reference via automated HTML parsing. The dataset contains statistics for National Basketball Association (NBA) players relevant to determining the Most Valuable Player (MVP) in a season and has 7,329 entries with 53 columns. The dataset is significant in its breadth and depth of coverage.

We store the dataset in mvp_data.csv and load it into DataCleaning_EDA.ipynb, where we perform data cleaning and aggregation.

Click here for details about how we cleaned the data
  • Fill missing values for the Rank, mvp_share, and Trp Dbl (Triple Double) columns
  • Normalize the Trp Dbl column by dividing it by G (the total number of games played in a given season)
  • Convert G (Games) and Season columns to integer data type
  • Filter the entire data frame (df) to include only players that meet the 40-game requirement necessary to be considered for the MVP award
  • Create the Rk_Conf (Conference Ranking) column – calculate conference rankings for each season based on W (the number of wins), then re-rank the conference rankings within each season and conference group
  • Save the edited data frame thus far to mvp_data_edit.csv (we use this in Test.ipynb to merge predicted values with actual and compare results)
  • Drop the Conference and W (Wins) columns
  • Create a separate data frame (df_last) with the data for the most recent five seasons (2018–22), which we use to test our final model and index
  • Check for missing values: We found many missing values for seasons before 1980; for example, 3P (Three-pointers) were not introduced in the NBA until 1979–80, and there are a lot of missing values before then, so we drop any season before 1980
  • Save df and df_last to CSV files (df_clean.csv and df_last.csv)

We discuss additional preprocessing steps in the Experimental Design section below, as these steps relate to the project's feature selection and modeling phases.

The values we seek to predict are in the mvp_share column, representing each season's MVP voting result.

Experimental Design

Click here for details about our hardware and compute resources

We use Rivanna – the University of Virginia's High-Performance Computing (HPC) system – with the following hardware details:

  • System: Linux
  • Release: 4.18.0-425.10.1.el8_7.x86_64
  • Machine: x86_64
  • CPU Cores: 28
  • RAM: 36GB
  • CPU Vendor: AuthenticAMD
  • CPU Model: AMD EPYC 7742 64-Core Processor

Design Overview

Below is an overview of the steps to gather the index values and model results. We detail these steps further in the Feature Selection Process, Modeling, Results, and Testing sections that follow.

Feature Selection Process

In FeatureSelection.ipynb, we load in df_clean.csv as a Pandas DataFrame (df) and perform robust feature selection using the preprocess_and_train function from preptrain.py. The preprocess_and_train function serves to:

  • Impute missing values with the median value for numeric features, scale the features using standardization (subtracting the mean and dividing by the standard deviation) and apply one-hot encoding for categorical features.

  • Apply the preprocessing separately to the training and testing datasets and extract the feature names, removing any prefixes.

  • Train and test eight different models on the preprocessed data and extract the feature importance scores of the top ten predictors. The models are:

    • Random Forest (RF)
    • Decision Tree (DTree)
    • Principal Component Analysis (PCA)
    • Gradient Boosting (GB)
    • Support Vector Regression (SVR)
    • Extra Trees (XTrees)
    • AdaBoost (Ada)
    • Extreme Gradient Boosting (XGB)

For hyperparameter tuning, we define a reasonably extensive parameter grid for each method and use Bayesian optimization with five-fold cross-validation to sample parameter settings from the specified distributions.

We set the n_jobs parameter to -1 in the BayesSearchCV (scikit-optimize) initialization, instructing the search to use all available CPU cores during cross-validation. Training and evaluation for different folds then run concurrently on different cores, which shortens the hyperparameter search.

After running the preprocess_and_train function, we use the print_dict_imps function from helper_functions.py to print tables of the feature importances for each method, which the preprocess_and_train function stores in a Python dictionary. We then use the avg_imps function from helper_functions.py to display the average feature importance across the eight methods.
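As a rough illustration of averaging importances across methods, the sketch below assumes the dictionary maps each method name to a `{feature: importance}` mapping. That layout, the numbers, and the choice to count a missing feature as zero are assumptions; the actual avg_imps function in helper_functions.py may differ.

```python
import pandas as pd

# Hypothetical importances: {method name: {feature: importance}}. Values are made up.
feature_importances = {
    "RF": {"PTS": 0.40, "WS": 0.35, "VORP": 0.25},
    "XTrees": {"PTS": 0.30, "WS": 0.30, "PER": 0.40},
    "GB": {"PTS": 0.50, "VORP": 0.30, "PER": 0.20},
}

# Rows are features, columns are methods; a feature absent from a
# method's list shows up as NaN.
imp_table = pd.DataFrame(feature_importances)

# Assumption: a missing feature counts as zero importance, so features
# chosen by only a few methods are ranked lower.
avg_importance = imp_table.fillna(0.0).mean(axis=1).sort_values(ascending=False)
print(avg_importance.round(3))
```

Replacing `fillna(0.0)` with the default NaN-skipping mean would instead average only over the methods that selected each feature.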

Please refer to the Results section below to see the results of the feature selection process.

Modeling

In Models.ipynb, we use the train_models function from modeling.py to train and test only the ensemble and tree-based methods, as these are best suited for our next task — finding the best model we can and using the feature importance scores to inform our index design.

In Test.ipynb, we load in the selected features, the training dataset, the testing dataset containing the data for the 2018–22 seasons, and the best model from Models.ipynb. We filter the training and testing data to include only the selected features.

We then call the evaluate_model function from helper_functions.py to retrain the best model and predict the mvp_share for the 2018–22 seasons. We then compare the predicted values to the actual values.
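The retrain-and-predict step can be sketched as follows. This is not the evaluate_model function itself: the synthetic data, the `make_toy_data` helper, and the model settings are placeholders, and the R-squared and MSE metrics are the ones that plot_model_performance reports.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.metrics import mean_squared_error, r2_score

rng = np.random.default_rng(42)


def make_toy_data(n_rows):
    """Synthetic features and a target that depends on the first two columns."""
    X = rng.normal(size=(n_rows, 4))
    y = 0.3 * X[:, 0] + 0.2 * X[:, 1] + rng.normal(scale=0.05, size=n_rows)
    return X, y


X_train, y_train = make_toy_data(400)  # stands in for the training seasons
X_test, y_test = make_toy_data(100)    # stands in for the held-out 2018-22 seasons

# Retrain on all training rows, then predict the held-out rows.
model = ExtraTreesRegressor(n_estimators=100, random_state=0)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

r2 = r2_score(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
print(f"R^2 = {r2:.3f}, MSE = {mse:.4f}")
```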

The Results section below discusses the results from our feature selection and modeling processes, and the Testing section contains results from testing our best model and index.

Results

The feature selection process originally produced a set of ten highly correlated features, the most correlated of which relate to scoring, as displayed below in the correlation heatmap:

Points (PTS) largely captures all of these, so we drop FTA, FGA, FG, 2P, and FT. We also drop weight, as it appeared only once.

Moving to the next-highest features in terms of importance, we get:

  1. MP
  2. PTS
  3. PER
  4. VORP
  5. WS
  6. TOV
  7. FG%
  8. STL%
  9. BPM
  10. Rk_Conf

FG% is also highly correlated with PTS, so we also drop that feature. This brings in BPM, which is highly correlated with OBPM, so we drop the latter in favor of the former since it captures both OBPM and DBPM. In replacing FG%, we now look at the next candidate features, DWS and OWS, which are correlated with WS, so we do not include those. The next option is Rk_Year, which is highly correlated with Rk_Conf and likely captures more than just conference ranking, so we include Rk_Year instead of Rk_Conf. Finally, we get AST%. So, our final set of ten features is:

  1. MP = Minutes Played
  2. PTS = Points
  3. PER = Player Efficiency Rating (see Calculating PER for the formula)
  4. VORP = Value Over Replacement Player
  5. WS = Win Shares (see NBA Win Shares for information about how this feature is calculated)
  6. TOV = Turnovers
  7. STL% = Steal percentage
  8. BPM = Box Plus-Minus
  9. Rk_Year = Team Ranking
  10. AST% = Assist percentage

There are still some highly correlated features, but we proceeded with these ten and saved them to df_selected.csv to use for modeling.
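The kind of correlation check that drives these decisions can be sketched as below. The data is synthetic, and the 0.65 cutoff is borrowed from the default threshold of plot_corr_heatmap; whether that exact cutoff was used for the drops described above is an assumption.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
pts = rng.normal(20, 5, n)

# Synthetic example: FGA is built to be nearly redundant with PTS.
toy = pd.DataFrame({
    "PTS": pts,
    "FGA": pts * 0.8 + rng.normal(0, 1, n),
    "STL%": rng.normal(2, 0.5, n),  # unrelated to scoring
})

threshold = 0.65  # default threshold of plot_corr_heatmap
corr = toy.corr().abs()

# Keep only the upper triangle so each pair appears once and the
# diagonal (self-correlation) is ignored.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack()
high_pairs = pairs[pairs > threshold]
print(high_pairs)
```

Each flagged pair is a candidate for dropping one of its two members, as done above for the scoring features that PTS already captures.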

We feed these ten features into the train_models function, which returns several key pieces of information, including the best model. The train_models function also displays neat tables of the feature importance values from each model and a model performance bar chart, as displayed below:

The chart shows that the best model is the Extra Trees Regressor (XTrees), which the train_models function saves to best_model.pkl using the joblib library.
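Saving and reloading a fitted model with joblib looks roughly like this. The toy data and the temporary directory are illustrative only; train_models writes to best_model.pkl in the working directory.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = 0.5 * X[:, 0] + rng.normal(scale=0.1, size=100)

model = ExtraTreesRegressor(n_estimators=50, random_state=0).fit(X, y)

# Write the fitted model to disk and read it back.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "best_model.pkl")
    joblib.dump(model, path)
    restored = joblib.load(path)

# The restored model should reproduce the original predictions.
same_predictions = np.allclose(model.predict(X), restored.predict(X))
print(same_predictions)
```

A pickled model is generally only safe to load with the same scikit-learn version that produced it.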

Testing

We import the best model into Test.ipynb to perform testing on the unseen data.

The chart below displays the predicted values from the best model compared to the actual values; the orange markers represent the predicted value, and the dark blue markers represent the actual value:

The range plot shows that the predicted values for mvp_share are not wildly off for the top four candidates for the 2018–22 seasons. There are some player-year combinations (Damian Lillard, 2018; Nikola Jokic, 2019; and James Harden, 2020) for which the predicted value is very close to the actual.

The table below shows whether the model correctly predicted the top four rankings for the 2018–22 seasons; the model accurately predicts which players are in the top four each season but doesn't always order them correctly.

The model accurately predicts the MVP for each of the five seasons in the test set. The predictions for the 2018 season were perfect in terms of ranking, but the model's rankings for the next four seasons are slightly off. The rankings for 1st and 2nd for the 2019 season are correct, but the model swaps the 3rd and 4th place candidates. For the 2020 season, the model correctly ranks the 1st and 4th place candidates but swaps 2nd and 3rd place. The model correctly ranks the 1st and 2nd place candidates for the 2021 season but places 3rd and 4th out of order. For the 2022 season, the model incorrectly ranks the 2nd and 3rd place candidates but correctly ranks 1st and 4th.
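A comparison like the one described above can be derived by ranking predicted and actual shares within each season. The sketch uses made-up shares and placeholder player labels; the `predicted` column name is an assumption about what the merged frame contains.

```python
import pandas as pd

# Made-up shares for two seasons; names are placeholders.
merged = pd.DataFrame({
    "Season": [2019] * 4 + [2020] * 4,
    "Name": ["A", "B", "C", "D", "E", "F", "G", "H"],
    "mvp_share": [0.93, 0.75, 0.35, 0.20, 0.95, 0.75, 0.45, 0.36],
    "predicted": [0.80, 0.60, 0.25, 0.30, 0.85, 0.40, 0.50, 0.20],
})

# Rank within each season, highest share first.
for value_col, rank_col in [("mvp_share", "actual_rank"),
                            ("predicted", "predicted_rank")]:
    merged[rank_col] = (
        merged.groupby("Season")[value_col]
        .rank(method="first", ascending=False)
        .astype(int)
    )

merged["rank_match"] = merged["actual_rank"] == merged["predicted_rank"]

# Was the actual winner also ranked first by the model in every season?
mvp_correct = merged.loc[merged["actual_rank"] == 1, "rank_match"]
print(merged[["Season", "Name", "actual_rank", "predicted_rank", "rank_match"]])
print("MVP predicted correctly in every season:", bool(mvp_correct.all()))
```

In this toy data the winner is right in both seasons, while 3rd and 4th place are swapped in 2019 and 2nd and 3rd place in 2020, mirroring the pattern reported above.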

Conclusions

The initial feature-selection and index-calculating model performs well at predicting the 1st-place winner of the award but struggles to predict the runners-up. This is likely because the model does not represent how votes for the lower places are distributed. When 100 media members vote on who should win the award, they rank their top 5, with each player receiving points based on how many 1st-, 2nd-, 3rd-, 4th-, and 5th-place votes they receive. The following is how many points each place vote is worth:

  • 1st = 10
  • 2nd = 7
  • 3rd = 5
  • 4th = 3
  • 5th = 1

The player with the most aggregate points wins. This point distribution demonstrates why the model can predict 1st place but can't allocate a point total for the remaining four places. For future enhancements, the model should predict a complete voting distribution for at least the top 10 players in consideration for the award.
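To make the point arithmetic concrete, the sketch below totals points from hypothetical ballot counts. Expressing the total as a share of the maximum attainable points assumes that mvp_share follows the Basketball-Reference-style award-share definition, which the report does not spell out.

```python
# Points awarded per ballot position, as listed above.
POINTS_PER_PLACE = {1: 10, 2: 7, 3: 5, 4: 3, 5: 1}


def award_points(vote_counts):
    """Total points from a {place: number of votes} mapping."""
    return sum(POINTS_PER_PLACE[place] * count
               for place, count in vote_counts.items())


def award_share(vote_counts, n_voters=100):
    """Points won divided by the maximum attainable (all 1st-place votes).

    Assumption: this matches how the mvp_share column is defined.
    """
    max_points = n_voters * POINTS_PER_PLACE[1]
    return award_points(vote_counts) / max_points


# Hypothetical ballot counts for two candidates (100 voters each).
winner = {1: 60, 2: 30, 3: 10}
runner_up = {1: 40, 2: 50, 3: 5, 4: 5}

print(award_points(winner), award_share(winner))        # 860 0.86
print(award_points(runner_up), award_share(runner_up))  # 790 0.79
```

Two candidates with very different ballot distributions can end up with similar totals, which is why predicting only a single share per player loses information about the lower places.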

The criteria for MVP voting have changed over time, meaning any advanced metric used to predict who should win the award will constantly need to be adjusted and taken with a grain of salt. For example, if we look at the most recent MVP race of the 2022–23 season, Joel Embiid won the award over Nikola Jokic even though almost every advanced aggregate stat (e.g., WS, VORP, and RAPTOR) favored Jokic. There are a few notable reasons for this, the main two being voter fatigue and narrative. Voter fatigue refers to how media voters avoid picking the same player to win the award many consecutive times, even if they are the most deserving. Having won two straight years, Jokic may have been hampered by this bias from voters while going for his third consecutive award. As for narrative, this is where the crux of our project lies.

Our goal, along with that of FiveThirtyEight and other data science organizations, was to add objectivity to an inherently subjective topic. Voting is subjective and can be skewed by unaccountable bias and narrative. Along with voter fatigue, media members thought about an era in which Giannis Antetokounmpo, Nikola Jokic, and Joel Embiid constantly placed in the top 5 for voting, yet Embiid never ended up being one of the winners. The narrative of "Jokic or Giannis win again" didn't sit too well in NBA discourse, so naturally, journalists would prefer to have something new to write about. Of course, we don't know every voter's thought process, so all we can do is play with the data and try our best to reach an objective conclusion.

Minimal Reproducible Code

The dropdown menus below contain minimal reproducible code for each of the Jupyter Notebooks:

DataCleaning_EDA

##################################
### Import necessary libraries ###
##################################
import pandas as pd
import numpy as np
import os
os.chdir('...')

#################
### Load data ###
#################
df = pd.read_csv('mvp_data.csv')

#############
### Clean ###
#############
# Fill missing values (assign back instead of chained inplace calls,
# which newer pandas versions warn about)
df['Rank'] = df['Rank'].fillna(0)
df['mvp_share'] = df['mvp_share'].fillna(0.0)
df['Trp Dbl'] = df['Trp Dbl'].fillna(0)

# Normalize Triple Double
df['Trp Dbl'] = df['Trp Dbl'] / df['G']

# Convert 'G' and 'Season' to integer type
df['G'] = df['G'].astype(int)
df['Season'] = df['Season'].astype(int)

# Filter out data based on conditions (copy to avoid SettingWithCopyWarning below)
df = df[(df['G'] > 40) & (df['Season'] <= 2022)].copy()

# Ranking Conference
df['Rk_Conf'] = df.groupby(['Season', 'conference'])['W'].rank(method="dense", ascending=False) + df['Rk_Year']
df['Rk_Conf'] = df.groupby(['Season', 'conference'])['Rk_Conf'].rank(method="dense", ascending=True)

# Create mvp_data_edit.csv
df.to_csv("mvp_data_edit.csv", index=False, encoding="utf-8-sig")

# Drop Wins and Conference
df.drop(columns=['conference', 'W'], inplace=True)

# Sort out seasons we'll use for testing/predictions
df.sort_values(by=['Season'], ascending=False, inplace=True)
df_last = df[df['Season'] > (2022 - 5)] 

# Filter for seasons older than 5 years and drop player names
df = df[df['Season'] <= (2022 - 5)].drop(columns=['name'])

# Filter seasons to 1980 and after, then drop the Season column
df = df[df['Season'] >= 1980].drop(columns=['Season'])

# Save training and test data to .csv files
df.to_csv('df_clean.csv', index=False)
df_last.to_csv('df_last.csv', index=False)

FeatureSelection

##################################
### Import necessary libraries ###
##################################
import pandas as pd
import numpy as np
import os
os.chdir('...')
from preptrain import preprocess_and_train
from helper_functions import (print_dict_imps4x2, 
                              avg_imps, 
                              plot_corr_heatmap)

#################
### Load data ###
#################
df = pd.read_csv('df_clean.csv')
df_last = pd.read_csv('df_last.csv')
labels = df.pop("mvp_share")
stratify = df.pop("Rank")

###########################################################
###                 FEATURE SELECTION                   ###
###   preprocess_and_train function from preptrain.py   ###
###########################################################
(features_rf,
 features_Dtree,
 features_pca, 
 features_gbm,
 features_svr, 
 features_Xtrees,
 features_Ada,
 features_XGB,
 feature_importances) = preprocess_and_train(df, df_last, labels)

# Call function to print top 10 features
avg_imp = avg_imps(feature_importances)

# Save selected features to a list
selected_features = ['MP', 'PTS', 'PER', 'VORP', 'WS', 'TOV', 'STL%', 'BPM', 'Rk_Year', 'AST%']

# Filter training data to only selected features
df_selected = df[selected_features]

# Save to a separate .csv file for modeling
df_selected.to_csv('df_selected.csv', index=False)

Models

##################################
### Import necessary libraries ###
##################################
import pandas as pd
import os
os.chdir('...')
from modeling import train_models
from helper_functions import get_hardware_details

#################
### Load data ###
#################
df = pd.read_csv('df_clean.csv')
labels = df.pop("mvp_share")
df_selected = pd.read_csv('df_selected.csv')
feature_names = list(df_selected.columns)

##################################################
###               FIND BEST MODEL              ###
###   train_models function from modeling.py   ###
##################################################
trained_models, results, best_model_name, best_model = train_models(df_selected,
                                                                    df,
                                                                    labels,
                                                                    feature_names,
                                                                    label_col_name="mvp_share")

Test

##################################
### Import necessary libraries ###
##################################
import numpy as np
import pandas as pd
from sklearn.metrics import r2_score, mean_squared_error
from sklearn.model_selection import train_test_split

import os
os.chdir('...')
from helper_functions import (print_importances, 
                              print_dict_imps, 
                              avg_imps, 
                              percent_formatter, 
                              plot_comparison_for_season,
                              evaluate_model)

import joblib
best_model = joblib.load('best_model.pkl')

#################
### Load data ###
#################
df_selected = pd.read_csv('df_selected.csv')
features = list(df_selected.columns)  # the ten selected features

# Training data: selected features plus the target and the stratification column
df_train = pd.read_csv('df_clean.csv', usecols=features + ['mvp_share', 'Rank'])
labels = df_train.pop('mvp_share')
stratify = df_train.pop('Rank')

# Test data: selected features plus season and player name
df_test = pd.read_csv('df_last.csv', usecols=features + ['Season', 'name'])
df_test = df_test.rename(columns={'name': 'Name'})

#############################################################
###                  RETRAIN AND TEST                     ###
###    evaluate_model function from helper_functions.py   ###
#############################################################
(X_train, 
 X_test, 
 y_train, 
 y_test, 
 merged_df) = evaluate_model(best_model, 
                             df_train, 
                             labels, 
                             df_test, 
                             features,
                             stratify)

#########################
### Visualize results ###
#########################
# Call function to plot predicted vs. actual
plot_comparison_for_season(merged_df, 2022)
plot_comparison_for_season(merged_df, 2021)
plot_comparison_for_season(merged_df, 2020)
plot_comparison_for_season(merged_df, 2019)
plot_comparison_for_season(merged_df, 2018)

###########################################################
### Create df_results for interactive viz in QuickSight ###
###########################################################
# Load full dataset from DataCleaning_EDA.ipynb
df_results = pd.read_csv('mvp_data_edit.csv')

# Filter columns and seasons
df_results = df_results.drop(columns=['conference', 'W']).query('Season >= 1980')

# Calculate and add the index column
feature_importances = best_model.feature_importances_
normalized_importances = feature_importances / np.sum(feature_importances)
index_values = np.dot(df_results[features].values, normalized_importances)
df_results['index'] = index_values

# Rank the index within each season group
df_results['Ranked_Index'] = df_results.groupby('Season')['index'].rank(ascending=False)

# Save to a separate csv
df_results.to_csv('results.csv', index=False)

Manifest

Jupyter Notebooks

  • FeatureSelection.ipynb: Feature selection notebook where we use the preprocess_and_train function from preptrain.py and ensemble the methods to generate the best 10 features.

  • DataCleaning_EDA.ipynb: Exploratory notebook where the data is cleaned; includes some basic EDA.

  • Models.ipynb: Modeling notebook where we use the selected features (from df_selected.csv) to train and evaluate a range of models and extract their feature importance. These results will inform how we weight features in the index.

  • Test.ipynb: Notebook where we test our best model (from Models.ipynb) against the last five seasons. We include some visualizations showing the model prediction versus the actual values.

Data Files

Python Modules (helper functions, classes)

  • preptrain.py: Custom function/pipeline for preprocessing and feature selection.

  • modeling.py: Custom function/pipeline to train the ensemble and tree-based models and extract the best model.

  • helper_functions.py: This module contains various helper functions for system information retrieval, model evaluation, and visualization.

Click here to see the helper functions
  • get_hardware_details():

    Retrieve basic hardware details of the system.

  • print_importances(features, model):

    Print the feature importances of a model.

  • print_dict_imps(feature_importances):

    Print the feature importances in a visually appealing table format side-by-side.

  • avg_imps(feature_importances):

    Calculate the average feature importances across different methods.

  • create_imp_df(model_names, models, feature_names):

    Create a DataFrame of feature importances for each model.

  • plot_corr_heatmap(corr_matrix, selected_feature_names, threshold=0.65, width=7, height=4, show_vals=True):

    Plot a correlation heatmap for selected features.

  • plot_model_performance(model_names, r_sqs, MSE_s):

    Plot the R-squared and MSE values of different regression models.

  • plot_comparison_for_season(df, season):

    Plot the actual vs. predicted mvp_share values.

  • evaluate_model(best_model, df_train, labels, df_test, features, stratify):

    Evaluate the best model and generate predictions.

Other Files

The images folder contains various visualizations and images used in the README.md

The README.md file includes the repository description and the report.

This file includes all of the necessary libraries and versions for running our code.

The best_model.pkl file contains the best model from Models.ipynb.