
# Machine Learning {#sec-machineLearning}

## Getting Started {#sec-machineLearningGettingStarted}

### Load Packages {#sec-machineLearningLoadPackages}

```{r}
library("petersenlab")
library("parallel")
library("doParallel")
library("missRanger")
library("powerjoin")
library("caret")
library("gpboost")
library("tidyverse")
```

### Load Data {#sec-machineLearningLoadData}

```{r}
#| eval: false
#| include: false

# Downloaded Data - Processed
load(file = "./data/nfl_players.RData")
load(file = "./data/nfl_teams.RData")
load(file = "./data/nfl_rosters.RData")
load(file = "./data/nfl_rosters_weekly.RData")
load(file = "./data/nfl_schedules.RData")
load(file = "./data/nfl_combine.RData")
load(file = "./data/nfl_draftPicks.RData")
load(file = "./data/nfl_depthCharts.RData")
load(file = file.path(path, "/OneDrive - University of Iowa/Teaching/Courses/Fantasy Football/Data/nfl_pbp.RData", fsep = ""))
load(file = file.path(path, "/OneDrive - University of Iowa/Teaching/Courses/Fantasy Football/Data/nfl_4thdown.RData", fsep = ""))
load(file = file.path(path, "/OneDrive - University of Iowa/Teaching/Courses/Fantasy Football/Data/nfl_participation.RData", fsep = ""))
load(file = "./data/nfl_actualFantasyPoints_weekly.RData")
load(file = "./data/nfl_injuries.RData")
load(file = "./data/nfl_snapCounts.RData")
load(file = "./data/nfl_espnQBR_seasonal.RData")
load(file = "./data/nfl_espnQBR_weekly.RData")
load(file = "./data/nfl_nextGenStats_weekly.RData")
load(file = "./data/nfl_advancedStatsPFR_seasonal.RData")
load(file = "./data/nfl_advancedStatsPFR_weekly.RData")
load(file = "./data/nfl_playerContracts.RData")
load(file = "./data/nfl_ftnCharting.RData")
load(file = "./data/nfl_playerIDs.RData")
load(file = "./data/nfl_rankings_draft.RData")
load(file = "./data/nfl_rankings_weekly.RData")
load(file = file.path(path, "/OneDrive - University of Iowa/Teaching/Courses/Fantasy Football/Data/nfl_expectedFantasyPoints_weekly.RData", fsep = ""))
load(file = file.path(path, "/OneDrive - University of Iowa/Teaching/Courses/Fantasy Football/Data/nfl_expectedFantasyPoints_pbp.RData", fsep = ""))

# Calculated Data - Processed
load(file = "./data/nfl_actualStats_career.RData")
load(file = "./data/nfl_actualStats_seasonal.RData")
load(file = file.path(path, "/OneDrive - University of Iowa/Teaching/Courses/Fantasy Football/Data/player_stats_weekly.RData", fsep = ""))
load(file = file.path(path, "/OneDrive - University of Iowa/Teaching/Courses/Fantasy Football/Data/player_stats_seasonal.RData", fsep = ""))
```

```{r}
# Downloaded Data - Processed
load(file = "./data/nfl_players.RData")
load(file = "./data/nfl_teams.RData")
load(file = "./data/nfl_rosters.RData")
load(file = "./data/nfl_rosters_weekly.RData")
load(file = "./data/nfl_schedules.RData")
load(file = "./data/nfl_combine.RData")
load(file = "./data/nfl_draftPicks.RData")
load(file = "./data/nfl_depthCharts.RData")
load(file = "./data/nfl_pbp.RData")
load(file = "./data/nfl_4thdown.RData")
load(file = "./data/nfl_participation.RData")
load(file = "./data/nfl_actualFantasyPoints_weekly.RData")
load(file = "./data/nfl_injuries.RData")
load(file = "./data/nfl_snapCounts.RData")
load(file = "./data/nfl_espnQBR_seasonal.RData")
load(file = "./data/nfl_espnQBR_weekly.RData")
load(file = "./data/nfl_nextGenStats_weekly.RData")
load(file = "./data/nfl_advancedStatsPFR_seasonal.RData")
load(file = "./data/nfl_advancedStatsPFR_weekly.RData")
load(file = "./data/nfl_playerContracts.RData")
load(file = "./data/nfl_ftnCharting.RData")
load(file = "./data/nfl_playerIDs.RData")
load(file = "./data/nfl_rankings_draft.RData")
load(file = "./data/nfl_rankings_weekly.RData")
load(file = "./data/nfl_expectedFantasyPoints_weekly.RData")
load(file = "./data/nfl_expectedFantasyPoints_pbp.RData")

# Calculated Data - Processed
load(file = "./data/nfl_actualStats_career.RData")
load(file = "./data/nfl_actualStats_seasonal.RData")
load(file = "./data/player_stats_weekly.RData")
load(file = "./data/player_stats_seasonal.RData")
```

### Specify Options {#machineLearningSpecifyOptions}

```{r}
options(scipen = 999) # prevent scientific notation
```

## Overview of Machine Learning {#sec-machineLearningOverview}

Machine learning takes us away from focusing on [causal inference](#sec-causalInference).
Machine learning does not care about which processes are causal—i.e., which processes influence the outcome.
Instead, machine learning cares about prediction—it cares about a predictor variable to the extent that it increases predictive accuracy regardless of whether it is causally related to the outcome.

Machine learning can be useful for leveraging big data and many predictor variables to develop predictive models with greater accuracy.
However, many machine learning techniques are black boxes—it is often unclear how or why certain predictions are made, which can make it difficult to interpret the model's decisions and understand the underlying relationships between variables.
Machine learning tends to be a data-driven, atheoretical technique.
This can result in overfitting.
Thus, when estimating machine learning models, it is common to keep a hold-out sample for use in cross-validation to evaluate how much the model's predictive accuracy and coefficients shrink when the model is applied to new data.
The data that the model is trained on are known as the "training data".
The data that the model was not trained on but is then independently tested on (i.e., the hold-out sample) are the "test data".
Shrinkage occurs because, in the training data, the predictor variables explain some random error variance in addition to systematic variance.
When the model is applied to an independent sample (i.e., the test data), the predictive model will likely not perform quite as well, and the regression coefficients will tend to get smaller (i.e., shrink).
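
As a minimal sketch of this idea, the following chunk simulates data in which only one of 40 predictors is related to the outcome, fits a linear regression to the training data, and compares the variance explained in the training data with the variance explained in the test data.
The data are simulated and all object names (prefixed with `toy_`) are hypothetical; they are not part of the chapter's data.

```{r}
# Simulated illustration of overfitting and shrinkage (not the chapter's data)
set.seed(52242)

toy_nObs <- 200 # number of observations
toy_nPred <- 40 # number of predictors; only x1 is related to the outcome

toy_x <- matrix(rnorm(toy_nObs * toy_nPred), nrow = toy_nObs)
colnames(toy_x) <- paste0("x", seq_len(toy_nPred))
toy_y <- 0.5 * toy_x[, "x1"] + rnorm(toy_nObs)

toy_data <- data.frame(y = toy_y, toy_x)

# Hold out 30% of the rows as test data
toy_testRows <- sample(seq_len(toy_nObs), size = round(0.3 * toy_nObs))
toy_train <- toy_data[-toy_testRows, ]
toy_test <- toy_data[toy_testRows, ]

# Fit the model to the training data only
toy_fit <- lm(y ~ ., data = toy_train)

# Variance explained in the training data vs. squared correlation between
# predicted and observed values in the test data
toy_r2_train <- summary(toy_fit)$r.squared
toy_r2_test <- cor(predict(toy_fit, newdata = toy_test), toy_test$y)^2

c(train = toy_r2_train, test = toy_r2_test)
```

Because the 39 irrelevant predictors can only fit noise, we would generally expect the value for the test data to be lower than the value for the training data.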

If the test data were collected as part of the same processes as the original data and were merely held out for purposes of analysis, this is called internal cross-validation.
If the test data were collected separately from the original data used to train the model, this is called external cross-validation.
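
One common form of internal cross-validation is *k*-fold cross-validation, in which the data are split into *k* folds and each fold takes a turn as the hold-out sample.
The chunk below is a minimal base R sketch using simulated data; the object names (prefixed with `toy_`) and the choice of 10 folds are assumptions made for illustration.

```{r}
# Simulated illustration of 10-fold cross-validation (not the chapter's data)
set.seed(52242)

toy_cv <- data.frame(x = rnorm(100))
toy_cv$y <- 2 * toy_cv$x + rnorm(100)

toy_k <- 10

# Randomly assign each row to one of the folds (10 rows per fold)
toy_folds <- sample(rep(seq_len(toy_k), length.out = nrow(toy_cv)))

# For each fold: train on the other folds, then test on the held-out fold
toy_rmse <- sapply(seq_len(toy_k), function(fold) {
  trainRows <- toy_folds != fold
  fit <- lm(y ~ x, data = toy_cv[trainRows, ])
  predicted <- predict(fit, newdata = toy_cv[!trainRows, ])
  sqrt(mean((toy_cv$y[!trainRows] - predicted)^2)) # root mean squared error
})

# Average out-of-fold prediction error
mean(toy_rmse)
```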

Most machine learning methods were developed with cross-sectional data in mind.
That is, they assume that each person has only one observation on the outcome variable.
However, with longitudinal data, each person has multiple observations on the outcome variable.

When performing machine learning, various approaches may help address this:

- transform data from [long to wide](#sec-longToWide) form, so that each person has only one row
- when designing the training and test sets, keep all measurements from the same person in the same data object (either the training or test set); do not have some measurements from a given person in the training set and other measurements from the same person in the test set
- use a machine learning approach that accounts for the clustered/nested nature of the data
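
The second approach (keeping all of a person's measurements in the same set) can be sketched as follows.
The example samples *players* rather than rows when creating the hold-out set; the data are simulated and the object names (prefixed with `toy_`) are hypothetical.

```{r}
# Simulated long-form data: 10 hypothetical players with 3 seasons each
set.seed(52242)

toy_long <- data.frame(
  id = rep(paste0("player", 1:10), each = 3),
  season = rep(2021:2023, times = 10),
  points = round(rnorm(30, mean = 150, sd = 40))
)

# Sample players (not rows) for the hold-out set
toy_ids <- unique(toy_long$id)
toy_holdoutIDs <- sample(toy_ids, size = ceiling(0.2 * length(toy_ids)))

toy_long_train <- toy_long[!(toy_long$id %in% toy_holdoutIDs), ]
toy_long_test <- toy_long[toy_long$id %in% toy_holdoutIDs, ]

# Check: no player should appear in both sets (expect an empty result)
intersect(toy_long_train$id, toy_long_test$id)
```

If rows were sampled instead of players, the same player could contribute seasons to both the training and test sets, and the test data would no longer be independent of the training data.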

## Types of Machine Learning {#sec-machineLearningTypes}

There are many approaches to machine learning.
This chapter discusses several key ones:

- supervised learning
    - continuous outcome (i.e., regression)
      - linear regression
      - lasso regression
      - ridge regression
      - elastic net regression
    - categorical outcome (i.e., classification)
      - logistic regression
      - support vector machine
      - random forest
      - extreme gradient boosting
- unsupervised learning
    - clustering
    - principal component analysis
- semi-supervised learning
- reinforcement learning
    - deep learning

*Ensemble* machine learning methods combine multiple machine learning approaches with the goal that combining multiple approaches might lead to more accurate predictions than any one method might be able to achieve on its own.
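
As a minimal sketch of the idea, the chunk below fits two different models to the same simulated training data and averages their predictions.
Unweighted averaging is only one simple way to combine models, and is used here as an assumption for illustration; the data and object names (prefixed with `toy_`) are hypothetical.

```{r}
# Simulated illustration of a simple averaging ensemble (not the chapter's data)
set.seed(52242)

toy_ens <- data.frame(x = runif(300, min = -2, max = 2))
toy_ens$y <- sin(toy_ens$x) + rnorm(300, sd = 0.3)

toy_ens_train <- toy_ens[1:200, ]
toy_ens_test <- toy_ens[201:300, ]

# Two different models fit to the same training data
toy_ens_linear <- lm(y ~ x, data = toy_ens_train)
toy_ens_cubic <- lm(y ~ poly(x, 3), data = toy_ens_train)

# Ensemble prediction: the unweighted mean of the two models' predictions
toy_ens_pred <- data.frame(
  linear = predict(toy_ens_linear, newdata = toy_ens_test),
  cubic = predict(toy_ens_cubic, newdata = toy_ens_test)
)
toy_ens_pred$ensemble <- rowMeans(toy_ens_pred[, c("linear", "cubic")])

# Root mean squared error of each set of predictions in the test data
sapply(
  toy_ens_pred,
  function(predicted) sqrt(mean((toy_ens_test$y - predicted)^2))
)
```

An ensemble is not guaranteed to outperform its best component model, so its accuracy should be evaluated in the test data like that of any other model.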

### Supervised Learning {#sec-machineLearningTypesSupervised}

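Supervised learning involves training a model on data for which the values of the outcome variable are known (i.e., labeled data), so that the model learns to predict the outcome from the predictor variables.
When the outcome is continuous, the task is called regression; when the outcome is categorical, the task is called classification.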

Unlike linear and logistic regression, various machine learning techniques can handle multicollinearity, including LASSO regression, ridge regression, and elastic net regression.
Least absolute shrinkage and selection operator (LASSO) regression helps perform selection of which predictor variables to keep in the model by shrinking some coefficients to zero.
Ridge regression shrinks the coefficients of predictor variables toward zero, but not to zero, so it does not perform selection of which predictor variables to retain; this allows it to retain nonzero coefficients for multiple correlated predictor variables in the context of multicollinearity.
Elastic net combines LASSO and ridge regression; it performs selection of which predictor variables to keep by shrinking some coefficients to zero (as in LASSO), and it shrinks the remaining coefficients toward zero to address multicollinearity (as in ridge regression).
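
The relation among the three techniques can be seen by writing them as a single penalized least-squares problem.
The following is one common parameterization (the one used by the `glmnet` package, which this chapter does not otherwise rely on); the symbols are introduced here for illustration, where $N$ is the number of observations, $y_i$ is the outcome, $x_i$ is the vector of predictors, $\beta_0$ is the intercept, $\beta$ is the vector of coefficients, $\lambda \geq 0$ controls the overall strength of the penalty, and $\alpha \in [0, 1]$ controls the mix of the two penalties:

$$
\min_{\beta_0, \beta} \; \frac{1}{2N} \sum_{i=1}^{N} \left( y_i - \beta_0 - x_i^\top \beta \right)^2 + \lambda \left[ \alpha \lVert \beta \rVert_1 + \frac{1 - \alpha}{2} \lVert \beta \rVert_2^2 \right]
$$

In this form, $\alpha = 1$ corresponds to LASSO regression, $\alpha = 0$ corresponds to ridge regression, and values in between correspond to elastic net; $\lambda = 0$ removes the penalty and gives ordinary least squares.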

Unless interactions or nonlinear terms are specified, linear, logistic, LASSO, ridge, and elastic net regression do not account for interactions among the predictor variables or for nonlinear associations between the predictor variables and the outcome variable.
By contrast, random forests and extreme gradient boosting do account for interactions among the predictor variables and for nonlinear associations between the predictor variables and the outcome variable.
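
To illustrate what it means for a regression model to miss an interaction unless it is specified, the chunk below simulates an outcome that depends only on the product of two predictors and compares a model with main effects only to a model that includes the interaction term.
The data are simulated and the object names (prefixed with `toy_`) are hypothetical.

```{r}
# Simulated illustration of an unmodeled interaction (not the chapter's data)
set.seed(52242)

toy_int_n <- 500
toy_int <- data.frame(
  x1 = rnorm(toy_int_n),
  x2 = rnorm(toy_int_n)
)

# The outcome depends only on the product of the two predictors
toy_int$y <- toy_int$x1 * toy_int$x2 + rnorm(toy_int_n, sd = 0.5)

# Main effects only: the interaction is not captured
toy_fit_main <- lm(y ~ x1 + x2, data = toy_int)

# Interaction specified explicitly
toy_fit_int <- lm(y ~ x1 * x2, data = toy_int)

c(
  mainEffectsOnly = summary(toy_fit_main)$r.squared,
  withInteraction = summary(toy_fit_int)$r.squared
)
```

Tree-based methods do not require the analyst to specify such terms in advance, because successive splits on different predictors can represent interactions.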

### Unsupervised Learning {#sec-machineLearningTypesUnsupervised}

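Unsupervised learning involves finding structure in data that have no outcome variable (i.e., unlabeled data); for example, grouping similar observations (clustering) or summarizing many variables with a smaller number of components (principal component analysis).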

We describe [cluster analysis](#sec-clusterAnalysis) in @sec-clusterAnalysis.
We describe [principal component analysis](#sec-pca) in @sec-pca.

### Semi-supervised Learning {#sec-machineLearningTypesSemisupervised}

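Semi-supervised learning combines a small amount of data for which the outcome is known (labeled data) with a larger amount of data for which the outcome is unknown (unlabeled data) to train a model.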

### Reinforcement Learning {#sec-machineLearningTypesReinforcement}

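Reinforcement learning involves an agent that learns which actions to take by interacting with an environment and receiving rewards or penalties, with the goal of maximizing cumulative reward.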

## Data Processing {#sec-machineLearningDataProcessing}

```{r}
#| eval: false
#| include: false

varNames <- c(
  names(nfl_players),
  names(nfl_teams),
  names(nfl_rosters),
  names(nfl_rosters_weekly),
  names(nfl_schedules),
  names(nfl_combine),
  names(nfl_draftPicks),
  names(nfl_depthCharts),
  names(nfl_pbp),
  names(nfl_4thdown),
  names(nfl_participation),
  names(nfl_actualFantasyPoints_player_weekly),
  names(nfl_injuries),
  names(nfl_snapCounts),
  names(nfl_espnQBR_seasonal),
  names(nfl_espnQBR_weekly),
  names(nfl_nextGenStats_weekly),
  names(nfl_advancedStatsPFR_seasonal),
  names(nfl_advancedStatsPFR_weekly),
  names(nfl_playerContracts),
  names(nfl_ftnCharting),
  names(nfl_playerIDs),
  names(nfl_rankings_draft),
  names(nfl_rankings_weekly),
  names(nfl_expectedFantasyPoints_weekly),
  names(nfl_expectedFantasyPoints_pbp)
)

varNames <- unique(varNames)

write.csv(
  varNames,
  file = "./data/varNames.csv",
  row.names = FALSE
)

nfl_players$gsis_id
nfl_rosters$gsis_id
nfl_rosters_weekly$gsis_id
nfl_draftPicks$gsis_id
nfl_depthCharts$gsis_id
nfl_advancedStatsPFR_seasonal$gsis_id

nfl_actualStats_offense_weekly$player_id
nfl_expectedFantasyPoints_weekly$player_id

nfl_rankings$id

nfl_combine$pfr_id
nfl_advancedStatsPFR_seasonal$pfr_id
#nfl_advancedStatsPFR_seasonal$pfr_player_id

nfl_playerIDs$gsis_id
nfl_playerIDs$pfr_id
```

### Prepare Data for Merging {#sec-machineLearningPrepareDataForMerging}

```{r}
# Prepare data for merging
#-todo: calculate years_of_experience
## Use common name for the same (gsis_id) ID variable
nfl_actualFantasyPoints_player_weekly <- nfl_actualFantasyPoints_player_weekly %>% 
  rename(gsis_id = player_id)

nfl_actualFantasyPoints_player_seasonal <- nfl_actualFantasyPoints_player_seasonal %>% 
  rename(gsis_id = player_id)

player_stats_seasonal_offense <- player_stats_seasonal %>% 
  filter(position_group %in% c("QB","RB","WR","TE")) %>% 
  rename(gsis_id = player_id)

player_stats_weekly_offense <- player_stats_weekly %>% 
  filter(position_group %in% c("QB","RB","WR","TE")) %>% 
  rename(gsis_id = player_id)

nfl_expectedFantasyPoints_weekly <- nfl_expectedFantasyPoints_weekly %>% 
  rename(gsis_id = player_id)

## Rename other variables to ensure common names

## Ensure variables with the same name have the same type
nfl_players <- nfl_players %>% 
  mutate(
    birth_date = as.Date(birth_date),
    jersey_number = as.character(jersey_number),
    gsis_it_id = as.character(gsis_it_id),
    years_of_experience = as.integer(years_of_experience))

player_stats_seasonal_offense <- player_stats_seasonal_offense %>% 
  mutate(
    birth_date = as.Date(birth_date),
    jersey_number = as.character(jersey_number),
    gsis_it_id = as.character(gsis_it_id))

nfl_rosters <- nfl_rosters %>% 
  mutate(
    draft_number = as.integer(draft_number))

nfl_rosters_weekly <- nfl_rosters_weekly %>% 
  mutate(
    draft_number = as.integer(draft_number))

nfl_depthCharts <- nfl_depthCharts %>% 
  mutate(
    season = as.integer(season))

nfl_expectedFantasyPoints_weekly <- nfl_expectedFantasyPoints_weekly %>% 
  mutate(
    season = as.integer(season),
    receptions = as.integer(receptions))

## Rename variables
#-todo: rename variables in expected fantasy points so they don't get coalesced with actual points
nfl_draftPicks <- nfl_draftPicks %>%
  rename(
    games_career = games,
    pass_completions_career = pass_completions,
    pass_attempts_career = pass_attempts,
    pass_yards_career = pass_yards,
    pass_tds_career = pass_tds,
    pass_ints_career = pass_ints,
    rush_atts_career = rush_atts,
    rush_yards_career = rush_yards,
    rush_tds_career = rush_tds,
    receptions_career = receptions,
    rec_yards_career = rec_yards,
    rec_tds_career = rec_tds,
    def_solo_tackles_career = def_solo_tackles,
    def_ints_career = def_ints,
    def_sacks_career = def_sacks
  )

# Check duplicate ids
player_stats_seasonal_offense %>% 
  group_by(gsis_id, season) %>% 
  filter(n() > 1) %>% 
  head()

nfl_advancedStatsPFR_seasonal %>% 
  group_by(gsis_id, season) %>% 
  filter(n() > 1, !is.na(gsis_id)) %>% 
  select(gsis_id, pfr_id, season, team, everything()) %>% 
  head()
```

### Merge Data {#sec-machineLearningMergeData}

```{r}
# Create lists of objects to merge, depending on data structure: id; or id-season; or id-season-week
#-todo: remove redundant variables
playerListToMerge <- list(
  nfl_players %>% filter(!is.na(gsis_id)),
  nfl_draftPicks %>% filter(!is.na(gsis_id)) %>% select(-season)
)

playerSeasonListToMerge <- list(
  player_stats_seasonal_offense %>% filter(!is.na(gsis_id), !is.na(season)),
  nfl_advancedStatsPFR_seasonal %>% filter(!is.na(gsis_id), !is.na(season))
)

playerSeasonWeekListToMerge <- list(
  nfl_rosters_weekly %>% filter(!is.na(gsis_id), !is.na(season), !is.na(week)),
  #nfl_actualStats_offense_weekly,
  nfl_expectedFantasyPoints_weekly %>% filter(!is.na(gsis_id), !is.na(season), !is.na(week))
  #nfl_advancedStatsPFR_weekly,
)

playerSeasonWeekPositionListToMerge <- list(
  nfl_depthCharts %>% filter(!is.na(gsis_id), !is.na(season), !is.na(week))
)

# Merge data
playerMerged <- playerListToMerge %>% 
  reduce(
    powerjoin::power_full_join,
    by = c("gsis_id"),
    conflict = coalesce_xy)

playerSeasonMerged <- playerSeasonListToMerge %>% 
  reduce(
    powerjoin::power_full_join,
    by = c("gsis_id","season"),
    conflict = coalesce_xy)

playerSeasonWeekMerged <- playerSeasonWeekListToMerge %>% 
  reduce(
    powerjoin::power_full_join,
    by = c("gsis_id","season","week"),
    conflict = coalesce_xy)

seasonalData <- powerjoin::power_full_join(
  playerSeasonMerged,
  playerMerged %>% select(-age, -years_of_experience, -team, -team_abbr, -team_seq, -current_team_id),
  by = "gsis_id",
  conflict = coalesce_xy
) %>% 
  filter(!is.na(season)) %>% 
  select(gsis_id, season, player_display_name, position, team, games, everything())

seasonalAndWeeklyData <- powerjoin::power_full_join(
  playerSeasonWeekMerged,
  seasonalData,
  by = c("gsis_id","season"),
  conflict = coalesce_xy
) %>% 
  filter(!is.na(week))

# Duplicate cases
seasonalData %>% 
  group_by(gsis_id, season) %>% 
  filter(n() > 1) %>% 
  head()

seasonalAndWeeklyData %>% 
  group_by(gsis_id, season, week) %>% 
  filter(n() > 1) %>% 
  head()
```
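
To make the merging logic concrete, the chunk below is a base R sketch of a full join in which a column that appears in both tables is coalesced into one column (using the value from the first table and, if that is missing, the value from the second).
This is our understanding of what `conflict = coalesce_xy` does in the `powerjoin` calls above; the base R code is an analogue for illustration rather than the `powerjoin` implementation, and the data and object names (prefixed with `toy_`) are hypothetical.

```{r}
# Two hypothetical player-level tables that share an ID and one overlapping column
toy_a <- data.frame(
  gsis_id = c("A", "B", "C"),
  height = c(74, NA, 70)
)

toy_b <- data.frame(
  gsis_id = c("B", "C", "D"),
  height = c(72, 71, 76),
  weight = c(210, 195, 250)
)

# Full join: keep every ID that appears in either table
toy_merged <- merge(
  toy_a,
  toy_b,
  by = "gsis_id",
  all = TRUE,
  suffixes = c(".x", ".y")
)

# Coalesce the conflicting column: use the value from the first table;
# fall back to the value from the second table if the first is missing
toy_merged$height <- ifelse(
  is.na(toy_merged$height.x),
  toy_merged$height.y,
  toy_merged$height.x
)

toy_merged <- toy_merged[, c("gsis_id", "height", "weight")]
toy_merged
```

Note that when both tables have a nonmissing but different value (player "C" above), the value from the first table is kept, so the order of the tables matters.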

### Additional Processing {#sec-mlAdditionalProcessing}

```{r}
# Convert character and logical variables to factors
seasonalData <- seasonalData %>% 
  mutate(
    across(
      where(is.character),
      as.factor
    ),
    across(
      where(is.logical),
      as.factor
    )
  )
```

### Fill in Missing Data for Static Variables {#sec-fillMissingData}

```{r}
seasonalData <- seasonalData %>% 
  arrange(gsis_id, season) %>% 
  group_by(gsis_id) %>% 
  fill(
    player_name, player_display_name, pos, position, position_group,
    .direction = "downup") %>% 
  ungroup()
```

### Lag Fantasy Points {#sec-lagFantasyPoints}

```{r}
seasonalData_lag <- seasonalData %>% 
  arrange(gsis_id, season) %>% 
  group_by(gsis_id) %>% 
  mutate(
    fantasyPoints_lag = lead(fantasyPoints) # the player's fantasy points in their next season in the data (the outcome to predict)
  ) %>% 
  ungroup()

seasonalData_lag %>% 
  select(gsis_id, player_display_name, season, fantasyPoints, fantasyPoints_lag)
```
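
The following toy example shows how `lead()` and `lag()` from `dplyr` behave within each player after grouping.
The data values and object names (prefixed with `toy_`) are made up for illustration.

```{r}
library("dplyr")

# Hypothetical player-season data (values are made up)
toy_seasons <- data.frame(
  gsis_id = c("A", "A", "A", "B", "B"),
  season = c(2021, 2022, 2023, 2022, 2023),
  fantasyPoints = c(100, 150, 120, 80, 95)
)

toy_seasons_lead <- toy_seasons %>% 
  arrange(gsis_id, season) %>% 
  group_by(gsis_id) %>% 
  mutate(
    fantasyPoints_next = lead(fantasyPoints), # next row's points (same player)
    fantasyPoints_prev = lag(fantasyPoints) # previous row's points (same player)
  ) %>% 
  ungroup()

toy_seasons_lead
```

Each player's final season has a missing value for `fantasyPoints_next` because there is no later row for that player.
Both functions operate on rows rather than on calendar years, so if a player has a gap between seasons, the "next" value comes from the player's next available season rather than the following year.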

### Subset to Predictor Variables and Outcome Variable {#sec-subsetToPredictorsAndOutcome}

```{r}
# Inspect the non-numeric variables to help decide which variables to drop
seasonalData_lag %>% select(where(~ inherits(.x, "Date")))
seasonalData_lag %>% select(where(is.character))
seasonalData_lag %>% select(where(is.factor))
seasonalData_lag %>% select(where(is.logical))

dropVars <- c(
  "birth_date", "loaded", "full_name", "player_name", "player_display_name", "display_name", "suffix", "headshot_url", "player", "pos",
  "espn_id", "sportradar_id", "yahoo_id", "rotowire_id", "pff_id", "fantasy_data_id", "sleeper_id", "pfr_id",
  "pfr_player_id", "cfb_player_id", "pfr_player_name", "esb_id", "gsis_it_id", "smart_id",
  "college", "college_name", "team_abbr", "current_team_id", "college_conference", "draft_club", "status_description_abbr",
  "status_short_description", "short_name", "headshot", "uniform_number", "jersey_number", "first_name", "last_name",
  "football_name", "team")

seasonalData_lag_subset <- seasonalData_lag %>% 
  dplyr::select(-any_of(dropVars))
```

### Separate by Position {#sec-separateByPosition}

```{r}
seasonalData_lag_subsetQB <- seasonalData_lag_subset %>% 
  filter(position == "QB") %>% 
  select(
    gsis_id, season, games, gs, years_of_experience, age, ageCentered20, ageCentered20Quadratic,
    height, weight, rookie_year, draft_number,
    fantasy_points, fantasy_points_ppr, fantasyPoints, fantasyPoints_lag,
    completions:rushing_2pt_conversions, special_teams_tds, contains(".pass"), contains(".rush"))

seasonalData_lag_subsetRB <- seasonalData_lag_subset %>% 
  filter(position == "RB") %>% 
  select(
    gsis_id, season, games, gs, years_of_experience, age, ageCentered20, ageCentered20Quadratic,
    height, weight, rookie_year, draft_number,
    fantasy_points, fantasy_points_ppr, fantasyPoints, fantasyPoints_lag,
    carries:special_teams_tds, contains(".rush"), contains(".rec"))

seasonalData_lag_subsetWR <- seasonalData_lag_subset %>% 
  filter(position == "WR") %>% 
  select(
    gsis_id, season, games, gs, years_of_experience, age, ageCentered20, ageCentered20Quadratic,
    height, weight, rookie_year, draft_number,
    fantasy_points, fantasy_points_ppr, fantasyPoints, fantasyPoints_lag,
    carries:special_teams_tds, contains(".rush"), contains(".rec"))

seasonalData_lag_subsetTE <- seasonalData_lag_subset %>% 
  filter(position == "TE") %>% 
  select(
    gsis_id, season, games, gs, years_of_experience, age, ageCentered20, ageCentered20Quadratic,
    height, weight, rookie_year, draft_number,
    fantasy_points, fantasy_points_ppr, fantasyPoints, fantasyPoints_lag,
    carries:special_teams_tds, contains(".rush"), contains(".rec"))
```

### Split into Test and Training Data {#sec-splitTestTraining}

CURRENTLY (WILL CHANGE):

- seasonalData: 1999-2023
- seasonalData_lag: 1999-2022 (predicting fantasy points in 2023)
- newData_seasonal: 2023 (to be used for predicting fantasy points in 2024)

to create:

- seasonalData_lag_all: 1999-2023 (predicting fantasy points in 2024)
- seasonalData_lag_train: 1999-2022 (predicting fantasy points in 2023), most players
- seasonalData_lag_test: 1999-2022 (predicting fantasy points in 2023), some retired players
- (eventually, newData_seasonal, which is derived from the imputed version of seasonalData_lag_all, and then keeps only data from 2023, thus predicting fantasy points in 2024, and removing the fantasyPoints_lag variable)

To impute:

- seasonalData_lag_all: 1999-2023 (predicting fantasy points in 2024)
- seasonalData_lag_train: 1999-2022 (predicting fantasy points in 2023), most players
- seasonalData_lag_test: 1999-2022 (predicting fantasy points in 2023), some retired players

IMPUTATION AND MODEL PROCESS:

IMPUTATION:

- training data
- test data
- all data (training and test data), used for generating next year predictions

MODEL:

- training data (imputed version of seasonalData_lag_train): all seasons (except 2023 predicting 2024), most players

EVALUATE MODEL PREDICTIONS:

- test data (imputed version of seasonalData_lag_test): all seasons (except 2023 predicting 2024), some retired players

GENERATE MODEL PREDICTIONS:

- next year data (newData_seasonal): 2023 (predicting 2024)

```{r}
seasonalData_lag_qb_all <- seasonalData_lag_subsetQB
seasonalData_lag_rb_all <- seasonalData_lag_subsetRB
seasonalData_lag_wr_all <- seasonalData_lag_subsetWR
seasonalData_lag_te_all <- seasonalData_lag_subsetTE

set.seed(52242)

activeQBs <- unique(seasonalData_lag_qb_all$gsis_id[which(seasonalData_lag_qb_all$season == max(seasonalData_lag_qb_all$season, na.rm = TRUE))])
retiredQBs <- unique(seasonalData_lag_qb_all$gsis_id[which(seasonalData_lag_qb_all$gsis_id %ni% activeQBs)])
numQBs <- length(unique(seasonalData_lag_qb_all$gsis_id))
qbHoldoutIDs <- sample(retiredQBs, size = ceiling(.2 * numQBs)) # holdout 20% of players

activeRBs <- unique(seasonalData_lag_rb_all$gsis_id[which(seasonalData_lag_rb_all$season == max(seasonalData_lag_rb_all$season, na.rm = TRUE))])
retiredRBs <- unique(seasonalData_lag_rb_all$gsis_id[which(seasonalData_lag_rb_all$gsis_id %ni% activeRBs)])
numRBs <- length(unique(seasonalData_lag_rb_all$gsis_id))
rbHoldoutIDs <- sample(retiredRBs, size = ceiling(.2 * numRBs)) # holdout 20% of players

activeWRs <- unique(seasonalData_lag_wr_all$gsis_id[which(seasonalData_lag_wr_all$season == max(seasonalData_lag_wr_all$season, na.rm = TRUE))])
retiredWRs <- unique(seasonalData_lag_wr_all$gsis_id[which(seasonalData_lag_wr_all$gsis_id %ni% activeWRs)])
numWRs <- length(unique(seasonalData_lag_wr_all$gsis_id))
wrHoldoutIDs <- sample(retiredWRs, size = ceiling(.2 * numWRs)) # holdout 20% of players

activeTEs <- unique(seasonalData_lag_te_all$gsis_id[which(seasonalData_lag_te_all$season == max(seasonalData_lag_te_all$season, na.rm = TRUE))])
retiredTEs <- unique(seasonalData_lag_te_all$gsis_id[which(seasonalData_lag_te_all$gsis_id %ni% activeTEs)])
numTEs <- length(unique(seasonalData_lag_te_all$gsis_id))
teHoldoutIDs <- sample(retiredTEs, size = ceiling(.2 * numTEs)) # holdout 20% of players
  
seasonalData_lag_qb_train <- seasonalData_lag_qb_all %>% 
  filter(gsis_id %ni% qbHoldoutIDs)
seasonalData_lag_qb_test <- seasonalData_lag_qb_all %>% 
  filter(gsis_id %in% qbHoldoutIDs)

seasonalData_lag_rb_train <- seasonalData_lag_rb_all %>% 
  filter(gsis_id %ni% rbHoldoutIDs)
seasonalData_lag_rb_test <- seasonalData_lag_rb_all %>% 
  filter(gsis_id %in% rbHoldoutIDs)

seasonalData_lag_wr_train <- seasonalData_lag_wr_all %>% 
  filter(gsis_id %ni% wrHoldoutIDs)
seasonalData_lag_wr_test <- seasonalData_lag_wr_all %>% 
  filter(gsis_id %in% wrHoldoutIDs)

seasonalData_lag_te_train <- seasonalData_lag_te_all %>% 
  filter(gsis_id %ni% teHoldoutIDs)
seasonalData_lag_te_test <- seasonalData_lag_te_all %>% 
  filter(gsis_id %in% teHoldoutIDs)
```

### Impute the Missing Data {#sec-missingDataImputation}

NOTES:

- CONSIDER USING RETIRED PLAYERS AS THE HOLDOUT SAMPLE
- CONSIDER WIDENING THE DATA TO AVOID MULTILEVEL IMPUTATION
- CONSIDER IMPUTING THE TRAINING AND TEST DATA SEPARATELY BY POSITION

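Here is a vignette demonstrating how to impute missing data using `missForest()`: <https://rpubs.com/lmorgan95/MissForest> (archived at: <https://perma.cc/6GB4-2E22>).
In the code below, we use `missRanger()` from the `missRanger` package, which is a faster alternative implementation of the missForest algorithm.
Below, we impute the training data (and all data) separately by position.
We then use the random forests that were fitted when imputing the training data (retained with `keep_forests = TRUE`) to make out-of-sample predictions to fill in the missing data for the testing data.
We do not impute the training and testing data together, so that information from the testing data does not leak into the training data and the two remain separate for the purposes of cross-validation.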
However, we impute all data (training and test data together) for purposes of making out-of-sample predictions from the machine learning models to predict players' performance next season (when actuals are not yet available for evaluating their accuracy).
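
The principle of fitting the imputation model to the training data and then applying it to the test data can be sketched with a deliberately simplified stand-in: mean imputation in base R rather than random-forest imputation.
The helper function `toy_fillWithMeans()`, the data, and the other object names (prefixed with `toy_`) are hypothetical and are used only for illustration.

```{r}
# Hypothetical training and test data with missing values
toy_imp_train <- data.frame(
  age = c(22, 25, NA, 30, 27),
  weight = c(210, NA, 225, 240, 205)
)

toy_imp_test <- data.frame(
  age = c(NA, 29),
  weight = c(230, NA)
)

# "Fit" the imputation model using the training data only (here: column means)
toy_imp_means <- colMeans(toy_imp_train, na.rm = TRUE)

# Hypothetical helper: fill missing values using the stored training means
toy_fillWithMeans <- function(data, means) {
  for (varName in names(means)) {
    missingRows <- is.na(data[[varName]])
    data[[varName]][missingRows] <- means[[varName]]
  }
  data
}

# Apply the same training-based values to both data sets
toy_imp_train_filled <- toy_fillWithMeans(toy_imp_train, toy_imp_means)
toy_imp_test_filled <- toy_fillWithMeans(toy_imp_test, toy_imp_means)

toy_imp_test_filled
```

The missing values in the test data are filled using only information from the training data, so the test data play no role in estimating the imputation model.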

::: {#nte-machineLearningImputeMissingData .callout-note title="Impute missing data for machine learning"}
Note: the following code takes a while to run.
:::

```{r}
# QBs
seasonalData_lag_qb_all_imp <- missRanger::missRanger(
  seasonalData_lag_qb_all,
  pmm.k = 5,
  verbose = 2,
  seed = 52242,
  keep_forests = TRUE)

seasonalData_lag_qb_all_imp

data_all_qb <- seasonalData_lag_qb_all_imp$data
data_all_qb_matrix <- data_all_qb %>%
  mutate(across(where(is.factor), ~ as.numeric(as.integer(.)))) %>% 
  as.matrix()
newData_qb <- data_all_qb %>% 
  filter(season == max(season, na.rm = TRUE)) %>% 
  select(-fantasyPoints_lag)
newData_qb_matrix <- data_all_qb_matrix[
  data_all_qb_matrix[, "season"] == max(data_all_qb_matrix[, "season"], na.rm = TRUE), # keep only rows with the most recent season
  , # all columns
  drop = FALSE]

dropCol_qb <- which(colnames(newData_qb_matrix) == "fantasyPoints_lag")
newData_qb_matrix <- newData_qb_matrix[, -dropCol_qb, drop = FALSE]

seasonalData_lag_qb_train_imp <- missRanger::missRanger(
  seasonalData_lag_qb_train,
  pmm.k = 5,
  verbose = 2,
  seed = 52242,
  keep_forests = TRUE)

seasonalData_lag_qb_train_imp

data_train_qb <- seasonalData_lag_qb_train_imp$data
data_train_qb_matrix <- data_train_qb %>%
  mutate(across(where(is.factor), ~ as.numeric(as.integer(.)))) %>% 
  as.matrix()

seasonalData_lag_qb_test_imp <- predict(
  object = seasonalData_lag_qb_train_imp,
  newdata = seasonalData_lag_qb_test,
  seed = 52242)

data_test_qb <- seasonalData_lag_qb_test_imp
data_test_qb_matrix <- data_test_qb %>%
  mutate(across(where(is.factor), ~ as.numeric(as.integer(.)))) %>% 
  as.matrix()

# RBs
seasonalData_lag_rb_all_imp <- missRanger::missRanger(
  seasonalData_lag_rb_all,
  pmm.k = 5,
  verbose = 2,
  seed = 52242,
  keep_forests = TRUE)

seasonalData_lag_rb_all_imp

data_all_rb <- seasonalData_lag_rb_all_imp$data
data_all_rb_matrix <- data_all_rb %>%
  mutate(across(where(is.factor), ~ as.numeric(as.integer(.)))) %>% 
  as.matrix()
newData_rb <- data_all_rb %>% 
  filter(season == max(season, na.rm = TRUE)) %>% 
  select(-fantasyPoints_lag)
newData_rb_matrix <- data_all_rb_matrix[
  data_all_rb_matrix[, "season"] == max(data_all_rb_matrix[, "season"], na.rm = TRUE), # keep only rows with the most recent season
  , # all columns
  drop = FALSE]

dropCol_rb <- which(colnames(newData_rb_matrix) == "fantasyPoints_lag")
newData_rb_matrix <- newData_rb_matrix[, -dropCol_rb, drop = FALSE]

seasonalData_lag_rb_train_imp <- missRanger::missRanger(
  seasonalData_lag_rb_train,
  pmm.k = 5,
  verbose = 2,
  seed = 52242,
  keep_forests = TRUE)

seasonalData_lag_rb_train_imp

data_train_rb <- seasonalData_lag_rb_train_imp$data
data_train_rb_matrix <- data_train_rb %>%
  mutate(across(where(is.factor), ~ as.numeric(as.integer(.)))) %>% 
  as.matrix()

seasonalData_lag_rb_test_imp <- predict(
  object = seasonalData_lag_rb_train_imp,
  newdata = seasonalData_lag_rb_test,
  seed = 52242)

data_test_rb <- seasonalData_lag_rb_test_imp
data_test_rb_matrix <- data_test_rb %>%
  mutate(across(where(is.factor), ~ as.numeric(as.integer(.)))) %>% 
  as.matrix()

# WRs
seasonalData_lag_wr_all_imp <- missRanger::missRanger(
  seasonalData_lag_wr_all,
  pmm.k = 5,
  verbose = 2,
  seed = 52242,
  keep_forests = TRUE)

seasonalData_lag_wr_all_imp

data_all_wr <- seasonalData_lag_wr_all_imp$data
data_all_wr_matrix <- data_all_wr %>%
  mutate(across(where(is.factor), ~ as.numeric(as.integer(.)))) %>% 
  as.matrix()
newData_wr <- data_all_wr %>% 
  filter(season == max(season, na.rm = TRUE)) %>% 
  select(-fantasyPoints_lag)
newData_wr_matrix <- data_all_wr_matrix[
  data_all_wr_matrix[, "season"] == max(data_all_wr_matrix[, "season"], na.rm = TRUE), # keep only rows with the most recent season
  , # all columns
  drop = FALSE]

dropCol_wr <- which(colnames(newData_wr_matrix) == "fantasyPoints_lag")
newData_wr_matrix <- newData_wr_matrix[, -dropCol_wr, drop = FALSE]

seasonalData_lag_wr_train_imp <- missRanger::missRanger(
  seasonalData_lag_wr_train,
  pmm.k = 5,
  verbose = 2,
  seed = 52242,
  keep_forests = TRUE)

seasonalData_lag_wr_train_imp

data_train_wr <- seasonalData_lag_wr_train_imp$data
data_train_wr_matrix <- data_train_wr %>%
  mutate(across(where(is.factor), ~ as.numeric(as.integer(.)))) %>% 
  as.matrix()

seasonalData_lag_wr_test_imp <- predict(
  object = seasonalData_lag_wr_train_imp,
  newdata = seasonalData_lag_wr_test,
  seed = 52242)

data_test_wr <- seasonalData_lag_wr_test_imp
data_test_wr_matrix <- data_test_wr %>%
  mutate(across(where(is.factor), ~ as.numeric(as.integer(.)))) %>% 
  as.matrix()

# TEs
seasonalData_lag_te_all_imp <- missRanger::missRanger(
  seasonalData_lag_te_all,
  pmm.k = 5,
  verbose = 2,
  seed = 52242,
  keep_forests = TRUE)

seasonalData_lag_te_all_imp

data_all_te <- seasonalData_lag_te_all_imp$data
data_all_te_matrix <- data_all_te %>%
  mutate(across(where(is.factor), ~ as.numeric(as.integer(.)))) %>% 
  as.matrix()
newData_te <- data_all_te %>% 
  filter(season == max(season, na.rm = TRUE)) %>% 
  select(-fantasyPoints_lag)
newData_te_matrix <- data_all_te_matrix[
  data_all_te_matrix[, "season"] == max(data_all_te_matrix[, "season"], na.rm = TRUE), # keep only rows with the most recent season
  , # all columns
  drop = FALSE]

dropCol_te <- which(colnames(newData_te_matrix) == "fantasyPoints_lag")
newData_te_matrix <- newData_te_matrix[, -dropCol_te, drop = FALSE]

seasonalData_lag_te_train_imp <- missRanger::missRanger(
  seasonalData_lag_te_train,
  pmm.k = 5,
  verbose = 2,
  seed = 52242,
  keep_forests = TRUE)

seasonalData_lag_te_train_imp

data_train_te <- seasonalData_lag_te_train_imp$data
data_train_te_matrix <- data_train_te %>%
  mutate(across(where(is.factor), ~ as.numeric(as.integer(.)))) %>% 
  as.matrix()

seasonalData_lag_te_test_imp <- predict(
  object = seasonalData_lag_te_train_imp,
  newdata = seasonalData_lag_te_test,
  seed = 52242)

data_test_te <- seasonalData_lag_te_test_imp
data_test_te_matrix <- data_test_te %>%
  mutate(across(where(is.factor), ~ as.numeric(as.integer(.)))) %>% 
  as.matrix()
```

## Identify Cores for Parallel Processing {#sec-coresParallel}

```{r}
#| include: false

num_cores <- parallel::detectCores()
num_true_cores <- parallel::detectCores(logical = FALSE)
```

```{r}
#| eval: false

num_cores <- parallel::detectCores() - 1 # leave one core free
num_true_cores <- parallel::detectCores(logical = FALSE) - 1
```

```{r}
num_cores
```

## Fitting the Traditional Regression Models {#sec-fittingModels-regression}

### Regression with One Predictor {#sec-regressionOnePredictor}

### Regression with Multiple Predictors {#sec-regressionMultiplePredictors}

## Fitting the Machine Learning Models {#sec-fittingModels-machineLearning}

### Least Absolute Shrinkage and Selection Operator (LASSO) {#sec-lasso}

### Ridge Regression {#sec-ridgeRegression}

### Elastic Net {#sec-elasticNet}

### Random Forest Machine Learning {#sec-randomForest}

#### Cross-Sectional Data {#sec-randomForestCrossSectionalData}

```{r}
cl <- parallel::makeCluster(num_cores)
doParallel::registerDoParallel(cl)

set.seed(52242)

randomForest_qb <- caret::train(
  fantasyPoints_lag ~ ., # use all predictors
  data = data_train_qb, # imputed training data
  method = "rf",
  trControl = trainControl(
    method = "cv",
    number = 10)) # 10-fold cross-validation

randomForest_rb <- caret::train(
  fantasyPoints_lag ~ ., # use all predictors
  data = data_train_rb, # imputed training data
  method = "rf",
  trControl = trainControl(
    method = "cv",
    number = 10)) # 10-fold cross-validation

randomForest_wr <- caret::train(
  fantasyPoints_lag ~ ., # use all predictors
  data = data_train_wr, # imputed training data
  method = "rf",
  trControl = trainControl(
    method = "cv",
    number = 10)) # 10-fold cross-validation

randomForest_te <- caret::train(
  fantasyPoints_lag ~ ., # use all predictors
  data = data_train_te, # imputed training data
  method = "rf",
  trControl = trainControl(
    method = "cv",
    number = 10)) # 10-fold cross-validation

parallel::stopCluster(cl)

print(randomForest_qb)
print(randomForest_rb)
print(randomForest_wr)
print(randomForest_te)

newData_seasonalQB_imp$ximp$fantasyPoints_lag <- predict(
  randomForest_qb,
  newdata = newData_seasonalQB_imp$ximp
)

newData_seasonalRB_imp$ximp$fantasyPoints_lag <- predict(
  randomForest_rb,
  newdata = newData_seasonalRB_imp$ximp
)

newData_seasonalWR_imp$ximp$fantasyPoints_lag <- predict(
  randomForest_wr,
  newdata = newData_seasonalWR_imp$ximp
)

newData_seasonalTE_imp$ximp$fantasyPoints_lag <- predict(
  randomForest_te,
  newdata = newData_seasonalTE_imp$ximp
)

newData_seasonalQB$fantasyPoints_lag <- newData_seasonalQB_imp$ximp$fantasyPoints_lag
newData_seasonalRB$fantasyPoints_lag <- newData_seasonalRB_imp$ximp$fantasyPoints_lag
newData_seasonalWR$fantasyPoints_lag <- newData_seasonalWR_imp$ximp$fantasyPoints_lag
newData_seasonalTE$fantasyPoints_lag <- newData_seasonalTE_imp$ximp$fantasyPoints_lag

newData_seasonalQB <- left_join(
  newData_seasonalQB,
  newData_seasonal %>% select(gsis_id, player_display_name, team),
  by = "gsis_id"
)

newData_seasonalRB <- left_join(
  newData_seasonalRB,
  newData_seasonal %>% select(gsis_id, player_display_name, team),
  by = "gsis_id"
)

newData_seasonalWR <- left_join(
  newData_seasonalWR,
  newData_seasonal %>% select(gsis_id, player_display_name, team),
  by = "gsis_id"
)

newData_seasonalTE <- left_join(
  newData_seasonalTE,
  newData_seasonal %>% select(gsis_id, player_display_name, team),
  by = "gsis_id"
)

newData_seasonalQB %>%
  arrange(-fantasyPoints_lag) %>% 
  select(gsis_id, player_display_name, fantasyPoints_lag)

newData_seasonalRB %>%
  arrange(-fantasyPoints_lag) %>% 
  select(gsis_id, player_display_name, fantasyPoints_lag)

newData_seasonalWR %>%
  arrange(-fantasyPoints_lag) %>% 
  select(gsis_id, player_display_name, fantasyPoints_lag)

newData_seasonalTE %>%
  arrange(-fantasyPoints_lag) %>% 
  select(gsis_id, player_display_name, fantasyPoints_lag)
```
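The `trainControl(method = "cv", number = 10)` setting above requests 10-fold cross-validation. As a rough illustration of what that resampling scheme does (not of how `caret` implements it internally), the sketch below runs 10-fold cross-validation by hand with a linear model on R's built-in `mtcars` data. The model and data are stand-ins chosen only so the example is self-contained:

```{r}
#| eval: false

set.seed(52242)

toy_k <- 10
toy_n <- nrow(mtcars)

# Randomly assign each row to one of k folds of (nearly) equal size
toy_fold_id <- sample(rep(seq_len(toy_k), length.out = toy_n))

toy_fold_rmse <- numeric(toy_k)

for (fold in seq_len(toy_k)) {
  in_train <- toy_fold_id != fold

  # Fit on the other k - 1 folds
  toy_fit <- lm(mpg ~ wt + hp, data = mtcars[in_train, ])

  # Predict the held-out fold and record its RMSE
  toy_pred <- predict(toy_fit, newdata = mtcars[!in_train, ])
  toy_fold_rmse[fold] <- sqrt(mean((mtcars$mpg[!in_train] - toy_pred)^2))
}

# Cross-validated estimate: average of the k held-out errors
toy_cv_rmse <- mean(toy_fold_rmse)
toy_cv_rmse
```

Every row is used for validation exactly once, so the averaged error reflects performance on data the model did not see during fitting.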

#### Longitudinal Data {#sec-randomForestLongitudinalData}

[@Hu2023]

```{r}
#| eval: false

library("LongituRF")

smerf <- LongituRF::MERF(
  X = seasonalData_subsetQB_imp$ximp[,c("passing_epa")] %>% as.matrix(),
  Y = seasonalData_subsetQB$fantasyPoints_lag,
  Z = seasonalData_subsetQB_imp$ximp[,c("pacr")] %>% as.matrix(),
  id = seasonalData_subsetQB$gsis_id,
  time = seasonalData_subsetQB_imp$ximp[,c("ageCentered20")] %>% as.matrix(),
  ntree = 500,
  sto = "BM")

smerf$forest # the fitted random forest (obtained at the last iteration)
smerf$random_effects # the predicted random effects for each player
smerf$omega # the predicted stochastic processes
plot(smerf$Vraisemblance) # evolution of the log-likelihood
smerf$OOB # OOB error at each iteration
```

### *k*-Fold Cross-Validation {#sec-kfoldCV}

### Leave-One-Out (LOO) Cross-Validation {#sec-looCV}

### Combining Tree-Boosting with Mixed Models {#sec-treeBoosting}

The approach in this section is adapted from the following tutorial:
<https://towardsdatascience.com/mixed-effects-machine-learning-for-longitudinal-panel-data-with-gpboost-part-iii-523bb38effc>

#### Process Data {#sec-treeBoostingProcessData}

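A gamma likelihood requires strictly positive outcome values, so if one is used, non-positive values of the outcome must first be replaced with a small positive constant: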

```{r}
#data_train_qb_matrix[,"fantasyPoints_lag"][data_train_qb_matrix[,"fantasyPoints_lag"] <= 0] <- 0.01
```
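As a minimal sketch of that replacement step, the example below applies the same idea to a small made-up matrix. The object `toy_matrix` and the floor value of 0.01 are illustrative assumptions; the appropriate floor depends on the scale of the outcome:

```{r}
#| eval: false

# Hypothetical toy matrix with some non-positive outcome values
toy_matrix <- cbind(
  fantasyPoints_lag = c(-2.5, 0, 12.3, 250.1),
  ageCentered20 = c(2, 5, 8, 11))

toy_floor <- 0.01

# which() ignores missing values, so rows with NA outcomes are left untouched
toy_nonpositive_rows <- which(toy_matrix[, "fantasyPoints_lag"] <= 0)
toy_matrix[toy_nonpositive_rows, "fantasyPoints_lag"] <- toy_floor

toy_matrix[, "fantasyPoints_lag"]
```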

#### Specify Predictor Variables {#sec-treeBoostingSpecifyPredictors}

```{r}
pred_vars_qb <- data_train_qb_matrix %>% 
  as_tibble() %>% 
  select(-fantasyPoints_lag, -ageCentered20, -ageCentered20Quadratic) %>% # -gsis_id
  names()

pred_vars_qb_categorical <- "gsis_id" # to specify categorical predictors
```

#### Specify General Model Options {#sec-treeBoostingSpecifyGeneralModelOptions}

```{r}
model_likelihood <- "gaussian" # alternative: "gamma"
nrounds <- 1000
```

#### Identify Optimal Tuning Parameters {#sec-treeBoostingOptimalTuningParameters}

```{r}
# Partition training data into inner training data and validation data
ntrain_qb <- dim(data_train_qb_matrix)[1]

set.seed(52242)
valid_tune_idx_qb <- sample.int(ntrain_qb, as.integer(0.2*ntrain_qb))

folds_qb <- list(valid_tune_idx_qb)

# Specify parameter grid, gp_model, and gpb.Dataset
param_grid_qb <- list(
  "learning_rate" = c(1,0.1,0.01),
  "max_depth" = c(1,2,3,5,10),
  "min_data_in_leaf" = c(10,100,1000),
  "lambda_l2" = c(0,1,10))

other_params_qb <- list(num_leaves = 2^10)

gp_model_qb <- GPModel(
  group_data = data_train_qb_matrix[,"gsis_id"],
  likelihood = model_likelihood,
  group_rand_coef_data = cbind(
    data_train_qb_matrix[,"ageCentered20"],
    data_train_qb_matrix[,"ageCentered20Quadratic"]),
  ind_effect_group_rand_coef = c(1,1))

gp_data_qb <- gpb.Dataset(
  data = data_train_qb_matrix[,pred_vars_qb],
  categorical_feature = pred_vars_qb_categorical,
  label = data_train_qb_matrix[,"fantasyPoints_lag"])

# Find optimal tuning parameters
opt_params_qb <- gpb.grid.search.tune.parameters(
  param_grid = param_grid_qb,
  params = other_params_qb,
  num_try_random = NULL,
  folds = folds_qb,
  data = gp_data_qb,
  gp_model = gp_model_qb,
  nrounds = nrounds,
  early_stopping_rounds = 10,
  verbose_eval = 1,
  metric = "mse")

opt_params_qb
```
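With `num_try_random = NULL`, the grid search above is deterministic and tries every combination in the parameter grid. Because `folds_qb` holds a single vector of validation indices, each combination is scored on one hold-out set. The sketch below uses base R only to show how many combinations that implies and how the 20% hold-out split is formed. The `toy_*` objects, including the training-set size of 500, are hypothetical:

```{r}
#| eval: false

# Same grid values as above, copied here so the sketch is self-contained
toy_param_grid <- list(
  learning_rate = c(1, 0.1, 0.01),
  max_depth = c(1, 2, 3, 5, 10),
  min_data_in_leaf = c(10, 100, 1000),
  lambda_l2 = c(0, 1, 10))

toy_combinations <- expand.grid(toy_param_grid)
nrow(toy_combinations) # 3 * 5 * 3 * 3 = 135 combinations

# A single 20% hold-out fold for a hypothetical training set of 500 rows
set.seed(52242)
toy_ntrain <- 500
toy_valid_idx <- sample.int(toy_ntrain, as.integer(0.2 * toy_ntrain))
toy_inner_train_idx <- setdiff(seq_len(toy_ntrain), toy_valid_idx)

length(toy_valid_idx)       # 100 validation rows
length(toy_inner_train_idx) # 400 inner-training rows
```

Each of the 135 combinations is fit for up to `nrounds` boosting iterations, so the cost of the search grows quickly as values are added to the grid.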

#### Specify Model and Tuning Parameters {#sec-treeBoostingSpecifyModel}

```{r}
gp_model_qb <- GPModel(
  group_data = data_train_qb_matrix[,"gsis_id"],
  likelihood = model_likelihood,
  group_rand_coef_data = cbind(
    data_train_qb_matrix[,"ageCentered20"],
    data_train_qb_matrix[,"ageCentered20Quadratic"]),
  ind_effect_group_rand_coef = c(1,1))

gp_data_qb <- gpb.Dataset(
  data = data_train_qb_matrix[,pred_vars_qb],
  categorical_feature = pred_vars_qb_categorical,
  label = data_train_qb_matrix[,"fantasyPoints_lag"])

params_qb <- list(
  learning_rate = 1,
  max_depth = 10,
  num_leaves = 2^10,
  min_data_in_leaf = 10,
  lambda_l2 = 0,
  num_threads = num_cores)

nrounds_qb <- 7000 # identify optimal number of trees through iteration and cross-validation

#gp_model_qb$set_optim_params(params = list(optimizer_cov = "nelder_mead")) # to speed up model estimation
```

#### Fit Model {#sec-treeBoostingFitModel}

```{r}
gp_model_fit_qb <- gpb.train(
  data = gp_data_qb,
  gp_model = gp_model_qb,
  nrounds = nrounds_qb,
  params = params_qb) # verbose = 0

summary(gp_model_qb) # Estimated random effects model
```

#### Evaluate Accuracy of Model on Test Data {#sec-treeBoostingModelAccuracy}

```{r}
# Test Model on Test Data
pred_test_qb <- predict(
  gp_model_fit_qb,
  data = data_test_qb_matrix[,pred_vars_qb],
  group_data_pred = data_test_qb_matrix[,"gsis_id"],
  group_rand_coef_data_pred = cbind(
    data_test_qb_matrix[,"ageCentered20"],
    data_test_qb_matrix[,"ageCentered20Quadratic"]),
  predict_var = FALSE,
  pred_latent = FALSE)

y_pred_test_qb <- pred_test_qb[["response_mean"]]
cbind(y_pred_test_qb, data_test_qb_matrix[,"fantasyPoints_lag"])

petersenlab::accuracyOverall(
  predicted = y_pred_test_qb,
  actual = data_test_qb_matrix[,"fantasyPoints_lag"],
  dropUndefined = TRUE
)
```
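To make the kind of accuracy indices involved more concrete, the sketch below computes three common ones by hand on made-up values. It is not a reproduction of `petersenlab::accuracyOverall()`. The `toy_*` vectors are hypothetical, and the sign convention for mean error (predicted minus actual) is an assumption of this example:

```{r}
#| eval: false

# Hypothetical predicted and actual fantasy points for four players
toy_predicted <- c(250, 180, 300, 120)
toy_actual <- c(240, 200, 310, 100)

toy_error <- toy_predicted - toy_actual # 10, -20, -10, 20

toy_me <- mean(toy_error)            # mean error (bias): 0
toy_mae <- mean(abs(toy_error))      # mean absolute error: 15
toy_rmse <- sqrt(mean(toy_error^2))  # root mean squared error: sqrt(250)

c(ME = toy_me, MAE = toy_mae, RMSE = toy_rmse)
```

A mean error of zero alongside a nonzero MAE shows that over- and under-predictions can cancel out, which is why bias and error magnitude are reported separately.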

#### Generate Predictions for Next Season {#sec-treeBoostingModelPredictions}

```{r}
# Generate model predictions for next season
pred_nextYear_qb <- predict(
  gp_model_fit_qb,
  data = newData_qb_matrix[,pred_vars_qb],
  group_data_pred = newData_qb_matrix[,"gsis_id"],
  group_rand_coef_data_pred = cbind(
    newData_qb_matrix[,"ageCentered20"],
    newData_qb_matrix[,"ageCentered20Quadratic"]),
  predict_var = FALSE,
  pred_latent = FALSE)

newData_qb$fantasyPoints_lag <- pred_nextYear_qb$response_mean

# Merge with player names
newData_qb <- left_join(
  newData_qb,
  nfl_playerIDs %>% select(gsis_id, name),
  by = "gsis_id"
)

newData_qb %>% 
  arrange(-fantasyPoints_lag) %>% 
  select(name, fantasyPoints_lag, fantasyPoints)
```

## Conclusion {#sec-machineLearningConclusion}

::: {.content-visible when-format="html"}

## Session Info {#sec-machineLearningSessionInfo}

```{r}
sessionInfo()
```

:::