The goal of this project is to build a machine learning model to predict the Formula One World Constructors’ Championship Standings for the upcoming 2023 season.
The files listed below include all of my code and model-building steps. My entire project, including the full thought and work process behind it, is explained and presented here.
Packages used: tidyverse, tidymodels, parsnip, kknn, recipes, workflows, glmnet, magrittr, ranger, naniar, visdat, dplyr, ggplot2, ggthemes, corrplot, vip, themis, kableExtra, ISLR.
Some of these packages are necessary for the model-building process, while others make the coding more concise and the visual presentation cleaner.
The following files are a representation of my overall workflow. I put raw code in .R script files and saved important arguments or variables for later use in the correspondingly named .rda files.
The R script file read_data.R includes the code used to read in the csv files.
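As a rough illustration of what this step looks like, the sketch below reads a few csv files with readr; the file names and paths are assumptions, not the project's actual data files.

```r
# A minimal sketch of read_data.R; file names and paths are placeholders,
# not the project's actual data files.
library(readr)

results <- read_csv("data/results.csv")
races   <- read_csv("data/races.csv")
drivers <- read_csv("data/drivers.csv")

# save the raw data objects for use in later scripts
save(results, races, drivers, file = "data/raw_data.rda")
```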
The modify_data.R file includes code used to manipulate and join the data sets. Initial data cleaning is also executed in this R script file, which can range from converting timestamps into workable numeric variables to streamlining several related variables into one useful parameter.
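A hedged sketch of that kind of cleaning is shown below; the data set and column names (results, races, fastestLapTime, raceId) are illustrative assumptions rather than the project's exact code.

```r
# Illustrative cleaning and joining step; object and column names are assumptions.
library(dplyr)
library(tidyr)

# convert "M:SS.mmm" lap-time strings into numeric seconds
results_clean <- results %>%
  separate(fastestLapTime, into = c("lap_min", "lap_sec"),
           sep = ":", convert = TRUE) %>%
  mutate(fastest_lap_sec = 60 * lap_min + lap_sec) %>%
  select(-lap_min, -lap_sec)

# join race information onto the results
race_results <- results_clean %>%
  left_join(races, by = "raceId")
```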
Exploratory data analysis code is included in the R script file eda.R. This file includes code used to do further cleaning with a focus on missing data. It also includes some visual exploratory data analysis, mostly looking at possible surface-level trends and relationships between variables, which provides good initial insight before considering potential models.
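The missing-data and correlation checks this involves might look like the sketch below, using naniar, visdat, and corrplot; the race_results object is carried over from the earlier sketch and is an assumption.

```r
# Illustrative EDA; `race_results` is an assumed object from the earlier sketch.
library(dplyr)
library(naniar)
library(visdat)
library(corrplot)

vis_miss(race_results)      # overview of missingness across the data set
gg_miss_var(race_results)   # count of missing values per variable

# correlation plot of the numeric variables
race_results %>%
  select(where(is.numeric)) %>%
  cor(use = "pairwise.complete.obs") %>%
  corrplot(method = "circle")
```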
This file includes the steps used to set up the machine learning models. This involves splitting the data into training and testing sets and building a recipe with the desired response variable and predictors. The tidymodels recipe() function allows us to dummy code categorical predictors and impute missing values in the predictors within the step of creating the recipe. I further set up k-fold cross-validation and apply different machine learning models to the recipe. I developed the following models to allow a thorough discussion of the truly best-fitting model.
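A minimal sketch of this setup, assuming a cleaned data frame f1_data with a numeric points response (both names are placeholders), might look like:

```r
# Sketch of the split, recipe, and cross-validation setup; `f1_data` and
# the `points` response are placeholder names, not the project's exact ones.
library(tidymodels)

set.seed(123)
f1_split <- initial_split(f1_data, prop = 0.75, strata = points)
f1_train <- training(f1_split)
f1_test  <- testing(f1_split)

f1_recipe <- recipe(points ~ ., data = f1_train) %>%
  step_impute_linear(all_numeric_predictors()) %>%  # impute missing numeric values
  step_dummy(all_nominal_predictors()) %>%          # dummy code categorical predictors
  step_normalize(all_numeric_predictors())

f1_folds <- vfold_cv(f1_train, v = 10)
```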
- linear regression
- polynomial regression
- k-nearest neighbors (knn)
- elastic net linear regression
- elastic net with lasso regression
- elastic net with ridge regression
- random forest
To build the models, we use the following steps (the sketch after this list illustrates them for a single model):

- set up each model with its tuning parameters, the engine, and the regression mode
- set up a workflow() with each model and the recipe
- set up a tuning grid with grid_regular() and levels for the tuned parameters
- tune each model with tune_grid() using the corresponding workflow, the k-fold cross-validation folds, and the tuning grid
- collect the root mean squared error (RMSE) metric of the tuned models and find the lowest RMSE for each model
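For the k-nearest neighbors model, these steps might look roughly as follows; the grid range and levels are illustrative assumptions, and the recipe and folds come from the earlier sketch.

```r
# Illustrative tuning of one model (knn); grid settings are assumptions.
library(tidymodels)
library(kknn)

knn_model <- nearest_neighbor(neighbors = tune()) %>%
  set_engine("kknn") %>%
  set_mode("regression")

knn_wflow <- workflow() %>%
  add_model(knn_model) %>%
  add_recipe(f1_recipe)

knn_grid <- grid_regular(neighbors(range = c(1, 15)), levels = 5)

knn_tuned <- tune_grid(
  knn_wflow,
  resamples = f1_folds,
  grid      = knn_grid
)

# lowest cross-validated RMSE for this model
show_best(knn_tuned, metric = "rmse", n = 1)
```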
The corresponding .R script file is not included here, but the results are saved in this .rda file. I analyzed the performance of the more noteworthy models: elastic net, polynomial regression, knn, and random forest. For a thorough explanation and interpretation of the parameters and performance of these models, refer to the completed presentation here.
After analyzing the RMSE across the tuning parameters, I conclude that the random forest model with parameters mtry = 5, trees = 400, and min_n = 20 is the best-performing model. I then fit that model and analyze the RMSE once again on the testing split.
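A hedged sketch of that final fit, reusing the placeholder objects from the earlier sketches, might look like:

```r
# Final random forest fit and test-set RMSE; object names are the placeholders
# introduced in the earlier sketches, not the project's exact code.
library(tidymodels)
library(ranger)

final_rf <- rand_forest(mtry = 5, trees = 400, min_n = 20) %>%
  set_engine("ranger", importance = "impurity") %>%
  set_mode("regression")

final_wflow <- workflow() %>%
  add_model(final_rf) %>%
  add_recipe(f1_recipe)

final_fit <- fit(final_wflow, data = f1_train)

# evaluate performance on the held-out testing split
augment(final_fit, new_data = f1_test) %>%
  rmse(truth = points, estimate = .pred)
```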