The objective of this project is to create an accurate as possible machine learning classification model for determining whether or not a baseball team is going to win a game as local.
A dataset gathered from a famous sports website will be used to gain initial insights for this project.
- Missing values handling
- Temporal relationship
- Numerical variables distribution
- Categorical variables
- Dependent variable relantionships
- Correlations
- Conclusions from data
In this step, new features summarizing previous games performance will be generated
In this step, a new analysis using the generated variables in the step will be performed
The features created now correlate better than original ones, however the correlation keeps weak. A model of classification will be constructed with this features as intances, we will try to choose a not flexible model (linear model) because correlation is not strong enough, therefore variance is big and we dont want to fall in an overfit problem
In this step, the created dataset will be cleaned. Secondly a a train and test set will be generated. Finally the sets numerical variables will be normalized using a standard deviation scaler.
Two scaled sets were generated and are ready to apply machine learning models on them (Train set: 8479rows, Test set: 2120 rows)
Naive- Bayes classicator algorithm was choosed because is among one of the best unflexible classificators when large variance is observed in data.
Even though the predictors didnt correlate so well with the output, a classificator with a reasonable accuracy was possible to create, especially for the True predictions of a Local Team to win. (63%)
Application of this classificator for sport betting should be analyzed since the game odds of sportbooks are so tricky and maybe a revenue with this accuracy could no be possible without a correct betting strategy.