The goal of this notebook is to implement classic regression models, implement from scratch the metrics used to measure their quality, and show some plots that illustrate the steps involved. Join me on this fun journey ☕
In this project we use SBUX stock data covering 2019-06-05 to 2024-06-05. The data was extracted directly from Yahoo Finance through its API, using the yfinance library.
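Metric | Formula | Interpretation |
---|---|---|
MSE | $\frac{1}{n}\sum_{i=1}^{n}(y_i-\hat{y}_i)^2$ | Lower values indicate a better fit |
RMSE | $\sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i-\hat{y}_i)^2}$ | Lower values indicate a better fit, same units as $y$ |
MAE | $\frac{1}{n}\sum_{i=1}^{n}\lvert y_i-\hat{y}_i\rvert$ | Lower values indicate a better fit |
RSE | $\frac{\sum_{i=1}^{n}(y_i-\hat{y}_i)^2}{\sum_{i=1}^{n}(y_i-\bar{y})^2}$ | Values closer to 0 indicate a better fit |
RAE | $\frac{\sum_{i=1}^{n}\lvert y_i-\hat{y}_i\rvert}{\sum_{i=1}^{n}\lvert y_i-\bar{y}\rvert}$ | Values closer to 0 indicate a better fit |
R | $\frac{\sum_{i=1}^{n}(y_i-\bar{y})(\hat{y}_i-\bar{\hat{y}})}{\sqrt{\sum_{i=1}^{n}(y_i-\bar{y})^2}\,\sqrt{\sum_{i=1}^{n}(\hat{y}_i-\bar{\hat{y}})^2}}$ | Values closer to 1 or -1 indicate a strong linear relationship |
R2 | $1-\frac{\sum_{i=1}^{n}(y_i-\hat{y}_i)^2}{\sum_{i=1}^{n}(y_i-\bar{y})^2}$ | Values closer to 1 indicate a better fit |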
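
Since the metrics are meant to be written from scratch, here is a minimal NumPy sketch of how the seven of them could be computed. It is not the notebook's actual code: the function name `regression_metrics` is hypothetical, and RSE and RAE are assumed to be the relative squared and relative absolute errors measured against a baseline that always predicts the mean.

```python
import numpy as np


def regression_metrics(y_true, y_pred):
    """Compute the seven metrics with plain NumPy (illustrative sketch)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)

    err = y_true - y_pred          # residuals of the model
    dev = y_true - y_true.mean()   # residuals of the mean baseline

    sse = np.sum(err ** 2)         # sum of squared errors
    sst = np.sum(dev ** 2)         # total sum of squares
    mse = sse / len(y_true)

    return {
        "MSE": mse,
        "RMSE": np.sqrt(mse),
        "MAE": np.mean(np.abs(err)),
        # Assumption: RSE is the relative squared error.
        "RSE": sse / sst,
        # Assumption: RAE is the relative absolute error.
        "RAE": np.sum(np.abs(err)) / np.sum(np.abs(dev)),
        # Pearson correlation between actual and predicted values.
        "R": np.corrcoef(y_true, y_pred)[0, 1],
        "R2": 1.0 - sse / sst,
    }


# Tiny made-up example, not taken from the SBUX data.
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.0, 5.0, 8.0, 9.0]
metrics = regression_metrics(y_true, y_pred)
```

Under these definitions RSE and R2 always sum to 1, which is a handy sanity check on an implementation.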
Using `import yfinance as yf` and `sbux_data = yf.download('SBUX', period='5y')`, we obtain the following dataset.
Notice that the attributes are Date, Open, High, Low, Close and Volume.
The first rows of the dataset are the following:
Date | Open | High | Low | Close | Volume |
---|---|---|---|---|---|
2019-06-05 | 78.790001 | 79.970001 | 78.660004 | 79.959999 | 7437100 |
2019-06-06 | 80.029999 | 81.629997 | 79.900002 | 81.400002 | 10457200 |
2019-06-07 | 81.599998 | 83.330002 | 81.510002 | 82.480003 | 11278800 |
2019-06-10 | 82.849998 | 82.860001 | 81.379997 | 81.930000 | 8102800 |
2019-06-11 | 82.300003 | 82.860001 | 81.849998 | 82.370003 | 6226400 |
Here, we take the Close attribute over the entire history and plot it to see how the stock evolves over time.
Because we want to run regression on time-series data, we need to work with contiguous sequences. Therefore, instead of sampling at random, we take one sequence for the train split and another sequence for the test split.
- The train split goes from Jan 01, 2022 to Mar 01, 2024.
- The test split goes from Mar 01, 2023 to Jun 03, 2024.

We can see this in the following image:
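
A date-based split like the one listed above can be sketched with pandas label slicing. The frame below is a synthetic stand-in for the downloaded prices, and `split_by_date` is a hypothetical helper; the date bounds are simply the ones given in the list.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the downloaded prices (made-up values).
dates = pd.bdate_range("2022-01-03", "2024-06-03")
prices = pd.DataFrame(
    {"Close": np.linspace(80.0, 100.0, len(dates))}, index=dates
)


def split_by_date(frame, start, end):
    """Return the rows whose date lies in [start, end], both ends inclusive."""
    return frame.loc[start:end]


# Bounds copied from the list above.
train = split_by_date(prices, "2022-01-01", "2024-03-01")
test = split_by_date(prices, "2023-03-01", "2024-06-03")
```

Slicing a sorted `DatetimeIndex` with date strings keeps each split as one contiguous sequence, which is what the windowing step needs.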
In order to find patterns in the data, we give it a special structure that we call chunks, i.e. time windows:
- take the data and build chunks; each chunk is a window of shape (7,5)
- 7 is the number of days and 5 is the number of attributes (open, high, low, close, volume)
- for each chunk, we add a target, which is the Close value of the 8th day
- the chunks have an offset of 1 day, i.e. the window slides one day forward to produce the next time window and the next target value
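
The chunking steps above can be sketched as follows. This is an illustration on random data rather than the notebook's code: `make_chunks` is a hypothetical name, and the Close column is assumed to sit at index 3, following the attribute order open, high, low, close, volume.

```python
import numpy as np


def make_chunks(data, close_col=3, window=7):
    """Slide a `window`-day window over `data` with a 1-day offset.

    Returns chunks of shape (n_chunks, window, n_attributes) and, for each
    chunk, the Close value of the day right after the window.
    """
    data = np.asarray(data, dtype=float)
    n_chunks = len(data) - window
    if n_chunks <= 0:
        raise ValueError("need more rows than the window length")
    chunks = np.stack([data[i:i + window] for i in range(n_chunks)])
    targets = data[window:, close_col]
    return chunks, targets


# 20 made-up trading days with 5 attributes each.
rng = np.random.default_rng(0)
data = rng.normal(size=(20, 5))
chunks, targets = make_chunks(data)
```

With 20 days and a 7-day window this yields 13 chunks, because the last 7 days can only serve as a window if an 8th day exists to provide the target.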
In order to use classic regression models, the data must have a shape the models accept. That is why we flatten our data: each chunk previously had a shape of (7,5), and now it has a shape of (35,).
We normalize the data using the standard deviation.
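
As a small sketch of the flattening step, assuming the chunks are stacked in a NumPy array of shape (n_chunks, 7, 5), a single `reshape` is enough. The array here is filled with placeholder numbers.

```python
import numpy as np

# Three placeholder chunks of shape (7, 5).
chunks = np.arange(3 * 7 * 5, dtype=float).reshape(3, 7, 5)

# Keep the chunk axis and merge the (day, attribute) axes into one.
X = chunks.reshape(len(chunks), -1)
```

The flattening is row-major, so the first 5 values of each row are the attributes of day 1, the next 5 those of day 2, and so on.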
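
One common way to normalize with the standard deviation is standardization (z-scores). The sketch below assumes that variant, with the mean and standard deviation estimated on the train split only; the notebook may do it differently. The helper names `fit_standardizer` and `standardize` are hypothetical.

```python
import numpy as np


def fit_standardizer(X_train):
    """Estimate per-feature mean and standard deviation on the train split."""
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0)
    # Guard against constant features to avoid dividing by zero.
    std = np.where(std == 0.0, 1.0, std)
    return mean, std


def standardize(X, mean, std):
    """Apply z = (x - mean) / std feature by feature."""
    return (X - mean) / std


# Made-up flattened features with 35 columns.
rng = np.random.default_rng(0)
X_train = rng.normal(loc=50.0, scale=10.0, size=(100, 35))
X_test = rng.normal(loc=50.0, scale=10.0, size=(20, 35))

mean, std = fit_standardizer(X_train)
X_train_std = standardize(X_train, mean, std)
X_test_std = standardize(X_test, mean, std)
```

Reusing the train statistics on the test split keeps information from the test period out of the preprocessing.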
Random Forest is an ensemble learning method that builds a multitude of decision trees during training and, for regression problems, outputs the mean prediction of the individual trees. It is known for its robustness against overfitting and its high performance.
Gradient Boosting is another ensemble method. It builds trees one at a time, where each new tree helps to correct the errors made by the previously trained trees. Gradient Boosting tends to be more prone to overfitting than Random Forest but can often yield better performance if tuned correctly.
Extra Trees (Extremely Randomized Trees) is similar to Random Forest, with one key difference: instead of searching for the best split point of each feature when building trees, it draws split points at random. This randomization can lead to faster training and sometimes better performance, especially for high-dimensional data.
The regressor models above were used directly from the scikit-learn modules.
- training was done with the `fit()` method on the train split data
- prediction was done with the `predict()` method on the test split data
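
The fit and predict steps above can be sketched as follows. The data is random stand-in data with 35 features, and the hyperparameters (`n_estimators=50`, `random_state=0`) are illustrative assumptions, not the settings used in the notebook.

```python
import numpy as np
from sklearn.ensemble import (
    ExtraTreesRegressor,
    GradientBoostingRegressor,
    RandomForestRegressor,
)

# Made-up data with the same feature width as the flattened chunks.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 35))
y_train = 2.0 * X_train[:, 0] + rng.normal(scale=0.1, size=200)
X_test = rng.normal(size=(50, 35))

# Hyperparameters are illustrative assumptions.
models = {
    "RandomForestReg": RandomForestRegressor(n_estimators=50, random_state=0),
    "GradientBoostingReg": GradientBoostingRegressor(random_state=0),
    "ExtraTreesReg": ExtraTreesRegressor(n_estimators=50, random_state=0),
}

predictions = {}
for name, model in models.items():
    model.fit(X_train, y_train)                 # train on the train split
    predictions[name] = model.predict(X_test)   # predict on the test split
```

Keeping the models in a dictionary makes it easy to loop over them again when computing the metrics for each regressor.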
The following image summarizes the steps of this project. Notice that the last plots show the predicted Close values for each regressor model.
In this image, we can see a comparison between the actual values and the predicted ones for each regressor.
In this image, we can see the residual analysis for each regressor.
In this plot, we can see a summary of the metrics calculated for each regressor model.
As we can observe, the regressor models behave very similarly across all seven metrics. Nevertheless, the Random Forest Regressor is slightly better on five of them (MSE, RMSE, RSE, R and R2), while the Extra Trees Regressor has marginally lower MAE and RAE.
Regressors | MSE | RMSE | MAE | RSE | RAE | R | R2 |
---|---|---|---|---|---|---|---|
RandomForestReg | 9.4459 | 3.0734 | 1.9380 | 3.1298 | 0.3270 | 0.8927 | 0.7737 |
GradientBoostingReg | 9.9440 | 3.1534 | 2.0396 | 3.2113 | 0.3441 | 0.8845 | 0.7617 |
ExtraTreesReg | 10.3015 | 3.2096 | 1.9351 | 3.2685 | 0.3265 | 0.8804 | 0.7532 |
Regression analysis is a powerful tool for understanding relationships between variables and making predictions. By carefully considering the assumptions and properly interpreting the results, regression models can provide valuable insights in many fields, from economics and finance to biology and engineering.
In this project, we explored various metrics to evaluate the performance of regression models. The metrics we focused on were Mean Squared Error (MSE), Root Mean Squared Error (RMSE), Mean Absolute Error (MAE), Relative Squared Error (RSE), Relative Absolute Error (RAE), the Correlation Coefficient (R), and the Coefficient of Determination (R2).
Each metric provides unique insights into the performance of regression models:
- MSE and RMSE are useful for understanding the average magnitude of errors, with RMSE being easier to interpret because it is expressed in the same units as the dependent variable.
- MAE offers a robust measure less sensitive to outliers compared to MSE and RMSE.
- RSE and RAE provide relative measures comparing model performance to a baseline model, with values closer to 0 indicating superior performance.
- R helps assess the strength and direction of the linear relationship between actual and predicted values, with values closer to 1 or -1 indicating stronger linear relationships.
- R2 indicates the proportion of variance in the dependent variable explained by the independent variables, with higher values signifying better model fit.
Evaluating with all of these metrics together gives a fuller understanding of model performance. For example, while a model might exhibit low MSE and RMSE, indicating small average errors, it could still have high RSE or RAE values if the baseline model performs similarly well. Similarly, a high R2 value signifies that a significant portion of the variance is explained by the model, but it doesn’t provide information about the actual size of prediction errors, which metrics like MAE and RMSE do.
In conclusion, employing a diverse set of evaluation metrics provides a holistic view of a regression model’s effectiveness, enabling more informed decisions in model selection and refinement. Computed on a held-out test split, these metrics show not only how well the models fit the data but also how well they generalize to new data, which supports more accurate and reliable predictions in practical applications.