The Winton Stock Market Challenge was a competition hosted by Winton on Kaggle in 2016.
The main task of this competition was to predict the interday and intraday returns of a stock, given the history of the past few days.
NOTE:
To view the final code with the interactive graphs, click here
- Developed a data pre-processing pipeline
- Tuned and trained a multi-output Multi-Layer Perceptron regression model to predict stock returns based on the returns from the past two days and a set of features
In this competition the challenge is to predict the return of a stock, given the history of the past few days.
We provide 5-day windows of time, days D-2, D-1, D, D+1, and D+2. You are given returns in days D-2, D-1, and part of day D, and you are asked to predict the returns in the rest of day D, and in days D+1 and D+2.
During day D, there is intraday return data, which are the returns at different points in the day. We provide 180 minutes of data, from t=1 to t=180. In the training set you are given the full 180 minutes, in the test set just the first 120 minutes are provided.
For each 5-day window, we also provide 25 features, Feature_1 to Feature_25. These may or may not be useful in your prediction.
Each row in the dataset represents an arbitrary stock over an arbitrary 5-day time window.
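A minimal sketch of loading the data and grouping the columns is shown below. The column names follow the competition's naming scheme (Feature_1..Feature_25, Ret_2..Ret_180, and Ret_MinusTwo/Ret_MinusOne, Ret_PlusOne/Ret_PlusTwo for the interday returns); treat these as assumptions if your copy of the data differs.

```python
import pandas as pd

# Load the training data as distributed by Kaggle.
train = pd.read_csv("train.csv")

# Column groups, assuming the competition's naming scheme.
feature_cols = [f"Feature_{i}" for i in range(1, 26)]      # Feature_1 .. Feature_25
intraday_known = [f"Ret_{t}" for t in range(2, 121)]       # minutes known in the test set
intraday_target = [f"Ret_{t}" for t in range(121, 181)]    # minutes to predict on day D
daily_cols = ["Ret_MinusTwo", "Ret_MinusOne", "Ret_PlusOne", "Ret_PlusTwo"]

print(train[feature_cols + daily_cols].head())
```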
- Python
- Pandas
- Numpy
- Matplotlib
- Seaborn
- Plotly
- Scikit-learn
- Principal Component Analysis
- Iterative Imputer
- Random Forest Regressor
- Multi-layer Perceptron Regressor
- Multi Output Regressor
Exploratory Data Analysis is performed to explore the structure of the data, identify categorical and continuous data fields, find missing values, and examine correlations among the different data columns.
Correlation heatmap between different features:
As observed in the correlation heatmap above, many features are strongly correlated with each other. This means it is possible to apply dimensionality reduction methods such as Principal Component Analysis.
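A minimal sketch of how such a heatmap can be produced with Seaborn, reusing `train` and `feature_cols` from the loading sketch above:

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Pairwise correlations across the 25 anonymized features.
corr = train[feature_cols].corr()

plt.figure(figsize=(12, 10))
sns.heatmap(corr, cmap="coolwarm", center=0)
plt.title("Correlation between Feature_1 .. Feature_25")
plt.tight_layout()
plt.show()
```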
Principal component analysis (PCA) is the process of computing the principal components and using them to perform a change of basis on the data, sometimes using only the first few principal components and ignoring the rest.
The optimal number of principal components can be found by inspecting the cumulative explained variance for different numbers of components; the smallest set whose cumulative explained variance is closest to one is taken as the optimal number of principal components.
Here we can observe that the optimal number of components is 12.
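A minimal sketch of this selection step. The mean-fill below is only so PCA can run for illustration; the actual pipeline imputes with IterativeImputer (see the preprocessing sketch further below):

```python
import numpy as np
from sklearn.decomposition import PCA

# Simple mean-fill for illustration; the real pipeline uses IterativeImputer.
X = train[feature_cols].fillna(train[feature_cols].mean())

# Fit PCA with all components and inspect the cumulative explained variance.
pca = PCA().fit(X)
cum_var = np.cumsum(pca.explained_variance_ratio_)
for n, v in enumerate(cum_var, start=1):
    print(f"{n:2d} components -> {v:.3f} cumulative explained variance")

# Keep the 12 components identified above and project the data onto them.
X_reduced = PCA(n_components=12).fit_transform(X)
```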
To simplify the problem, the intraday returns are aggregated into their sum and standard deviation, both for the features (Ret_2 to Ret_120) and for the target labels (Ret_121 to Ret_180).
The standard deviation of the interday returns is also considered, to see how much the returns vary.
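A minimal sketch of this aggregation, reusing the column groups defined earlier; the new column names are hypothetical, chosen only for this example:

```python
# Sum and standard deviation over the known intraday window (features) ...
train["Ret_Intraday_Sum"] = train[intraday_known].sum(axis=1)
train["Ret_Intraday_Std"] = train[intraday_known].std(axis=1)

# ... and over the window that must be predicted (target labels).
train["Ret_Target_Sum"] = train[intraday_target].sum(axis=1)
train["Ret_Target_Std"] = train[intraday_target].std(axis=1)
```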
After imputing missing values and applying Principal Component Analysis to the numerical data columns, the categorical data was transformed into dummy-variable columns using Pandas' get_dummies() function.
The data was split into training (70%) and testing (30%) sets.
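A minimal sketch of the preprocessing and split, where `numeric_cols`, `categorical_cols`, and `target_cols` are placeholders standing in for the column lists identified during EDA:

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split

# Impute missing numerical values, then reduce to 12 principal components.
X_num = IterativeImputer(random_state=0).fit_transform(train[numeric_cols])
X_num = PCA(n_components=12).fit_transform(X_num)

# One-hot encode the categorical columns with get_dummies().
X_cat = pd.get_dummies(train[categorical_cols].astype("category"))

X = np.hstack([X_num, X_cat.to_numpy()])
y = train[target_cols].to_numpy()

# 70/30 train/test split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
```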
I tried two different models (a sketch of fitting both follows this list):
- Random Forest Regressor: as a baseline model
- Multi-Layer Perceptron Regressor (MLPRegressor): since the features span very different value ranges, I expected a Multi-Layer Perceptron model to be more robust to those variations
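A minimal sketch of fitting both models; the hyperparameters shown are placeholders, not the values used in the notebook:

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.multioutput import MultiOutputRegressor

# Baseline: random forests handle multi-output targets natively.
rf = RandomForestRegressor(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

# MLPRegressor supports multi-output targets as well; wrapping it in
# MultiOutputRegressor instead fits one network per target column.
mlp = MultiOutputRegressor(MLPRegressor(max_iter=500, random_state=42))
mlp.fit(X_train, y_train)
```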
As seen in the graphs above, the prediction lines for the Random Forest Regressor are mostly flat with a few sparse peaks, whereas the Multi-Layer Perceptron Regressor shows considerably better results. Thus only the Multi-Layer Perceptron Regressor underwent hyperparameter tuning. Grid Search Cross-Validation was used to fine-tune the regression model, and the best model obtained after hyperparameter tuning was used for the final evaluation.
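A minimal sketch of the tuning step; the parameter grid below is illustrative, not the grid actually searched in the notebook:

```python
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPRegressor

param_grid = {
    "hidden_layer_sizes": [(50,), (100,), (100, 50)],
    "alpha": [1e-4, 1e-3, 1e-2],
    "learning_rate_init": [1e-3, 1e-2],
}

search = GridSearchCV(
    MLPRegressor(max_iter=500, random_state=42),
    param_grid,
    scoring="neg_mean_absolute_error",
    cv=5,
    n_jobs=-1,
)
search.fit(X_train, y_train)
print(search.best_params_)
```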
Mean Absolute Error (MAE) is used as the performance metric for evaluating the regression model. MAE is easy to interpret and provides a clear view of the model's performance. The Mean Absolute Error of the final model is 0.01366.
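Evaluating the tuned model on the held-out 30% split, continuing from the tuning sketch above:

```python
from sklearn.metrics import mean_absolute_error

# Score the best estimator found by the grid search on the test split.
y_pred = search.best_estimator_.predict(X_test)
print("MAE:", mean_absolute_error(y_test, y_pred))
```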