This is a machine learning problem. The dataset consists of the timestamps (with no fixed frequency), the target (Nest Weight) and the exogenous variables (Temperature, Humidity, Luminosity and so on).

I started with a statistical analysis of the variables, first ignoring time and then taking it into account. I removed only the evident outliers rather than all of them, because these are real data from real sensors (so anything but perfect). I then studied the time series with an additive seasonal decomposition, autocorrelation plots, the ADF test and lag plots. All variables turned out to be stationary except Nest Weight, which becomes stationary after one differencing.

I then moved on to the machine learning part, using two models: a boosted-tree model (with the 'XGBoost' library) and SARIMAX.

Starting with the boosted trees: the problem with this model is that you cannot train on one set of variables and then test on fewer (that is, train with the exogenous variables and test with the timestamps only). So the first model uses as predictors only variables derived from the timestamps. As target I used not the Nest Weight variable itself but the differenced series, because in time series forecasting it is in our best interest to work with a stationary series. I trained the model with hyperparameter tuning and cross-validation (3-fold, because the dataset is quite small for a machine learning problem). Having selected the best model, I ran the cross-validation again to inspect the results, transforming the predictions of the differenced series back into predictions of the original series. They are quite good, using MSE as the metric. The second validation fold contains one of the two unexplained drops, which the model cannot predict; this increases the MSE, but the other two folds give good results.

At this point the question is whether the exogenous variables would improve the results if they were also available for the test set.
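As a minimal sketch of the timestamp-only setup described above, the snippet below builds predictors from the timestamps, differences the target, and turns predicted differences back into weights. The function names (`time_features`, `undifference`), the chosen calendar features and the toy values are assumptions for illustration; they are not taken from the repository.

```python
import numpy as np
import pandas as pd


def time_features(index: pd.DatetimeIndex) -> pd.DataFrame:
    """Build predictors from the timestamps only (hypothetical feature set)."""
    return pd.DataFrame(
        {
            "hour": index.hour,
            "minute": index.minute,
            "dayofweek": index.dayofweek,
            "dayofyear": index.dayofyear,
        },
        index=index,
    )


def undifference(last_observed: float, predicted_diffs: np.ndarray) -> np.ndarray:
    """Turn predicted first differences back into levels of the original series."""
    return last_observed + np.cumsum(predicted_diffs)


# Toy data standing in for the Nest Weight series (made-up values).
index = pd.date_range("2023-01-01", periods=6, freq="15min")
weight = pd.Series([10.0, 10.5, 10.4, 11.0, 11.2, 11.1], index=index)

# The model target is the differenced series; the first value is lost.
diffs = weight.diff().dropna()
X = time_features(diffs.index)

# With perfect predictions of the differences, the original levels are recovered.
rebuilt = undifference(weight.iloc[0], diffs.to_numpy())
```

In practice `X` and `diffs` would be passed to the regressor, and `undifference` would be applied to its predictions starting from the last weight observed before the validation fold.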
To check whether the exogenous variables help, I first trained a model to predict them for part of the dataset, and then ran the same model as before, with tuning and cross-validation, on this new dataset. Again, having selected the best model, I plotted the cross-validation results, which were slightly better than in the previous analysis. Once more the model could not predict the drop, so the average MSE is quite high. I predicted the exogenous variables because they are stationary and therefore easy to predict.

The SARIMAX model was used because we were asked to use it. It was trained without exogenous variables, and the target was again the differenced series. Unlike most time series, our dataset has no fixed frequency. For the boosted trees this is not a problem, but for SARIMAX it is. So I used a simple resampling with linear interpolation to generate a dataset with a 15-minute frequency (the frequency the dataset was meant to have). I then used tuning with cross-validation to select the best model. Unfortunately, although the resulting MSE is lower, the predictions were a flat line (so the model did not capture the trend and seasonality of the series). Evidently, even a model as elaborate as SARIMAX returns a flat line when that is enough to make the MSE so low.
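The resampling step used for SARIMAX can be sketched roughly as follows. This assumes the series is a pandas `Series` with a `DatetimeIndex`; the helper name `to_regular_grid`, the use of the mean within each 15-minute bin and the toy values are assumptions, since the text only says that resampling and linear interpolation were used.

```python
import pandas as pd


def to_regular_grid(series: pd.Series, freq: str = "15min") -> pd.Series:
    """Put an irregularly sampled series on a fixed grid, filling gaps linearly."""
    # Average the observations falling in each bin; empty bins become NaN.
    regular = series.resample(freq).mean()
    # Fill the empty bins by linear interpolation between neighbouring bins.
    return regular.interpolate(method="linear")


# Toy irregular readings (made-up values), with no observation in the 00:30 bin.
stamps = pd.to_datetime(
    ["2023-01-01 00:02", "2023-01-01 00:16", "2023-01-01 00:47", "2023-01-01 01:01"]
)
raw = pd.Series([10.0, 10.4, 11.0, 11.2], index=stamps)

regular = to_regular_grid(raw)
print(regular)
```

The result has one value every 15 minutes, which is the kind of regularly spaced index SARIMAX expects.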
Code for a university exam. The dataset concerns an insect nest and the monitoring of its weight. Several environmental measurements are available as variables, and the focus is on nowcasting the weight using them as exogenous variables. The models used were XGBoost and SARIMAX, with a comparison of the results at the end.
gianlucatut16/Time-series-Analysis