The purpose of this project is to design and train an ANN capable of detecting changes in time series, more specifically in Internet RTT measurements.
There are three datasets included in this repo: training (artificial), validation (artificial), and validation (real RTT). The artificial RTT trace generator comes from a previous project, with a bit of tuning.
The models are described in model.py. They can be trained on the training (artificial) dataset with cpt_train.py, and evaluated against the validation (artificial) and validation (real RTT) datasets with cpt_eval.py.
*.json and *.h5 files are trained models. *.png files with the same file name prefix show the model structure. *_learning_curve.pdf plots the learning curve recorded while training these models.
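For example, one of the stored models can be restored along the following lines (the cpt_model_1 prefix is inferred from the learning-curve file name and is illustrative):

```python
from keras.models import model_from_json

# the file prefix is illustrative; substitute the model you want to restore
with open('cpt_model_1.json') as f:
    model = model_from_json(f.read())
model.load_weights('cpt_model_1.h5')
```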
All data loading, formatting, and processing functions can be found in data.py.
At the beginning, online detection was intended. The sequential input and output behaviour of LSTM would be a wonderful enabler. The ideal would be to delay the output by a certain number of lags/steps relative to the input sequence, so that the model could leverage the current inputs to tell whether a change happened in the near past. This is quite similar to the many-to-many layout specified in this article. However, it doesn't seem easily achievable with Keras, according to this post.
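For illustration only, here is what the label side of that abandoned online layout could have looked like, with a hypothetical delay_labels helper and a fixed lag:

```python
import numpy as np

def delay_labels(y, lag):
    # Shift the 0/1 change labels `lag` steps forward, so that the target
    # at time t flags a change that occurred around time t - lag.
    # Illustrative only; this online design was not pursued.
    y = np.asarray(y)
    if lag == 0:
        return y.copy()
    delayed = np.zeros_like(y)
    delayed[lag:] = y[:-lag]
    return delayed
```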
So that the model can be compared in a fair manner to cpt.np, a non-parametric Bayesian method, the main design objectives (apart from change detection) are the following:
- offline detection that digests the entire input sequence.
- capable of handling input sequences of arbitrary length.
The above figure outlines the structure of model_1 described in model.py.
It has two inputs. input_seq takes the input data as a sequence, while input_array takes the same data as an array in one go. The input sequence length is hardcoded to 100.
It has two outputs. aux_out indicates that the input sequence experienced at least one change when its value draws close to 1. main_out has the same size as the input sequence. It colors each datapoint with 1 or 0. Changepoints can then be deduced from runs of datapoints of the same color, i.e. segments. For example, if main_out gives [0.01, 0.02, 0.01, 0.99, 0.98, 0.99, 0.03, 0.1, 0.02], changes probably happened at [3,6].
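That deduction step can be sketched as follows, assuming a 0.5 binarization threshold:

```python
import numpy as np

def colors_to_changepoints(main_out, threshold=0.5):
    # Binarize the per-datapoint coloring, then report the first index of
    # each new segment as a changepoint.
    colors = (np.asarray(main_out) > threshold).astype(int)
    return list(np.flatnonzero(np.diff(colors)) + 1)

colors_to_changepoints([0.01, 0.02, 0.01, 0.99, 0.98, 0.99, 0.03, 0.1, 0.02])
# -> [3, 6]
```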
The detailed data processing procedure can be found in data.py.
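Purely as an illustration of the two-input/two-output layout described above, here is a sketch in the Keras functional API; the layer types and sizes are assumptions, the real architecture is in model.py and the *.png files:

```python
from keras.layers import Input, LSTM, Dense, TimeDistributed
from keras.models import Model

# layer choices are illustrative, not the actual model_1 architecture
input_seq = Input(shape=(100, 1), name='input_seq')    # data fed as a sequence
input_array = Input(shape=(100,), name='input_array')  # same data in one go

seq_features = LSTM(64, return_sequences=True)(input_seq)
array_features = Dense(32, activation='relu')(input_array)

# scalar probability that the sequence contains at least one change
aux_out = Dense(1, activation='sigmoid', name='aux_out')(array_features)
# one coloring probability per datapoint
main_out = TimeDistributed(Dense(1, activation='sigmoid'),
                           name='main_out')(seq_features)

model = Model(inputs=[input_seq, input_array], outputs=[aux_out, main_out])
```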
When using this model to process sequences of varying length, the following steps, implemented in changedetectRNN.py, are taken (a condensed sketch follows the list):
- if the input length is smaller than 100, pad the sequence with its last value; otherwise cut it into chunks of length 100;
- for each chunk, subtract max(0, min(chunk));
- run the detection on each chunk individually;
- if aux_out indicates a change, convert main_out to changepoints;
- concatenate the changepoints from all chunks.
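A condensed sketch of this pipeline, reusing colors_to_changepoints from above; the input shapes and the output order of model.predict are assumptions, see changedetectRNN.py for the actual implementation:

```python
import numpy as np

CHUNK_LEN = 100  # the input length hardcoded in the model

def detect(seq, model, aux_threshold=0.5):
    seq = np.asarray(seq, dtype=float)
    # pad short inputs by repeating the last value
    if len(seq) < CHUNK_LEN:
        seq = np.pad(seq, (0, CHUNK_LEN - len(seq)), mode='edge')
    changepoints = []
    for offset in range(0, len(seq) - CHUNK_LEN + 1, CHUNK_LEN):
        chunk = seq[offset:offset + CHUNK_LEN]
        chunk = chunk - max(0.0, chunk.min())  # per-chunk normalisation
        aux_out, main_out = model.predict(
            [chunk.reshape(1, CHUNK_LEN, 1), chunk.reshape(1, CHUNK_LEN)])
        if aux_out.ravel()[0] > aux_threshold:  # this chunk saw a change
            changepoints.extend(
                offset + c for c in colors_to_changepoints(main_out.ravel()))
    return changepoints
```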
The model is trained on 5000 artificially generated sequences, each of length 100.
The generator can be found here. In order to produce relatively short sequences, the lower bound of the stage length is changed: 10 instead of 50 is used to generate both the training and validation datasets in this study.
Some stats regarding the training dataset:
- 3557 out of 5000 sequences experience at least one change;
- for sequences with changes, on average 39.6% of the datapoints are expected to be colored 1;
- with all sequences mixed, 28.1% of the datapoints are expected to be colored 1.
The above numbers suggest that the training data are relatively well balanced.
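These numbers can be recomputed from the labels with something like the following, assuming an all-zero coloring means the sequence had no change:

```python
import numpy as np

def label_balance(y_main):
    # y_main: (n_sequences, seq_len) array of 0/1 colorings
    has_change = y_main.max(axis=1) > 0
    print('sequences with a change: %d / %d' % (has_change.sum(), len(y_main)))
    print('colored 1 (changed only): %.1f%%' % (100 * y_main[has_change].mean()))
    print('colored 1 (all sequences): %.1f%%' % (100 * y_main.mean()))
```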
At the end of 200 epochs, the following result is reached:
`loss: 0.3210 - aux_out_loss: 0.0168 - main_out_loss: 0.3193 - val_loss: 0.3420 - val_aux_out_loss: 0.0275 - val_main_out_loss: 0.339`
Binary cross-entropy was used as the loss function for both outputs.
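In Keras terms the setup would look roughly as follows; the optimizer and loss weights here are assumptions (the totals above hint that aux_out carries a much smaller weight, around 0.1), and the actual call is in cpt_train.py:

```python
# illustrative compile call, not the exact one used for training
model.compile(optimizer='adam',
              loss={'aux_out': 'binary_crossentropy',
                    'main_out': 'binary_crossentropy'},
              loss_weights={'aux_out': 0.1, 'main_out': 1.0},
              metrics=['binary_accuracy'])
```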
aux_out is somewhat OK, with slight overfitting. main_out is still far from ideal and dominates the total cost. The entire learning curve can be found in cpt_model_1_learning_curve.pdf.
The model is first validated against the validation dataset, which also contains 5000 sequences of length 100. These sequences are generated independently of the training dataset, following the same parameters.
The following performance is achieved:
| output | binary cross-entropy | binary accuracy |
| --- | --- | --- |
| aux_y | 0.031 | 0.994 |
| main_y | 0.348 | 0.800 |
The performance on aux_y is somewhat satisfying, yet there is plenty of room for improvement regarding main_y.
Then the model is applied to real, labelled RTT timeseries, real_trace_labelled. The dataset comes from a previous project. The detection result is compared to cpt.np in terms of precision and recall. These two metrics are calculated on a per-sequence basis. Precision is the percentage of detected changepoints that are actually relevant. Recall is the percentage of real changepoints that are successfully detected.
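A minimal sketch of these per-sequence metrics, assuming a detected changepoint counts as relevant when it falls within a small tolerance window of a real one (both the tolerance value and the matching rule are assumptions; the actual evaluation is in cpt_eval.py):

```python
import numpy as np

def precision_recall(detected, actual, tolerance=5):
    # A detection matches a real changepoint when they are at most
    # `tolerance` datapoints apart (assumed matching rule).
    detected, actual = np.asarray(detected), np.asarray(actual)
    if len(detected) == 0 or len(actual) == 0:
        return 0.0, 0.0
    close = np.abs(detected[:, None] - actual[None, :]) <= tolerance
    precision = close.any(axis=1).mean()  # detections that are relevant
    recall = close.any(axis=0).mean()     # real changepoints that were found
    return precision, recall
```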
Precision and recall are only meaningful on the real RTT dataset, since the timeseries there are relatively long, ~5000 datapoints, with ~40 changepoints each. Alternatively, we could also generate relatively long artificial sequences for an evaluation based on these two metrics (TODO).
Red dots are the results of cpt.np, while green triangles are from our model. In short, the current model is still far from satisfying compared to the state-of-the-art method, especially in terms of precision.
The conversion from segment colors to changepoints might have further magnified the errors. Alternatively, the segment-color data representation itself is questionable, as it dilutes the attention paid to the edges of colored segments.
TODO