This repository contains data preprocessing and standard EDA tools for multivariate time-series data (panel data), so that you can get started quickly with EDA on any new dataset you receive. It also optionally provides a framework to analyze any meta-data files associated with the time-series data (such as raw materials used in each batch, patient demographics, etc.).
The usage of the tool is very simple.
- Place your data in the `./data` folder.
- Open `mv_ts_EDA.ipynb` and edit the `config` dict at the top.
- Press RunAll for sequential execution of all the analysis.
The following steps are performed in an end-to-end run of the notebook:
1. **Read Data**: Reads the data (sequential + static) and displays basic statistics.
2. **Missing Value Treatment**: Checks for NaNs and treats them. Two treatment options are available:
   - 2.1 Drop (`drop`): Drops all rows with NaNs. This is suitable when the percentage of NaNs is low; otherwise, the dataset may shrink significantly.
   - 2.2 Impute (`impute`): Imputes the NaNs using either the median or the mean.
3. **Duplicates Removal**: If any duplicate rows exist, they are dropped.
4. **Outlier Detection**: Identifies outliers in the target variable and removes all data related to those identifiers (i.e., from both static and sequential data). Two detection methods are provided:
   - 4.1 IQR-based Removal (`iqr`): Removes outliers based on the inter-quartile range. To Do: Add an option to change IQR limits (currently fixed at 99%).
   - 4.2 Z-score-based Removal (`z-score`): Removes outliers based on the Z-score method, where values beyond a specified threshold (e.g., |Z| > 3) are considered outliers. To Do: Add an option to adjust the Z-score threshold.
5. **Sequence Length Distribution**: Since this is panel data, it is useful to check the variable sequence lengths (min, max and median) in order to decide on a reasonable `max_seq_len` for padding later.
6. **Time-series Decomposition**: Decomposes the time-series variables into trend, seasonality and residuals to better understand the nature of the underlying data (for example, whether most series are stationary) and to drive modeling decisions accordingly.
7. **Univariate Analysis**: Explores the univariate properties of the time-series variables to understand their distributions and spread of values. It also helps to understand the different scales of each variable. Currently, this can be done using two methods:
   - 7.1 Kernel Density Estimates (KDEs): Visualizes the probability density function of numerical variables, providing insight into their distribution and smooth variations.
   - 7.2 Box plots: Displays the spread, central tendency, and potential outliers of numerical variables, highlighting differences in scale and variability.
8. **Bi-variate Correlations**: Computes the pairwise Pearson correlation coefficient between variables to reveal positive/negative and strong/weak correlations among the variables, or between the variables and the target variable.
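The outlier step (step 4) involves two operations: flagging identifiers by their target value, then removing those identifiers from both tables. The sketch below illustrates one way this could look with pandas. It is not the notebook's own code: the function names `find_outlier_ids` and `drop_ids`, the column names, and the toy data are hypothetical, the target is assumed to live in the static table with one row per identifier, and the IQR branch uses the conventional 1.5 × IQR fences rather than the notebook's fixed limits.

```python
import pandas as pd


def find_outlier_ids(static_df, id_col, target_col, method="iqr",
                     iqr_k=1.5, z_thresh=3.0):
    """Return the set of identifiers whose target value is flagged as an outlier.

    Hypothetical helper: assumes one row per identifier in static_df.
    """
    target = static_df[target_col]
    if method == "iqr":
        # Conventional Tukey fences: [Q1 - k*IQR, Q3 + k*IQR]
        q1 = target.quantile(0.25)
        q3 = target.quantile(0.75)
        iqr = q3 - q1
        mask = (target < q1 - iqr_k * iqr) | (target > q3 + iqr_k * iqr)
    elif method == "z-score":
        # Population standard deviation; flag |Z| above the threshold
        z = (target - target.mean()) / target.std(ddof=0)
        mask = z.abs() > z_thresh
    else:
        raise ValueError(f"unknown method: {method!r}")
    return set(static_df.loc[mask, id_col].tolist())


def drop_ids(static_df, seq_df, id_col, ids):
    """Remove every row belonging to the given identifiers from both tables."""
    clean_static = static_df[~static_df[id_col].isin(ids)].reset_index(drop=True)
    clean_seq = seq_df[~seq_df[id_col].isin(ids)].reset_index(drop=True)
    return clean_static, clean_seq


# Toy panel data (made up for illustration): six identifiers, one extreme target
static_df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5, 6],
    "target": [10, 11, 12, 11, 10, 100],
})
seq_df = pd.DataFrame({
    "id": [1, 1, 2, 2, 2, 3, 4, 4, 5, 6, 6, 6],
    "value": range(12),
})

outlier_ids = find_outlier_ids(static_df, "id", "target", method="iqr")
clean_static, clean_seq = drop_ids(static_df, seq_df, "id", outlier_ids)
print(outlier_ids, len(clean_static), len(clean_seq))
```

Removing by identifier rather than by row keeps the static and sequential tables aligned, so no sequence is left without its static record. Note that with very few identifiers a Z-score threshold of 3 may never trigger, because the largest attainable |Z| is bounded by the sample size.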
[NOTE]: If static meta-data is also available along with the time-series variables, the provided code can also check the cross-correlation between the meta-data variables and the time-series variables. This helps to understand how the conditional variables impact the subsequent time-series patterns (for instance, how patient demographics affect their medical journeys).
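One simple way to relate static meta-data to time-series variables is to collapse each sequence to a per-identifier summary and correlate that summary with the static columns. The sketch below does this with a per-identifier mean and Pearson correlation. It is an illustration under assumptions, not necessarily how the notebook computes it: the function name `static_vs_series_corr`, the column names, and the toy data are all hypothetical.

```python
import pandas as pd


def static_vs_series_corr(static_df, seq_df, id_col, static_cols, series_cols):
    """Pearson correlation between static columns and per-id means of series columns.

    Hypothetical helper: summarizing each sequence by its mean is an assumption;
    other summaries (median, last value, slope) may suit a given dataset better.
    """
    # One row per identifier: the mean of each time-series variable
    seq_summary = seq_df.groupby(id_col)[series_cols].mean()
    # Align static meta-data with the summaries on the identifier
    merged = static_df.set_index(id_col)[static_cols].join(seq_summary, how="inner")
    # Keep only the static-vs-series block of the full correlation matrix
    return merged.corr(method="pearson").loc[static_cols, series_cols]


# Toy data (made up for illustration): four patients with variable-length sequences
static_df = pd.DataFrame({"id": [1, 2, 3, 4], "age": [30, 40, 50, 60]})
seq_df = pd.DataFrame({
    "id": [1, 1, 1, 2, 2, 3, 3, 4, 4],
    "heart_rate": [60, 61, 62, 64, 66, 68, 70, 72, 74],
    "activity": [8, 8, 8, 6, 6, 4, 4, 2, 2],
})

corr = static_vs_series_corr(
    static_df, seq_df, "id",
    static_cols=["age"],
    series_cols=["heart_rate", "activity"],
)
print(corr)
```

The result has one row per static variable and one column per time-series variable, which makes it straightforward to render as a heatmap.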