By: Adam Li
This is a general data mining procedure that can be applied to any dataset.
- format of data
- NaNs/infs in the data?
- bounds of data
- which features are categorical?
- labels of data?
- are there missing data elements? If so, for which subjects or datasets?
- marginal histograms of data features to understand distribution
- heatmap of features
- scatterplots/paired scatter plots
- dimensionality reduction -> PCA and scree plots
- scale data / normalize data
- transformations (log, log-normalized, square root, square root normalized)
- covariance matrix between features
- correlation matrix between features
- test if distributions differ from cluster to cluster -> use the two-sample Kolmogorov-Smirnov (KS) test, which compares entire empirical distributions rather than only the median
- test if means are different -> Hotelling's T-squared test (multivariate analogue of the t-test)
- cluster using k-means and plot BIC/DIC/ARI against the number of clusters -> check using the covariance matrix and a scatter plot of the cluster points
- define a loss/objective function
- feature selection using forward/backward stepwise selection
- random forest
- logistic regression
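
The data-quality items near the top of the list (NaNs/infs, bounds, missing elements per subject) could be sketched roughly as below. This is a minimal sketch, assuming the data fits in a numeric NumPy array with one row per subject and one column per feature; the helper names `quality_report` and `rows_with_missing` and the toy array are made up for illustration.

```python
import numpy as np


def quality_report(X, names):
    """Per-feature counts of NaN/inf values plus the finite min and max."""
    X = np.asarray(X, dtype=float)
    report = {}
    for j, name in enumerate(names):
        col = X[:, j]
        finite = col[np.isfinite(col)]  # drop NaN and +/-inf before taking bounds
        report[name] = {
            "n_nan": int(np.isnan(col).sum()),
            "n_inf": int(np.isinf(col).sum()),
            "min": float(finite.min()) if finite.size else None,
            "max": float(finite.max()) if finite.size else None,
        }
    return report


def rows_with_missing(X):
    """Indices of rows (e.g. subjects) containing any non-finite entry."""
    X = np.asarray(X, dtype=float)
    return np.flatnonzero(~np.isfinite(X).all(axis=1)).tolist()


# Toy data, invented for this example: 4 subjects, 2 features
X_demo = np.array([[1.0, 10.0], [np.nan, 20.0], [3.0, np.inf], [4.0, 40.0]])
demo_report = quality_report(X_demo, ["a", "b"])
demo_missing = rows_with_missing(X_demo)
```

Here `demo_missing` identifies which subjects have problems, which speaks to the "which subjects, or datasets?" question in the list.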
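
For the scaling, covariance/correlation, and PCA items, one possible sketch uses plain NumPy. It assumes z-score standardization and computes PCA through an SVD of the standardized data; the helper names and the synthetic data are assumptions for illustration, not a prescribed method.

```python
import numpy as np


def standardize(X):
    """Center each column to mean 0 and scale to unit sample standard deviation."""
    X = np.asarray(X, dtype=float)
    return (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)


def explained_variance_ratio(X):
    """Fraction of total variance carried by each principal component."""
    Z = standardize(X)
    # Singular values of the standardized data give the PCA spectrum
    s = np.linalg.svd(Z, compute_uv=False)
    var = s**2 / (Z.shape[0] - 1)
    return var / var.sum()


# Synthetic data, invented for this example: feature 1 is nearly a
# multiple of feature 0, feature 2 is independent noise
rng = np.random.default_rng(0)
base = rng.normal(size=(200, 1))
X_demo = np.hstack([
    base,
    2 * base + 0.1 * rng.normal(size=(200, 1)),
    rng.normal(size=(200, 1)),
])

cov = np.cov(X_demo, rowvar=False)        # covariance matrix between features
corr = np.corrcoef(X_demo, rowvar=False)  # correlation matrix between features
ratios = explained_variance_ratio(X_demo) # values to plot in a scree plot
```

Plotting `ratios` against the component index gives the scree plot; the correlation matrix can be passed directly to a heatmap routine.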
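
The cluster-to-cluster distribution check could look something like the following, using `scipy.stats.ks_2samp` on a single feature. The samples are synthetic and the cluster names are hypothetical; in practice the inputs would be one feature's values within each cluster. Note that the KS test is univariate, so this sketch would be run per feature.

```python
import numpy as np
from scipy.stats import ks_2samp

# Synthetic one-dimensional samples, invented for this example
rng = np.random.default_rng(0)
cluster_a = rng.normal(loc=0.0, scale=1.0, size=300)
cluster_b = rng.normal(loc=3.0, scale=1.0, size=300)  # shifted distribution
same_as_a = rng.normal(loc=0.0, scale=1.0, size=300)  # same distribution as cluster_a

# The KS statistic is the largest gap between the two empirical CDFs
stat_diff, p_diff = ks_2samp(cluster_a, cluster_b)
stat_same, p_same = ks_2samp(cluster_a, same_as_a)

print(f"a vs b:      D={stat_diff:.3f}, p={p_diff:.3g}")
print(f"a vs same_a: D={stat_same:.3f}, p={p_same:.3g}")
```

A small p-value suggests the two samples come from different distributions; a large one only means the test found no evidence of a difference.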
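
For the k-means item, a rough sketch with scikit-learn is shown below. It sweeps the number of clusters and records the adjusted Rand index (ARI) against known labels. Two caveats: ARI requires reference labels, which are only available here because the data is synthetic, and the within-cluster sum of squares (`inertia_`) is swapped in as a simple elbow criterion in place of BIC/DIC, which are not computed in this sketch.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

# Two well-separated synthetic blobs, invented for this example
rng = np.random.default_rng(0)
blob_a = rng.normal(loc=[0.0, 0.0], scale=0.5, size=(100, 2))
blob_b = rng.normal(loc=[10.0, 10.0], scale=0.5, size=(100, 2))
X_demo = np.vstack([blob_a, blob_b])
true_labels = np.array([0] * 100 + [1] * 100)

inertia_by_k = {}
ari_by_k = {}
for k in range(1, 5):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_demo)
    # Within-cluster sum of squares; look for an elbow as k grows
    inertia_by_k[k] = float(model.inertia_)
    # Agreement between the clustering and the reference labels
    ari_by_k[k] = float(adjusted_rand_score(true_labels, model.labels_))
```

Plotting `inertia_by_k` and `ari_by_k` against `k`, alongside a scatter plot colored by `model.labels_`, covers the visual checks mentioned in the list.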