Data Mining Procedure

By: Adam Li

This is a general data mining procedure that can be applied to any dataset.

1. Descriptive

  • format of data
  • NaNs/infs in the data?
  • number of features
  • number of samples

  • bounds of data
  • categorical features?
  • labels of data?
  • are there missing data elements? If so, for which subjects or datasets?
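
The descriptive checks above can be sketched as one summary function. This is a minimal sketch, assuming the dataset fits in a pandas `DataFrame`; the function name `describe_dataset` and the toy columns are hypothetical, chosen only for illustration.

```python
import numpy as np
import pandas as pd

def describe_dataset(df: pd.DataFrame) -> dict:
    """Collect the descriptive checks from step 1 for a tabular dataset."""
    numeric = df.select_dtypes(include=[np.number])
    values = numeric.to_numpy(dtype=float)
    return {
        "n_samples": df.shape[0],
        "n_features": df.shape[1],
        "dtypes": df.dtypes.astype(str).to_dict(),       # format of data
        "n_nan": df.isna().sum().to_dict(),              # NaNs per column
        "n_inf": int(np.isinf(values).sum()),            # infs in numeric columns
        "bounds": {c: (numeric[c].min(), numeric[c].max())
                   for c in numeric.columns},            # min/max per feature
        "categorical": [c for c in df.columns if c not in numeric.columns],
        "rows_with_missing": df.index[df.isna().any(axis=1)].tolist(),
    }

# Toy data (hypothetical): one NaN, one inf, one categorical column
toy = pd.DataFrame({
    "height": [1.70, np.nan, 1.82, 1.65],
    "weight": [65.0, 80.0, np.inf, 55.0],
    "group": ["a", "b", "a", "b"],
})
summary = describe_dataset(toy)
```

Note that pandas does not count `inf` as missing by default, which is why infs are tallied separately here.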

2. Exploratory

  • marginal histograms of data features to understand distribution
  • heatmap of features
  • scatterplots/paired scatter plots
  • dimensionality reduction -> PCA and scree plots
  • scale data / normalize data
  • transformations (log, log-normalized, square root, square root normalized)
  • covariance matrix between features
  • correlation matrix between features
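
Several of the exploratory steps above reduce to a few NumPy calls. The following is a rough sketch on synthetic data (the data and all variable names are assumptions made for illustration); plotting is left out, but `corr` could be drawn as a heatmap and `explained_ratio` as a scree plot with any plotting library.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data: 200 samples, 3 features, the third nearly a copy of the first
x0 = rng.normal(size=200)
x1 = rng.normal(size=200)
X = np.column_stack([x0, x1, x0 + 0.1 * rng.normal(size=200)])

# Scale: zero mean, unit variance per feature (z-score)
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# Transformations for a positive, right-skewed feature
skewed = rng.lognormal(size=200)
log_t = np.log(skewed)
sqrt_t = np.sqrt(skewed)

# Covariance and correlation matrices between features (columns)
cov = np.cov(X, rowvar=False)
corr = np.corrcoef(X, rowvar=False)

# PCA via SVD of the standardized data; the squared singular values give
# the variance captured by each component (the y-axis of a scree plot)
_, s, Vt = np.linalg.svd(Z, full_matrices=False)
explained_variance = s**2 / (Z.shape[0] - 1)
explained_ratio = explained_variance / explained_variance.sum()
scores = Z @ Vt.T  # samples projected onto the principal components
```

Because the third feature is almost a copy of the first, the first principal component should capture well over half of the total variance, which is the kind of redundancy a scree plot is meant to reveal.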

3. Inferential

  • test if distributions are different from cluster to cluster? -> use the two-sample Kolmogorov-Smirnov (KS) test, which compares the full empirical distributions rather than only the median
  • test if means are different -> Hotelling's test (multivariate t-test)
  • cluster using k-means and plot BIC/DIC/ARI -> check using the covariance matrix and a scatter plot of the clustered points
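
The two tests above can be sketched as follows. `scipy.stats.ks_2samp` is the SciPy two-sample KS test; SciPy has no built-in Hotelling's test that I am aware of, so `hotelling_t2` below is a hand-written helper (a hypothetical name) that assumes both clusters share a common covariance matrix. The two synthetic clusters are also assumptions for illustration.

```python
import numpy as np
from scipy import stats

def hotelling_t2(a: np.ndarray, b: np.ndarray):
    """Two-sample Hotelling's T^2 test assuming equal covariance matrices.

    Returns (T^2, F statistic, p-value).
    """
    n1, p = a.shape
    n2 = b.shape[0]
    diff = a.mean(axis=0) - b.mean(axis=0)
    # Pooled covariance estimate across the two samples
    pooled = ((n1 - 1) * np.cov(a, rowvar=False)
              + (n2 - 1) * np.cov(b, rowvar=False)) / (n1 + n2 - 2)
    t2 = (n1 * n2) / (n1 + n2) * diff @ np.linalg.solve(pooled, diff)
    # Under H0 the rescaled statistic follows F(p, n1 + n2 - p - 1)
    f_stat = (n1 + n2 - p - 1) / (p * (n1 + n2 - 2)) * t2
    p_value = stats.f.sf(f_stat, p, n1 + n2 - p - 1)
    return t2, f_stat, p_value

rng = np.random.default_rng(0)
cluster_a = rng.normal(loc=0.0, size=(100, 2))
cluster_b = rng.normal(loc=1.0, size=(100, 2))

# The KS test is univariate, so it is applied one feature at a time
ks_stat, ks_p = stats.ks_2samp(cluster_a[:, 0], cluster_b[:, 0])

# Hotelling's T^2 compares the two mean vectors jointly
t2, f_stat, hot_p = hotelling_t2(cluster_a, cluster_b)
```

With more than a handful of features, running the KS test per feature raises a multiple-comparisons question that this sketch does not address.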

4. Prediction

  • define a loss/objective function
  • feature selection using forward/backward selection
  • random forest
  • logistic regression
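
One possible way to wire the prediction steps together with scikit-learn is sketched below. The synthetic dataset, the choice of log loss as the objective, and the number of features to keep are all assumptions for illustration, not part of the original procedure.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

# Synthetic data (hypothetical): 8 features, 3 of them informative
X, y = make_classification(n_samples=400, n_features=8, n_informative=3,
                           n_redundant=0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Forward selection: greedily add the feature that most improves the
# cross-validated score; direction="backward" removes features instead
selector = SequentialFeatureSelector(
    LogisticRegression(max_iter=1000), n_features_to_select=3,
    direction="forward", cv=5)
selector.fit(X_train, y_train)
selected = selector.get_support(indices=True)

models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
}
losses = {}
for name, model in models.items():
    model.fit(X_train[:, selected], y_train)
    proba = model.predict_proba(X_test[:, selected])
    losses[name] = log_loss(y_test, proba)  # the chosen objective function
```

Feature selection is fit on the training split only, so the held-out loss is not biased by the selection step.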