- Process
- Objectives
- Data Collection
- Data Specification
- Data Correction
- Determine Model Validation Scheme
- Data Exploration
- Establish Baseline Performance and Evaluation Metrics
- Feature Engineering and Model Selection
- Training and Tuning
- Model Evaluation
- Model Interpretation
- Reporting and Deployment
- References
Understand utilities. What are the costs of erroneous predictions? What are the benefits of correct predictions? This should (help) determine your scoring function.
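For example, if a false negative costs several times more than a false positive, the scoring function should reflect that. A minimal sketch in R; the cost values and the 0/1 encoding are illustrative assumptions, not prescriptions:

```r
# Asymmetric-cost score for a binary classifier (illustrative costs only).
cost_fn <- 5   # assumed cost of a false negative (missed positive)
cost_fp <- 1   # assumed cost of a false positive (false alarm)

expected_cost <- function(truth, predicted) {
  fn <- sum(truth == 1 & predicted == 0)
  fp <- sum(truth == 0 & predicted == 1)
  (cost_fn * fn + cost_fp * fp) / length(truth)
}
# Lower is better; use this to compare candidate models and decision thresholds.
```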
Determine constraints. Do you want predictions? Does the model need to be interpretable? What resources do you have?
Define success. How do you know you're done?
If possible, estimate how much data will be needed to satisfactorily meet the objectives. Alternatively (if more data collection is not possible), determine to what extent the data available will meet those objectives. (Sample size calculations).
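As a rough illustration, base R's `power.t.test()` solves for the sample size needed to detect a given effect; the effect size, power, and two-group design below are assumptions made purely for the example:

```r
# Observations per group needed to detect a difference of 0.5 SD
# with 80% power at the 5% significance level (two-sample t-test).
power.t.test(delta = 0.5, sd = 1, sig.level = 0.05, power = 0.80)
```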
Collect the data in raw form, if not provided.
Use statistically optimal ways of collecting data. The mantra is: to get the most from your experiments, reduce the variance (Good 2005). When data collection is expensive, try to do it in the best way possible. This applies to computer simulations, too. Consider active learning if you need to label data.
What does clean data look like? These properties should be part of the data validation process. Anything in the data that deviates from this specification needs to be corrected or otherwise addressed before modeling. Garbage In, Garbage Out.
What are you assuming about the data? These properties are assumptions we make about the data based on what it is supposed to represent (that is, the type and distribution of its corresponding population) or how it was collected (e.g., whether it is an independent, identically distributed sample). These properties need to be addressed as part of the modeling process.
If the data is collected as part of an ongoing process (like with stock prices, say), we need to be careful about drift. Distributions tend to change over time with changing conditions (regime change).
- Binary: Integer or Logical
- Nominal: Factor
- Ordinal: Ordered Factor
- Continuous: Float
- Linear Temporal: Date
- Cyclic Temporal: Factor
- Text: String
See Statistical Data Type [Wikipedia]
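A minimal sketch of declaring these types on a hypothetical data frame `d` (the column names are invented for illustration):

```r
# Map the statistical types above onto R storage types.
d$converted   <- as.logical(d$converted)                     # binary
d$region      <- factor(d$region)                            # nominal
d$severity    <- factor(d$severity,
                        levels = c("low", "medium", "high"),
                        ordered = TRUE)                       # ordinal
d$income      <- as.numeric(d$income)                         # continuous
d$signup_date <- as.Date(d$signup_date, format = "%Y-%m-%d")  # linear temporal
d$weekday     <- factor(weekdays(d$signup_date))              # cyclic temporal
d$comments    <- as.character(d$comments)                     # text
```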
- Feature
- Response
- Identity
- Information
Specify what properties the data must have to be "clean". Domain knowledge is essential here.
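One way to make the specification executable is the `validate` package cited in the references: rules are declared once and every new batch of data is confronted with them. The data frame `d` and the rules themselves are illustrative assumptions:

```r
library(validate)

# Encode the data specification as explicit, checkable rules.
rules <- validator(
  !is.na(id),
  age >= 0,
  age <= 120,
  income >= 0,
  region %in% c("north", "south", "east", "west")
)

cf <- confront(d, rules)   # check the data against the specification
summary(cf)                # which rules fail, and how often
```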
Decide on how to correct erroneous data. Understanding why the data is erroneous is important. Visualization tools can help.
Many errors can be corrected through an automatic application of rules. For others, error localization can be used to remove any uncorrectable fields.
Correct observed data before imputing missing data.
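A minimal base-R sketch of that ordering; the rule (negative ages are data-entry errors) and the median imputation are stand-in assumptions:

```r
# 1. Correct observed-but-wrong values first: here, negative ages are treated
#    as data-entry errors and set to missing rather than left in place.
d$age[d$age < 0] <- NA

# 2. Only then impute the remaining missing values (median as a simple placeholder).
d$age[is.na(d$age)] <- median(d$age, na.rm = TRUE)
```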
Decide on validation procedures (for feature engineering, performance, tuning, benchmarking) and make data splits.
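A minimal base-R sketch of one such scheme, a train/validation/test split plus cross-validation folds within the training portion; the proportions and fold count are arbitrary assumptions:

```r
set.seed(1)                                   # make the splits reproducible
idx <- sample(c("train", "valid", "test"), nrow(d), replace = TRUE,
              prob = c(0.6, 0.2, 0.2))

train <- d[idx == "train", ]
valid <- d[idx == "valid", ]
test  <- d[idx == "test",  ]                  # touch only once, at the very end

# 5-fold cross-validation assignments within the training set, e.g. for tuning.
folds <- sample(rep(1:5, length.out = nrow(train)))
```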
Consider various automated EDA tools. See "The Landscape of R Packages for Automated Exploratory Data Analysis" by Staniak and Biecek.
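For instance, two tools covered by that survey, applied to a data frame `d` (assumed from the earlier steps):

```r
# Quick per-variable summaries in the console.
skimr::skim(d)

# A standalone HTML report of distributions, missingness, and correlations.
DataExplorer::create_report(d)
```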
- Null (Featureless) Model: predict the simple expectation (E[Y]), the best constant prediction under RMSE (see the sketch after this list).
- Best Single Variable Model: the best-scoring of the single-feature models (E[Y | X_i = x_i]), taken over features (X_i).
- Naive Bayes: naive lower bound
- Current Performance: practical lower bound
- Bayes Error Estimates: the best performance achievable on this data set, i.e. an upper bound (try estimating it by resampling kNN)
- Other Complexity Estimates:
see:
- Setting Expectations [Win-Vector blog]
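A minimal base-R sketch of the first two baselines for a regression task; the `train`/`valid` split and the response column `y` are assumptions carried over from the splitting step:

```r
rmse <- function(truth, pred) sqrt(mean((truth - pred)^2))

# Null (featureless) model: predict the training-set mean everywhere.
rmse(valid$y, rep(mean(train$y), nrow(valid)))

# Best single-variable model: one simple lm() per feature, keep the best.
features    <- setdiff(names(train), "y")
single_rmse <- sapply(features, function(f) {
  fit <- lm(reformulate(f, response = "y"), data = train)
  rmse(valid$y, predict(fit, newdata = valid))
})
sort(single_rmse)[1:3]   # the strongest single predictors
```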
Understand the data. Use domain knowledge and visualization.
Understand how your data will interact with your algorithms. Be aware of the following (a few of these checks are sketched in code after the list):
- Factor encodings
- Outliers and robustness assumptions
- Missing data
- Statistical assumptions (e.g., independence, identical distribution, normality, homoskedasticity)
- Sensitivity to scale
- High correlation
- Rank deficiency (linear dependence)
- Multicollinearity (ill conditioning)
- Noninformative features (regularization)
- Feature interactions and nonlinearity
- High dimensionality
- Computational complexity
- Convergence rates (some algorithms require a lot of data to make accurate estimates)
- Sparsity
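A few of these issues can be checked mechanically. A base-R sketch, assuming the `train` data frame and response `y` from the earlier sketches:

```r
# Numeric feature columns only (keep the response `y` out).
num <- train[setdiff(names(train)[sapply(train, is.numeric)], "y")]

# Missing data: how much, and where?
colMeans(is.na(num))

# Highly correlated feature pairs.
cors <- cor(num, use = "pairwise.complete.obs")
which(abs(cors) > 0.9 & upper.tri(cors), arr.ind = TRUE)

# Rank deficiency (linear dependence) in the design matrix.
X <- model.matrix(~ ., data = num)
qr(X)$rank < ncol(X)          # TRUE signals linearly dependent columns

# Ill conditioning (multicollinearity): a large condition number is a warning sign.
kappa(X)
```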
Consider representation learning methods (a PCA sketch follows this list):
- the PCA family: linear, nonlinear, kernel, probabilistic, ICA, FA, categorical, MCA, HOMALS
- Autoencoders (a nonlinear analogue of PCA)
- Response encodings
- Missing value imputation
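A minimal sketch of the simplest member of that family, plain linear PCA on the scaled numeric features (`num` and `valid` are assumptions from the sketches above):

```r
# Principal components of the numeric training features; PCA needs complete
# cases, hence na.omit().
pc <- prcomp(na.omit(num), center = TRUE, scale. = TRUE)

summary(pc)            # variance explained per component
head(pc$x[, 1:2])      # first two component scores, usable as derived features

# Scores for new data reuse the *training* centering and scaling stored in `pc`.
new_scores <- predict(pc, newdata = valid[names(num)])
```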
Data Transforms (a few are sketched in code after this list):
- log transforms
- Box-Cox family
- interactions
- smoothing: splines, kernels
- factor encoding: dummy, response, thermometer, cyclic
- time-series embeddings
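A few of these in base R and MASS; the column names are assumptions carried over from earlier sketches, and Box-Cox additionally assumes a strictly positive response:

```r
# Log transform for a right-skewed, non-negative variable.
train$log_income <- log1p(train$income)

# Box-Cox: profile the transformation parameter lambda for a positive response.
MASS::boxcox(y ~ 1, data = train)

# Factor encoding: model.matrix() expands factors into dummy columns.
head(model.matrix(~ region, data = train))

# Smoothing: a natural cubic spline basis for a continuous predictor.
head(splines::ns(train$age, df = 4))
```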
Feature Selection (a simple filter is sketched after this list):
- Filters
- Wrappers
- Embedded
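A minimal example of the filter approach, screening numeric features by marginal correlation with the response before any modeling (the 0.10 threshold is an arbitrary assumption); wrappers and embedded methods are usually provided by the modeling package itself:

```r
# Filter: rank the numeric features (from `num` above) by absolute correlation with y.
scores <- sapply(names(num), function(f)
  abs(cor(train[[f]], train$y, use = "complete.obs")))

keep <- names(scores)[scores > 0.10]   # arbitrary screening threshold
keep
```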
When you are transforming the data it is important to ask: Is the transformation data-dependent? Does it depend on the features? Does it depend on the response? If so, it ought to be part of a validation procedure. This is important to avoid overfitting. Independent transformations can be applied at will, however.
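Centering and scaling are a simple example of a data-dependent transformation: the parameters must be estimated on the training split only and then re-applied, unchanged, to held-out data. A base-R sketch with an assumed `income` column:

```r
# Estimate the transformation on the training data only...
mu    <- mean(train$income, na.rm = TRUE)
sigma <- sd(train$income,   na.rm = TRUE)

train$income_z <- (train$income - mu) / sigma

# ...and apply the *same* parameters to held-out data; re-estimating them there
# would leak information and bias the validation estimate.
valid$income_z <- (valid$income - mu) / sigma
```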
Consider model aggregation methods: bagging, model averaging, ensembles, SuperLearning. You want a collection of models giving imperfectly correlated predictions. You may be able to reduce hyperparameter optimization and feature selection this way. (Put a sample of models in and let the superlearner select from them.)
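A minimal sketch of the simplest aggregation, an equal-weight average of two imperfectly correlated models (a linear model and an `rpart` regression tree); stacking/SuperLearning would instead learn the weights on held-out predictions. The `train`, `valid`, and `rmse` objects are assumptions from earlier sketches:

```r
library(rpart)

fit_lm   <- lm(y ~ ., data = train)
fit_tree <- rpart(y ~ ., data = train)

pred_lm   <- predict(fit_lm,   newdata = valid)
pred_tree <- predict(fit_tree, newdata = valid)

# Equal-weight model average; often beats either member on its own.
rmse(valid$y, (pred_lm + pred_tree) / 2)
```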
If a particular statistic is of interest, consider Targeted Learning.
- Residuals
- Permutation Tests: Compare to the same model fit on a randomised response. Can help detect overfitting.
- Benchmarking
Examining the residuals is very important. It can help you determine whether your model is well-specified. You want the residuals to look like "white noise". Look at QQ-plots or worm plots.
Compare a parametric model to some non-parametric equivalent. If the parametric model is well-specified, it should outperform the non-parametric model. This is because a parametric model should be able to "leverage its assumptions."
see:
- Is Your Model Going to Work? [Win-Vector blog]
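A base-R sketch of two of these checks, reusing the `train`, `valid`, and `rmse` objects assumed earlier: a permuted-response refit (the real model should beat the permuted one by a wide margin; if it does not, you are likely fitting noise) and a residual QQ-plot:

```r
set.seed(1)

# Permutation check: refit the same model on a shuffled response.
fit_real <- lm(y ~ ., data = train)
fit_perm <- lm(y ~ ., data = transform(train, y = sample(y)))

rmse(valid$y, predict(fit_real, newdata = valid))   # should be clearly lower...
rmse(valid$y, predict(fit_perm, newdata = valid))   # ...than this

# Residual diagnostics: residuals should look like white noise.
res <- valid$y - predict(fit_real, newdata = valid)
qqnorm(res); qqline(res)
```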
On process, see:
- Byrne, C. Development Workflow for Data Scientists. [book]
- Kuhn, M. Applied Predictive Modeling. [book]
- Mount, J., Zumel, N. Practical Data Science with R [book]
- DrWhy. Model Development Process [GitHub]
- CRAN Task View: Project Workflows
- CRAN Task View: Reproducible Research
On infrastructure, see:
- Production Level Deep Learning [GitHub]
- Michelangelo, Scaling Michelangelo, and Data Science Workbench [Uber]
On data validation, see:
- de Jonge, E., & van der Loo, M. Statistical Data Cleaning with Applications in R. [book] and the `validate` R package [GitHub]
On experimental design, see:
- Atkinson, A. C., Donev, A. N., & Tobias, R. D. Optimum experimental designs, with SAS. [book]
- Santner, T., Williams, B., & Notz, W., The design and analysis of computer experiments. [book]
On models generally, see:
- Shalizi, C. R. Data Analysis from an Elementary Point of View. [free book]
- Hastie, T., Tibshirani, R., & Friedman, J., Elements of Statistical Learning. [free book]
- Efron, B., & Hastie, T. Computer Age Statistical Inference. [free book]
- Murphy, K. P. Machine learning: a probabilistic perspective. [book]
- Berk, R. Statistical learning from a regression perspective. [book]
- Mohri, M., Rostamizadeh, A., & Talwalkar, A., Foundations of machine learning. [free book]
On deep learning, see:
On time series, see:
- Hamilton, J. D., Time Series Analysis. [book]
- Sanchez, J., Hidden Markov Models for Time Series - An Introduction Using R. [book]
On feature engineering, see:
- Kuhn, M., & Johnson, K. Feature Engineering and Selection. [free book]
- Koch, I. Analysis of Multivariate and High-Dimensional Data. [book]
- Gifi, A., Nonlinear Multivariate Analysis. [book]; an updated (but incomplete) version: de Leeuw, J., Mair, P., & Groenen, P., Multivariate Analysis with Optimal Scaling. [book]; and the `gifi` R package [CRAN]. PCA-style optimal scaling.
- Harrell, F. E., Regression Modeling Strategies, Chapter 16. [book] and the `acepack` R package [CRAN]. Optimal nonparametric data transforms.
On validation and resampling, see:
- Raschka, S. Evaluation, Model Selection, and Algorithm Selection in Machine Learning. [arXiv]
- Efron, B., & Tibshirani, R., An introduction to the bootstrap. [book]
- Chernick, M. R., Bootstrap methods: a guide for practitioners and researchers. [book]
- Good, P. I., Permutation, parametric and bootstrap tests of hypotheses. [book]
- Lahiri, S. N., Resampling methods for dependent data. [book]
On model interpretation, see:
- Biecek, P., & Burzykowski, T., Predictive Models: Explore, Explain, and Debug. [book] and the DrWhy R packages [GitHub]
On statistics and mathematics, see:
- Jaynes, E. T., Probability theory: the logic of science. [book] A bit sprawling to serve as an introduction, though it has much food for thought on Bayesian probability.
- Casella, G., & Berger, R. L., Statistical Inference. [book]
- Deisenroth, M. P., Faisal, A. A., & Ong, C. S., Mathematics for Machine Learning. [free book] Calculus, linear algebra, probability, optimization, with applications to common models. Quite good.
- Gallier, J., & Quaintance, J., Algebra, Topology, Differential Calculus, and Optimization Theory For Computer Science and Machine Learning. More advanced and more comprehensive. The first author has a number of other relevant textbooks on his website. [free book]
- Pollard, D., A user's guide to measure theoretic probability. A good introduction to measure theory, if you're into that sort of thing. [book]
- Schervish, M. J., Theory of Statistics. Like Casella with measure theory. [book]
On graphics, see:
- Murrell, P. R Graphics. [book]
- Wickham, H. ggplot2. [free book]