Published January 22nd, 2020
Paperback: 372 pages
Publisher: Packt Publishing
Language: English
ISBN: 9781789806311
-
Foreseeing Variable Problems in Building ML Models
- Identifying numerical and categorical variables
- Quantifying missing data
- Determining cardinality in categorical variables
- Pinpointing rare categories in categorical variables
- Identifying a linear relationship
- Identifying normal distributions
- Distinguishing variable distribution
- Highlighting outliers
- Comparing feature magnitude
-
Missing Data Imputation
- Removing observations with missing data
- Performing mean or median imputation
- Implementing mode or frequent category imputation
- Replacing missing values by an arbitrary number
- Capturing missing values in a bespoke category
- Replacing missing values by a value at the end of the distribution
- Implementing random sample imputation
- Adding a missing value indicator variable
- Performing multivariate imputation by chained equations, MICE
- Assembling an imputation pipeline with scikit-learn
- Assembling an imputation pipeline with feature-engine
-
Encoding Categorical Variables
- Creating binary variables through one-hot encoding
- Performing one-hot encoding of frequent categories
- Replacing categories by ordinal numbers
- Replacing categories by counts or frequency of observations
- Encoding with integers in an ordered manner
- Encoding with the mean of the target
- Encoding with the weight of evidence
- Grouping rare or infrequent categories
- Performing binary encoding
- Performing feature hashing
-
Transforming Numerical Variables
- Transforming variables with the logarithm
- Transforming variables with the reciprocal function
- Using square and cube root to transform variables
- Using power transformations on numerical variables
- Performing Box-Cox transformation on numerical variables
- Carrying out Yeo-Johnson transformation on numerical variables
-
Performing Variable Discretization
- Dividing the variable into intervals of equal width
- Sorting the variable values into intervals of equal frequency
- Performing discretization followed by categorical encoding
- Allocating the variable values into arbitrary intervals
- Performing discretization with k-means
- Using decision trees for discretization
-
Working with Outliers
- Trimming outliers from the data set
- Performing Winsorization
- Capping the variable at arbitrary maximum and minimum values
- Performing zero-coding – capping the variable at zero
-
Deriving Features from Dates and Time Variables
- Extracting date and time parts from a datetime variable
- Deriving representations of year and month
- Creating representations of day and week
- Extracting time parts from a time variable
- Capturing elapsed time between datetime variables
- Working with time in different time zones
-
Performing Feature Scaling
- Standardizing the features
- Performing mean normalization
- Scaling to the maximum and minimum values
- Implementing maximum absolute scaling
- Scaling with the median and quantiles
- Scaling to vector unit length
-
Applying Mathematical Computations to Features
- Combining multiple features with statistical operations
- Combining pairs of features with mathematical functions
- Performing polynomial expansion
- Deriving new features with decision trees
- Carrying out Principal Component Analysis
-
Creating Features from Time Series and Transactional Data
- Aggregating transactions with mathematical operations
- Aggregating transactions in a time window
- Determining the number of local maxima and minima
- Deriving time elapsed between time-stamped events
- Creating features from transactions with Featuretools
-
Extracting Features from Text Variables
- Counting characters, words and vocabulary
- Estimating text complexity by counting sentences
- Creating features with bag-of-words and n-grams
- Implementing term frequency-inverse document frequency
- Cleaning and stemming text variables