Table of contents
- Data-Science-for-Business-Technical
- 1 Intro to Data Science
- 2 Managing Data
- 3 Decision Tree
- 4 Fitting and Evaluation of Models
- T5 Evaluation ⭐️
- T6 Linear Classification & Regression
- T7 Feature Engineering & Variable Selection
- T8 Similarity, Neighbors and Clustering
- Topic 9 Modeling Text Data
Data Science Process: CRISP
CRISP-DM: Cross-Industry Standard Process for Data Mining
Phases: Business Understanding, Data Understanding, Data Preparation, Modeling,
Evaluation (RMSE, accuracy, ROC curves), and Deployment
Predictive vs Descriptive Model
- Clustering is a descriptive model, while most DS algos are predictive.
Supervised vs Unsupervised
- Unsupervised learning: Customer segmentation via clustering
- Supervised learning: Predicting Customer Spending
- Classification, regression, time series, etc.
Find the Data
We can use scrapers, parsers, companies’ APIs, and libraries like Beautiful Soup to find and collect data.
Exploring Data
Exploratory Data Analysis (EDA) comes before any modeling
- Summaries, outliers, missing data, correlations (which range from -1 to 1)
Types of Data
Structured, unstructured, time series, text, image, media, spatiotemporal, network-based…
Feature types
- Numeric
  - Numbers, dates
- Categorical/Nominal
  - Multi-valued, numeric labels
  - Ordinal (ordered): e.g., income level
  - Nominal: eye color, gender, pets...
- Binary
- Datestamps
  - use `import datetime`
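A minimal sketch of working with datestamps, using both the standard library and pandas (the date strings and column name are made up for illustration):

```python
import datetime

import pandas as pd

# Parse a single datestamp with the standard library (the format string is an assumption)
d = datetime.datetime.strptime("2024-03-15", "%Y-%m-%d")
print(d.year, d.month, d.day)

# Parse a whole column with pandas and extract date parts
df = pd.DataFrame({"order_date": ["2024-03-15", "2024-04-01"]})  # toy data
df["order_date"] = pd.to_datetime(df["order_date"])
df["year"] = df["order_date"].dt.year
df["month"] = df["order_date"].dt.month
print(df)
```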
Some data operations in python
- rename variables: `.rename`
- slices of data: `.loc`
- summaries of variables: `.info`, `.describe`
- categorical variables: `.value_counts`
- explore data: `.head`, `.tail`
- merge or split attributes: `pd.concat`, `.str.split`
- if the data is too large: `.sample`
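A minimal sketch of these pandas operations on a toy DataFrame (the column names and values are invented for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "cust_id": [1, 2, 3, 4],
    "spend": [120.0, 55.5, 300.2, 80.0],
    "segment": ["A", "B", "A", "C"],
})

df = df.rename(columns={"spend": "total_spend"})   # rename a variable
print(df.loc[df["total_spend"] > 100])             # slice rows by a condition
df.info()                                          # variable summaries
print(df.describe())
print(df["segment"].value_counts())                # categorical variable counts
print(df.head(2), df.tail(2))                      # explore the data
combined = pd.concat([df, df])                     # merge (stack) data
print(combined.sample(3, random_state=0))          # sample if the data is too large
```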
Categorical Variables
- Recode variables as binary (see the sketch after this list)
  - `pandas.get_dummies(drop_first=True)`
  - `sklearn.preprocessing.OneHotEncoder`
- When there are too many categories, we can collapse them into the top levels plus an “other” category
- Outliers should always be considered and inspected to see if they are “real” or some artifact of data collection
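A sketch of both ideas above, collapsing rare categories into “other” and then dummy coding (the column name and the keep-top-2 cutoff are arbitrary choices for illustration):

```python
import pandas as pd

df = pd.DataFrame({"color": ["red", "blue", "red", "green", "teal", "blue", "mauve"]})

# Collapse infrequent categories into "other" (keeping the top 2 here, arbitrarily)
top = df["color"].value_counts().nlargest(2).index
df["color_top"] = df["color"].where(df["color"].isin(top), "other")

# Recode the categorical variable as binary dummy columns
dummies = pd.get_dummies(df["color_top"], prefix="color", drop_first=True)
print(pd.concat([df, dummies], axis=1))
```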
- Python (pandas) uses `NaN` for missing data; use `isnull()` and `notnull()` to detect missing values
- Reasons for missing data
  - data entry errors, non-responses in surveys, system errors, etc.
Why is identifying missing values crucial?
- Data integrity, quality analysis, model performance
Strategies for missing data
- Deletion methods
  - Listwise: remove entire rows where any value is missing, `dropna()`
  - Pairwise: use the available values and ignore the missing ones
- Imputation methods
  - Mean/median/mode imputation, `fillna()`
  - Predictive modeling
    - Use algorithms like regression or KNN to predict and fill missing values
  - Multiple imputation
    - A more advanced technique that imputes values multiple times to account for the uncertainty of missing data
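A sketch of the deletion and simple-imputation strategies on a toy DataFrame (values are made up):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age": [25, np.nan, 40, 31],
    "income": [50_000, 62_000, np.nan, 48_000],
})

print(df.isnull().sum())                 # count missing values per column

listwise = df.dropna()                   # listwise deletion: drop rows with any NaN

imputed = df.copy()                      # simple imputation with fillna()
imputed["age"] = imputed["age"].fillna(imputed["age"].mean())
imputed["income"] = imputed["income"].fillna(imputed["income"].median())
print(listwise, imputed, sep="\n\n")
```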
Plot Type | Pros/Advantages |
---|---|
Bar Chart | Easy to compare different groups; suitable for displaying differences in quantities among categorical data. |
Line Chart | Ideal for showing trends over time; can display multiple series simultaneously; clear visualization of changes and patterns. |
Scatter Plot | Good for visualizing the relationship between two continuous variables; can identify correlations and outliers. |
Histogram | Useful for showing the distribution of a continuous variable; helps in understanding the data distribution. |
Box Plot | Excellent for displaying data distribution based on a five-number summary; useful for detecting outliers. |
Pie Chart | Effective for showing relative proportions of a whole; best with limited categories. |
Heat Map | Good for showing magnitude as color in two dimensions; useful for correlation matrices, geographical data. |
Area Chart | Similar to line charts but emphasizes magnitude; useful for comparing quantities over time. |
Violin Plot | Combines box plot and kernel density plot; offers deep insight into distribution density. |
Bubble Chart | Similar to scatter plot but adds a dimension with bubble size; useful for comparing three data variables. |
Stacked Bar Chart | Shows total and relationships between parts; useful for comparing cumulative effects. |
Boxplots: IQR (interquartile range) = 75th percentile - 25th percentile
Acceptable range = [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR]; data that fall outside of this range are considered outliers.
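A sketch of flagging outliers with the 1.5 * IQR rule (the data values are invented):

```python
import pandas as pd

s = pd.Series([12, 14, 15, 15, 16, 18, 19, 45])   # 45 is a suspicious value

q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = s[(s < lower) | (s > upper)]
print(f"IQR={iqr:.2f}, acceptable range=[{lower:.2f}, {upper:.2f}]")
print(outliers)
```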
Categorical, 3 ways to present contingency tables
- Stacked bar
- Side-by-side bar
- Standardized bar
Categorical and numeric
Side-by-side box plots allow you to compare the distribution of a numeric variable across the levels of a categorical variable.
Scatterplot
A scatterplot is the standard tool to display the relationship between two numeric variables: look for correlations, patterns, and outliers.
A decision tree used for classification is also called a classification tree.
When a leaf has 0% or 100% probability, apply the Laplace correction: p(c) = (n + 1) / (N + 2), where n is the number of class-c examples in the leaf and N is the total number of examples in the leaf.
Good split: “purity” of the branches, “coverage” of data
- The more homogeneous (consistent) a group is, the more pure it is.
- Entropy = -p1 log2(p1) - p2 log2(p2)
- Entropy is maximized when the classes are equally likely
- A decrease in entropy means more information
- The overall entropy of a split is a weighted average of the children's entropies
- Splitting on different features can yield different purity
Information Gain = impurity(parent) - weighted average impurity(children)
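A sketch computing entropy and information gain for one binary split, following the formulas above (the class counts are invented):

```python
import math

def entropy(pos, neg):
    """Entropy of a node with `pos` positive and `neg` negative examples."""
    total = pos + neg
    ent = 0.0
    for count in (pos, neg):
        p = count / total
        if p > 0:                      # define 0 * log2(0) = 0
            ent -= p * math.log2(p)
    return ent

# Parent node: 10 positives, 10 negatives (maximum entropy = 1.0)
parent = entropy(10, 10)

# A candidate split produces two children (counts are made up)
left, n_left = entropy(8, 2), 10
right, n_right = entropy(2, 8), 10
n = n_left + n_right

children = (n_left / n) * left + (n_right / n) * right   # weighted average
info_gain = parent - children
print(f"parent={parent:.3f}, children={children:.3f}, information gain={info_gain:.3f}")
```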
We stop growing a tree when:
- we have perfect purity
- no split adds more information
It is a recursive algorithm that keeps splitting until done.
Every iteration: pick the best split over all current leaf nodes and all possible attributes, using information gain as the measure of improved purity.
Complexity vs. Error
Overfitting: the model is too complex; it fits the training data very well but doesn’t generalize to holdout data because it is "memorizing" the training data.
- The increase of model complexity will decrease the error on training set, but not necessarily on testing/holdout set.
- We can use the complexity tradeoff to find the best size for a tree
Cross-validation
- Split into K folds
  - `from sklearn.model_selection import cross_val_score`
  - `cross_val_score(model, X, Y, scoring="accuracy", cv=5)`
- LOOCV (Leave-One-Out Cross-Validation)
  - Leave out one data point at a time and fit on the rest, N times in total; the best estimate of the true error
  - Can be time consuming
  - `from sklearn.model_selection import LeaveOneOut, cross_val_score`
  - `loo = LeaveOneOut()`
  - `cross_val_score(model, X, Y, scoring="accuracy", cv=loo)`
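A runnable version of the snippets above, using a decision tree on a built-in scikit-learn dataset (the dataset and model choice are arbitrary, for illustration only):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
model = DecisionTreeClassifier(max_depth=3, random_state=0)

# 5-fold cross-validation
scores = cross_val_score(model, X, y, scoring="accuracy", cv=5)
print("5-fold accuracy:", scores.mean())

# LOOCV: one model per data point (can be slow on large datasets)
loo = LeaveOneOut()
loo_scores = cross_val_score(model, X, y, scoring="accuracy", cv=loo)
print("LOOCV accuracy:", loo_scores.mean())
```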
In cases where accurate estimation of the performance is critical, it may be useful to create a third partition of the data
- 60% training, 20% test, 20% validation
- For a classification model, the cutoff used has an impact on the overall error
- False positive (FP): we predict “positive” when it is not true
- False negative (FN): we fail to predict “positive” when we should have
A confusion matrix is a way of presenting the FN and FP rates for a given prediction cutoff
- accuracy = (TP + TN) / n
- Different cutoffs generate different confusion matrices
- Loop through different cutoffs to get different matrices and accuracies, and pick the best cutoff
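A sketch of that loop over cutoffs (the dataset and classifier are placeholders chosen for illustration):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=5000).fit(X_train, y_train)
probs = model.predict_proba(X_test)[:, 1]       # P(positive class)

best_cutoff, best_acc = None, -1.0
for cutoff in np.arange(0.1, 0.95, 0.05):
    preds = (probs >= cutoff).astype(int)       # apply the cutoff
    acc = accuracy_score(y_test, preds)
    if acc > best_acc:
        best_cutoff, best_acc = cutoff, acc

print("best cutoff:", round(best_cutoff, 2), "accuracy:", round(best_acc, 3))
print(confusion_matrix(y_test, (probs >= best_cutoff).astype(int)))
```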
- Accuracy: (TP + TN) / n
- Error rate: (FP + FN) / n
- Precision (exactness): TP / (all predicted positive)
  - What % of the cases we label positive are actually positive?
- Recall: TP / (all actual positive, i.e., the positive base)
  - How many of the positives did we catch?

Worked example (n = 10,000):
- Accuracy: (90 + 9560) / 10000 = 0.965
- Error: 350 / 10000 = 0.035
- Precision: 90 / 230 = 0.39
- Recall: 90 / 300 = 0.30
Base rate: all true cases / all cases
We have more metrics than accuracy to evaluate the model, so we are not misled when the base rate is very high or very low (class imbalance).
ROC: A Receiver Operating Characteristic (ROC) curve is a graph that shows the performance of a binary classification method. It plots TPR vs. FPR.
AUC: "Area under the ROC Curve"
What is ROC and AUC? The ROC Curve is a graphical plot that illustrates the diagnostic ability of a binary classifier system as its discrimination threshold is varied.
What is ROC curve plotting? It is created by plotting the true positive rate (TPR, also known as sensitivity) against the false positive rate (FPR, 1 - specificity) at various threshold settings.
What is TPR vs FPR? Different applications may require different trade-offs between sensitivity (True Positive Rate) and specificity (1 - False Positive Rate). For example, in medical diagnostics, you might prefer a higher sensitivity to ensure all positive cases are identified, even at the cost of more false positives. The ROC curve allows you to visualize these trade-offs and choose a threshold that best meets the specific needs of the application.
What is AUC? The Area Under the Curve (AUC) is a performance metric for binary classification models. This metric provides an aggregate measure of performance across all possible classification thresholds, AUC ranges from 0 to 1. Specifically, it refers to the area under the ROC curve. The curve provides a tool to select the best model threshold for balancing sensitivity and specificity. A higher curve towards the upper left corner indicates a better performance, which means an AUC closer to 1.
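A sketch of computing the ROC curve and AUC with scikit-learn (the dataset and classifier are arbitrary choices for illustration; the plot is optional):

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=5000).fit(X_train, y_train)
probs = model.predict_proba(X_test)[:, 1]

fpr, tpr, thresholds = roc_curve(y_test, probs)   # TPR vs. FPR at each threshold
auc = roc_auc_score(y_test, probs)
print("AUC:", round(auc, 3))

plt.plot(fpr, tpr, label=f"ROC (AUC={auc:.2f})")
plt.plot([0, 1], [0, 1], "k--", label="random")
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.legend()
plt.show()
```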
The Lift Curve is used in predictive modeling to show how much better one can expect to do with the predictive model compared to choosing at random. What is the base rate? The term "base rate" in the context of predictive modeling and statistics refers to the underlying probability of a particular class in the data without considering any other factors or features (e.g., if you are predicting fraud in a dataset where 2% of transactions are fraudulent, then the base rate of fraud is 2%).
When will base rate work best? Knowing the base rate can help in choosing the right modeling techniques and metrics. For example, in cases of a very low base rate (highly imbalanced data), standard accuracy might not be a meaningful performance metric because a model that always predicts the majority class would appear highly accurate but is practically useless. Metrics like precision, recall, F1-score, or the use of ROC curves become more relevant in such cases.
What is the difference between ROC and Lift Curve? Unlike the ROC curve that primarily evaluates binary classification problems, the lift curve helps in both binary and multiclass classification scenarios, particularly in direct marketing.
What does the lift curve plot? The lift curve plots the ratio of positive responses (or another target metric) gained by using the model against a baseline of random responses, as a function of the percentage of cases examined.
Comparing the Three
- AUC is a single scalar value summarizing the overall ability of the test to discriminate between the positive and negative classes across all thresholds.
- ROC Curve provides a comprehensive visual representation of a classifier's performance at all thresholds, letting analysts choose a threshold that balances sensitivity and specificity according to the business context.
- Lift Curve focuses more on the effectiveness of a predictive model in terms of "lifting" the results over a random model, which is particularly useful in campaign targeting and resource optimization.
- Each of these metrics and curves has its specific use cases and helps in different aspects of evaluating the performance of classification models.
Linear classification is achieved via a linear discriminant: compute f(x) = w · x + b and assign x to the positive class if f(x) > 0, otherwise to the negative class.
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
SVM
The optimal SVM is the boundary that maximizes the margin.
from sklearn import svm
Complexity parameters of SVM:
- C: The cost. Higher C fits training data more closely
- Kernel: Linear or non-linear with degree of polynomial
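A sketch of varying both complexity parameters of an SVM (the dataset, the C values, and the kernels tried are arbitrary):

```python
from sklearn import svm
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

for C in (0.1, 1, 10):                      # higher C fits the training data more closely
    for kernel in ("linear", "poly", "rbf"):
        model = make_pipeline(StandardScaler(), svm.SVC(C=C, kernel=kernel, degree=3))
        acc = cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()
        print(f"C={C:<4} kernel={kernel:<6} CV accuracy={acc:.3f}")
```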
Regression Models
The prediction (target) variable is numeric and continuous. The decision trees covered earlier are classification models, not regression models; for numeric targets we use regression trees (below).
Regression Trees
from sklearn.tree import DecisionTreeRegressor
Linear Regression
from sklearn.linear_model import LinearRegression
- Linear regression predicts a numeric variable from one or more variables (which can be numeric or categorical)
Linear Relationships
- Correlation coefficient $r_{xy}$: 0 means no relationship, -1 a perfect negative relationship, 1 a perfect positive relationship
- The least squares regression line is the line that minimizes the sum of squared residuals (squaring keeps errors from canceling out)
- Use `statsmodels` to get the regression output
- $\beta_0$ is the Y-intercept. It is not reliable if it is outside the range of the data (extrapolation)
- The interpretation of $\beta_1$ cannot be causal
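A sketch of getting a regression output table with statsmodels (the data and variable names are invented; the true slope of 0.8 is just how the toy data is generated):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)
df = pd.DataFrame({"ad_spend": rng.uniform(0, 100, 50)})
df["sales"] = 5 + 0.8 * df["ad_spend"] + rng.normal(0, 5, 50)

X = sm.add_constant(df[["ad_spend"]])      # adds the intercept term (beta_0)
model = sm.OLS(df["sales"], X).fit()
print(model.summary())                     # coefficients, R-squared, p-values, etc.
```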
Regression concerns
- Violations of assumption: linear relationship, normality of residuals, constant variance
- Regression can be sensitive to: outliers, leverage points, highly correlated predictors
Logistic Regression
from sklearn.linear_model import LogisticRegression
Loss function
Linear Regression - Squared-error loss, Logistic Regression - Logistic Loss
Complexity
More attributes mean more complex relationships. Categorical variables can explode the dimensionality (each level becomes a dummy variable).
Reduce complexity: 1. reduce the number of attributes (variable selection); 2. use a penalty in the objective function (regularization).
Regularized regression: instead of only minimizing the sum of squared residuals (as in least squares), we minimize the sum of squared residuals plus a penalty on the coefficient estimates.
Ridge/L2 regression: the penalty is the sum of the squared coefficient estimates.
Lasso/L1 regression: the penalty is the sum of the absolute values of the coefficient estimates.
λ is the penalty parameter; it protects us from overfitting.
Regularization is also referred to as shrinkage.
λ = 0 means no regularization; a larger λ means a stronger penalty; λ can be determined by cross-validation.
Lasso regression: the lasso will “shrink” parameter estimates towards zero, which has the same effect as variable selection.
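A sketch comparing ridge (L2) and lasso (L1) shrinkage at one fixed penalty; scikit-learn calls the penalty `alpha`, and the dataset and alpha value are arbitrary:

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.preprocessing import StandardScaler

X, y = load_diabetes(return_X_y=True)
X = StandardScaler().fit_transform(X)      # regularization assumes comparable feature scales

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)         # alpha plays the role of lambda
lasso = Lasso(alpha=1.0).fit(X, y)

print("OLS coefficients:  ", ols.coef_.round(1))
print("Ridge coefficients:", ridge.coef_.round(1))   # shrunk toward zero relative to OLS
print("Lasso coefficients:", lasso.coef_.round(1))   # some may be exactly zero
```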
- Add attributes: feature construction, feature engineering
- Remove attributes: dimension reduction, variable selection
Example 1: predict whether a customer will respond to a special offer, based on previous purchases
- We want each row to be one customer, and to avoid redundancy
- E.g., turn 10 purchase entries into 1 row with the total purchase amount
Example 2, Netflix Prize: predict whether someone will like Jurassic Park
Things to do before modeling
- Categorical → dummies, numeric → categorical: highest 10%, or H/M/L categories, binning of target or attribute for long tail
- strings → date variables, extract year/month/day from data
- Combine columns: mean, max, min or total might be more relevant
- Standardize features
  - `StandardScaler`: converts to z-scores
  - `MinMaxScaler`: scales to [0, 1]
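A sketch of both scalers on a toy feature matrix (the values are made up to show two features on very different scales):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [3.0, 1000.0]])

print(StandardScaler().fit_transform(X))   # z-scores: mean 0, std 1 per column
print(MinMaxScaler().fit_transform(X))     # rescaled to the [0, 1] range per column
```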
Example3: Network data: Top 2 buddies were more predictive than anything else
Why remove unimportant attributes?
- Model accuracy, overfitting, efficiency, interpretability, unintended data leakage
Feature selection by Addition
Identify each feature’s impact (correlation, ROC curve, information gain, regression table); keep the significant features and remove the non-significant ones
- Can also do this iteratively, if there is collinearity
Iterative feature subtraction
Backward elimination (regression): start with all features and iteratively remove the least useful one, refitting each time, until removing features no longer improves the model
- In every iteration, when the process suggests removing a categorical variable, it means removing all of its dummy variables (levels) at once, not just one (e.g., removing is_red, is_blue, is_yellow all together for color)
- L1 regularization is another way to reduce the number of features
<img src="assets/lasso.png" width="250">
Use CV to find the best value of lambda, fit the data with the optimal lambda, and remove features with coefficient = 0
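A sketch of that workflow, using cross-validation to pick the penalty for the lasso and then dropping the zero-coefficient features (the dataset is an arbitrary built-in one):

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

X, y = load_diabetes(return_X_y=True)
feature_names = np.array(load_diabetes().feature_names)
X = StandardScaler().fit_transform(X)

lasso = LassoCV(cv=5, random_state=0).fit(X, y)   # CV picks the best alpha (lambda)
print("best alpha:", round(lasso.alpha_, 3))

kept = feature_names[lasso.coef_ != 0]
dropped = feature_names[lasso.coef_ == 0]
print("kept features:   ", list(kept))
print("dropped features:", list(dropped))
```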
Example, online advertising: Dstillery, an ad-targeting company in NYC
How to solve cold start? where we don’t have data yet to train on? What to do if there are very few positive examples? Reducing attribute space?
- Feature reduction - Clustering: Dstillery bought data that allowed them to categorize all URLs into topics
Dimension Reduction - Principal Components
- Principal Components Analysis (PCA): maps a high-dimensional space to a lower-dimensional one
- Take the original data, reduce it to the top principal components, and fit the regression on the PCs instead of the original data (variables should be normalized for PCA)
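A sketch of PCA followed by a regression on the top components (the dataset and the choice of 4 components are arbitrary):

```python
from sklearn.datasets import load_diabetes
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_diabetes(return_X_y=True)

# Normalize, keep the top 4 principal components (arbitrary), then regress on them
model = make_pipeline(StandardScaler(), PCA(n_components=4), LinearRegression())
score = cross_val_score(model, X, y, cv=5, scoring="r2").mean()
print("CV R^2 using 4 principal components:", round(score, 3))
```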
Use cases of similarity in business: find products similar to existing products to identify recommendations; identify items/customers similar to your known best customers
- Unsupervised learning: use similarities to group similar items into clusters
Distance
We normally use Euclidean distance to define distance, and we also turn non-numeric data into numeric data in order to compute distances
Different types of similarity/distances
- Numeric Distance: Euclidean, Manhattan, Cosine distance
- Binary vectors: Hamming distance (# of changes to get from A to B)
- Jaccard similarity: Intersection over union
kNN is a supervised classification algorithm that ignores global structure and simply looks at similarities.
sklearn.neighbors.KNeighborsClassifier
Steps:
- Calculate the distance between the target and all examples in the training set
- Select K examples closest to target in the training set
- Assign target to the most common class among its K nearest neighbors
- Note that our answer will also depend on the K we choose
Scaling data before calculating the distance
- Distance between neighbors will be dominated by attributes with relatively large values
- Solution - Normalize numeric variables
- Scale data between 0(min) and 1(max) - Achieve equal influence
- Create z-scores by subtracting mean and dividing by standard dev
`from sklearn.preprocessing import MinMaxScaler, StandardScaler`
kNN is simple to use and works with multi-class targets, but it is not efficient at prediction time and often not the best performer.
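A sketch of kNN with scaling in a pipeline, so that no attribute dominates the distance (the dataset and k = 5 are arbitrary choices):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

X, y = load_breast_cancer(return_X_y=True)

# Scale each feature to [0, 1] so all attributes have equal influence on the distance
model = make_pipeline(MinMaxScaler(), KNeighborsClassifier(n_neighbors=5))
acc = cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()
print("CV accuracy:", round(acc, 3))
```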
Complexity
- Small `k` tends to overfit the data (the dashed line in the plot is the "Bayes rate")
- In optimizing `k` using CV, 1/k is used as the X-axis, since large `k` corresponds to low complexity
kNN as a regression model
- Find the right `k` via cross-validation; use RMSE instead of the error rate. `sklearn.neighbors.KNeighborsRegressor`
Distinguish kNN and clustering:
- kNN vs Clustering
- Supervised vs. unsupervised; outputs a label or value vs. outputs a set of clusters; classification or regression vs. exploring patterns; no model fitting (lazy, uses the training data directly) vs. a clustering model that must be fit
Clustering is used for: customer segmentation, recommendation algorithms, identification of fraud or risk, and clustering documents for information retrieval
Two basic types of clustering: centroid-based (aka partitional) and hierarchical
Clustering assumes that each observation falls into only one cluster.
- All data can be clustered, but clusters don’t always make sense!
K-means clustering (centroid-based), `sklearn.cluster.KMeans`
- Pick the number of clusters, k
- Start with some guess as to the cluster centers
- Iterate between: assigning each point to its nearest cluster; recalculating the cluster centers (centroids)
- Check out `Module6_similarity/kmeans_clustering.ipynb`
Discussion: K-means scales to fairly large data sets, is sensitive to the initial centers, converges to a local optimum, and the choice of K can be subjective
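A sketch of k-means on toy 2-D data (k = 3 and the blob parameters are arbitrary); printing the centroids mirrors the "do they make sense?" check in the case study below:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)
X = StandardScaler().fit_transform(X)

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print("cluster sizes:", [int((km.labels_ == c).sum()) for c in range(3)])
print("centroids:\n", km.cluster_centers_.round(2))   # inspect: do they make sense?
```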
Hierarchical Clustering
- Start with n clusters, and progressively merge data until you have one cluster
Dendrograms
- Dendrograms are a visual representation of the hierarchical clustering process; the closest points get connected lower in the dendrogram.
- You can partition the data into any number of clusters in [1, n] by cutting the dendrogram at different heights.
HC tends to have low stability; results depend on the linkage (e.g., single/minimum linkage), the distance metric, normalization, etc.
Check out `Module6_Similarity/HierClust.ipynb`
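A sketch of hierarchical clustering with SciPy, building the linkage and cutting the tree into a chosen number of clusters (the toy data, the Ward linkage, and the cut at 3 clusters are arbitrary; the dendrogram plot is optional):

```python
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=50, centers=3, random_state=1)

Z = linkage(X, method="ward")                     # results depend on the linkage choice
labels = fcluster(Z, t=3, criterion="maxclust")   # cut the tree into 3 clusters
print("cluster sizes:", [int((labels == c).sum()) for c in (1, 2, 3)])

dendrogram(Z)
plt.title("Hierarchical clustering dendrogram")
plt.show()
```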
Clustering Beware
The output is hard to evaluate, and clustering will reveal clusters even if there are no real clusters in the data.
Case Study: AT&T Cellular users
- Apply clustering algorithms to find user groups that share similar patterns of usage intensity over time.
- Look at the centroids of the clusters. Do they make sense?