In this project, we implement the Logistic Regression algorithm in Python to predict whether an internet user will click on an advertisement. We build a binary classification model using the Advertising Click dataset, which contains various user features along with whether each user clicked on the ad.
- Introduction to Logistic Regression
- Problem Statement
- Data Overview
- Data Preprocessing
- Exploratory Data Analysis
- Model Building
- Model Evaluation
- Results and Conclusion
- References
Logistic Regression is a fundamental machine learning algorithm for binary classification. Despite the word "regression" in its name, it is used for classification tasks where the target variable is categorical.
- Sigmoid Function: Maps any real value to a probability between 0 and 1
- Decision Boundary: Threshold (typically 0.5) that separates classes
- Cost Function: Cross-entropy loss instead of mean squared error
- Probability Output: Returns probabilities that can be interpreted as confidence scores
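To make these ideas concrete, here is a small NumPy sketch (illustrative values only, not part of the project model) that computes sigmoid probabilities from a linear score, applies the 0.5 decision threshold, and evaluates the cross-entropy loss:

```python
import numpy as np

def sigmoid(z):
    """Map any real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Made-up weights, bias, and feature vectors for illustration only
w = np.array([0.8, -0.5])
b = 0.1
X = np.array([[1.2, 0.3],
              [-0.7, 2.0]])
y_true = np.array([1, 0])

z = X @ w + b                      # linear score
p = sigmoid(z)                     # predicted probability of class 1
y_pred = (p >= 0.5).astype(int)    # decision boundary at 0.5

# Binary cross-entropy (log) loss instead of mean squared error
loss = -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))
print(p, y_pred, loss)
```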
Business Question: Can we predict whether a user will click on an advertisement based on their demographic and behavioral characteristics?
Objective: Build a binary classifier using Logistic Regression to predict ad clicks (Clicked on Ad = 0 or 1) based on user features like time spent on site, age, income, and internet usage patterns.
The dataset contains the following features:
- 'Daily Time Spent on Site': Consumer time on site in minutes (continuous)
- 'Age': Customer age in years (continuous)
- 'Area Income': Average income of the consumer's geographical area (continuous)
- 'Daily Internet Usage': Average minutes per day consumer is on the internet (continuous)
- 'Ad Topic Line': Headline of the advertisement (categorical)
- 'City': City of consumer (categorical)
- 'Male': Whether consumer was male (binary: 0 or 1)
- 'Country': Country of consumer (categorical)
- 'Timestamp': Time when consumer clicked on Ad or closed window (datetime)
- 'Clicked on Ad': Target variable (binary: 0 or 1)
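A quick way to get this overview with pandas is sketched below; the CSV filename is an assumption and should be adjusted to wherever the dataset is stored:

```python
import pandas as pd

# Filename is an assumption; point it at your copy of the dataset
df = pd.read_csv('advertising.csv')

df.head()                             # first few rows
df.info()                             # column types and non-null counts
df.describe()                         # summary statistics for numerical features
df['Clicked on Ad'].value_counts()    # class balance of the target
```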
- Handling Missing Values: Identify and treat missing data
- Feature Engineering:
- Extract time-based features from timestamp (hour, day of week, month)
- Create relevant aggregates if needed
- Categorical Encoding:
- One-hot encoding for 'Country', 'City', 'Ad Topic Line'
- Label encoding for ordinal categories if any
- Feature Scaling: Standardize/Normalize numerical features
- Train-Test Split: Split data into training and testing sets (typically 70-30 or 80-20)
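A minimal sketch of the timestamp extraction and categorical encoding steps above, assuming the dataset has been loaded into a DataFrame `df` with the columns listed in the data overview (scaling and the train-test split appear in the model-building code later):

```python
import pandas as pd

# Extract time-based features from the raw timestamp
df['Timestamp'] = pd.to_datetime(df['Timestamp'])
df['Hour'] = df['Timestamp'].dt.hour
df['DayOfWeek'] = df['Timestamp'].dt.dayofweek
df['Month'] = df['Timestamp'].dt.month

# One-hot encode Country; 'City' and 'Ad Topic Line' tend to be
# high-cardinality, so they are dropped here rather than encoded
df = pd.get_dummies(df, columns=['Country'], drop_first=True)
df = df.drop(columns=['Ad Topic Line', 'City', 'Timestamp'])
```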
- Target Variable Distribution: Balance check of clicked vs non-clicked ads
- Correlation Analysis: Relationship between features and target variable
- Feature Distributions:
- Age distribution by click status
- Time spent on site vs click rate
- Income levels and click behavior
- Internet usage patterns
- Categorical Analysis:
- Click rates by country
- Gender differences in click behavior
- Ad topic performance
- Correlation heatmap
- Distribution plots for numerical features
- Count plots for categorical features
- Box plots showing feature distributions by click status
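The visualizations above could be produced with seaborn along these lines (a sketch, assuming `df` holds the dataset):

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Target balance: clicked vs. non-clicked
sns.countplot(x='Clicked on Ad', data=df)
plt.show()

# Age distribution by click status
sns.histplot(data=df, x='Age', hue='Clicked on Ad', kde=True)
plt.show()

# Daily time on site vs. daily internet usage, colored by click status
sns.scatterplot(data=df, x='Daily Time Spent on Site',
                y='Daily Internet Usage', hue='Clicked on Ad')
plt.show()

# Correlation heatmap of the numerical features
sns.heatmap(df.select_dtypes('number').corr(), annot=True, cmap='coolwarm')
plt.show()

# Box plot: area income by click status
sns.boxplot(data=df, x='Clicked on Ad', y='Area Income')
plt.show()
```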
```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Features and target (df as loaded in the data overview); the baseline
# model uses the numerical/binary columns only
features = ['Daily Time Spent on Site', 'Age', 'Area Income',
            'Daily Internet Usage', 'Male']
X = df[features]
y = df['Clicked on Ad']

# Train-test split (70/30), done before scaling to avoid data leakage
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# Feature scaling: fit the scaler on the training set only
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Model training
logreg = LogisticRegression()
logreg.fit(X_train, y_train)

# Predictions: class labels and positive-class probabilities
y_pred = logreg.predict(X_test)
y_pred_proba = logreg.predict_proba(X_test)[:, 1]
```
- Basic Logistic Regression: With default parameters
- Regularized Models: L1 (Lasso) and L2 (Ridge) regularization
- Hyperparameter Tuning: Using GridSearchCV for optimal parameters
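A sketch of the regularization comparison and grid search, assuming the scaled training data from the snippet above; the parameter grid values are illustrative:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Search over regularization strength and penalty type;
# the liblinear solver supports both L1 (Lasso) and L2 (Ridge) penalties
param_grid = {
    'C': [0.01, 0.1, 1, 10, 100],
    'penalty': ['l1', 'l2'],
}
grid = GridSearchCV(
    LogisticRegression(solver='liblinear'),
    param_grid,
    cv=5,
    scoring='roc_auc',
)
grid.fit(X_train, y_train)

print('Best parameters:', grid.best_params_)
print('Best CV ROC-AUC:', grid.best_score_)
best_model = grid.best_estimator_
```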
- Accuracy: Overall correctness of predictions
- Precision: Quality of positive predictions
- Recall: Coverage of actual positive cases
- F1-Score: Harmonic mean of precision and recall
- ROC-AUC: Area under ROC curve measuring separability
- Confusion Matrix: Detailed breakdown of predictions
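All of these metrics are available in scikit-learn; a sketch using the test-set predictions from the model-building code:

```python
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score, confusion_matrix,
                             classification_report)

print('Accuracy :', accuracy_score(y_test, y_pred))
print('Precision:', precision_score(y_test, y_pred))
print('Recall   :', recall_score(y_test, y_pred))
print('F1-score :', f1_score(y_test, y_pred))
print('ROC-AUC  :', roc_auc_score(y_test, y_pred_proba))
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
```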
- Baseline Performance: Compare against random/majority class classifier
- Cross-Validation: Ensure model generalizability
- Feature Importance: Identify most influential predictors
- Learning Curves: Check for overfitting/underfitting
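A possible sketch of the baseline comparison, cross-validation, and coefficient-based feature importance checks (learning curves follow the same pattern with sklearn.model_selection.learning_curve):

```python
import numpy as np
import pandas as pd
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score

# Baseline: majority-class classifier
baseline = DummyClassifier(strategy='most_frequent').fit(X_train, y_train)
print('Baseline accuracy:', baseline.score(X_test, y_test))

# 5-fold cross-validation of the logistic regression model
cv_scores = cross_val_score(logreg, X_train, y_train, cv=5, scoring='accuracy')
print('CV accuracy: %.3f +/- %.3f' % (cv_scores.mean(), cv_scores.std()))

# Feature importance: standardized coefficients sorted by absolute magnitude
coef = pd.Series(logreg.coef_[0], index=features)
print(coef.sort_values(key=np.abs, ascending=False))
```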
- Model Performance:
- Expected accuracy: > 80% (given the nature of advertising data)
- Key drivers: time spent on site, daily internet usage, and age are likely to be strong predictors
- Business Insights:
- Demographic segments most likely to click ads
- Optimal time for ad displays
- Behavioral patterns of engaged users
- Model Limitations:
- Potential missing features (user interests, device type, etc.)
- Temporal changes in user behavior
- Privacy considerations in feature usage
The Logistic Regression model provides an interpretable and efficient solution for predicting ad clicks. The coefficients offer direct insights into feature importance, making it valuable for marketing strategy decisions.
- James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An Introduction to Statistical Learning
- Scikit-learn Documentation: Logistic Regression
- Hosmer, D. W., Lemeshow, S., & Sturdivant, R. X. (2013). Applied Logistic Regression
- Google Analytics documentation for digital advertising metrics
- Industry reports on digital advertising click-through rates
This project demonstrates the practical application of Logistic Regression in digital marketing analytics, providing actionable insights for advertising optimization and user engagement strategies.