This project solves the “Catching Joe” problem, where the goal is to identify whether a browser session belongs to a specific user, Joe (user_id = 0), based on browsing behavior.
The dataset contains thousands of browsing sessions from multiple users. Because Joe represents only a very small fraction of the sessions, the task becomes a highly imbalanced binary classification problem.
The system learns Joe’s browsing patterns and predicts the probability that a session belongs to Joe.
Given browser session logs, determine whether a session belongs to Joe.
Each session contains information such as:
- browser type
- operating system
- locale
- location
- list of visited websites
- time spent on each website
- session timestamp
Target variable:
- 1 → Joe session
- 0 → Other user session
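The target can be derived directly from the user identity. The sketch below assumes each training record exposes a `user_id` field; the actual schema of dataset.json may differ, and the sample records are made up for illustration.

```python
# Hypothetical records; the real dataset.json schema may differ.
sessions = [
    {"user_id": 0, "browser": "Firefox"},
    {"user_id": 17, "browser": "Chrome"},
    {"user_id": 0, "browser": "Chrome"},
]

JOE_USER_ID = 0  # Joe's id as stated in the problem description


def make_target(records, joe_id=JOE_USER_ID):
    """Return 1 for Joe's sessions and 0 for every other user."""
    return [1 if record["user_id"] == joe_id else 0 for record in records]


y = make_target(sessions)
```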
| File | Description |
|---|---|
| dataset.json | Training dataset with user identities |
| verify.json | Unlabeled sessions used for prediction |
Dataset characteristics:
- ~80,000 browser sessions
- ~400 Joe sessions
- Highly imbalanced dataset (~0.5% Joe sessions)
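To make the imbalance concrete, the short calculation below uses the approximate counts listed above. It also shows the inverse-frequency weights that scikit-learn documents for `class_weight="balanced"`, namely `n_samples / (n_classes * n_samples_in_class)`. The counts are rounded, so the results are indicative only.

```python
n_sessions = 80_000  # approximate total number of sessions
n_joe = 400          # approximate number of Joe sessions
n_other = n_sessions - n_joe

joe_share = n_joe / n_sessions

# Inverse-frequency ("balanced") class weights for a two-class problem
weight_joe = n_sessions / (2 * n_joe)
weight_other = n_sessions / (2 * n_other)

print(f"Joe share: {joe_share:.2%}")
print(f"Class weights -> Joe: {weight_joe:.1f}, other: {weight_other:.3f}")
```

With these counts, a misclassified Joe session weighs roughly 200 times as much as a misclassified session from another user.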
Problem Understanding
↓
Exploratory Data Analysis
↓
Feature Engineering
↓
Baseline Model Training
↓
Model Improvement
↓
Final Joe Detection System
The task was formulated as a binary classification problem with the objective of detecting Joe’s sessions from browsing logs.
Key challenges:
- Extreme class imbalance
- Sparse features
- Behavioral identification from browsing patterns
EDA was performed to understand browsing behavior and identify useful signals.
- session length distribution
- most frequently visited websites
- comparison between Joe and other users
- temporal browsing patterns
- device usage patterns
Initial observations indicated that website visitation patterns were the strongest behavioral indicators.
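The comparison between Joe and other users could be sketched as a simple frequency count. The session tuples below are invented for illustration and are not taken from the dataset.

```python
from collections import Counter

# Hypothetical, simplified sessions: (is_joe, visited sites)
sessions = [
    (1, ["mail.google.com", "slack.com", "youtube.com"]),
    (1, ["slack.com", "mail.google.com"]),
    (0, ["news.example.com", "youtube.com"]),
    (0, ["shop.example.com", "news.example.com"]),
]


def top_sites(records, label, k=3):
    """Return the k most visited sites over all sessions with the given label."""
    counts = Counter(
        site
        for is_joe, sites in records
        if is_joe == label
        for site in sites
    )
    return counts.most_common(k)


joe_top = top_sites(sessions, label=1)
other_top = top_sites(sessions, label=0)
```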
Raw browser sessions were converted into machine learning features.
Website sequences were transformed using TF-IDF vectorization, converting browsing sessions into numerical vectors.
Example (one session written as a space-separated document of site names):
mail.google.com slack.com youtube.com
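A minimal sketch of this step with scikit-learn's `TfidfVectorizer` follows. The `token_pattern` is an assumption: the default pattern would split hostnames at the dots, so the sketch tokenizes on whitespace to keep each site as a single term. The sample documents are illustrative.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Each session is one "document" made of space-separated site names
docs = [
    "mail.google.com slack.com youtube.com",
    "news.example.com shop.example.com",
    "slack.com mail.google.com",
]

# Split on whitespace only, so "mail.google.com" stays one token
vectorizer = TfidfVectorizer(token_pattern=r"\S+")
X_sites = vectorizer.fit_transform(docs)  # sparse matrix: sessions x sites

print(X_sites.shape)
print(sorted(vectorizer.vocabulary_))
```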
Additional features extracted:
- number of sites visited
- total browsing time
- average time per site
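These session statistics could be computed as follows. The per-visit keys `site` and `length` are assumed names for illustration; the real field names in the dataset may differ.

```python
def session_stats(visits):
    """Summarize one session given a list of {"site", "length"} visits."""
    n_sites = len(visits)
    total_time = sum(visit["length"] for visit in visits)
    # Guard against empty sessions to avoid division by zero
    avg_time = total_time / n_sites if n_sites else 0.0
    return {"n_sites": n_sites, "total_time": total_time, "avg_time": avg_time}


example = [
    {"site": "mail.google.com", "length": 120},
    {"site": "slack.com", "length": 60},
    {"site": "youtube.com", "length": 300},
]
stats = session_stats(example)
```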
From session timestamps:
- hour of day
- weekday
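A small sketch of the temporal features, assuming the session timestamp is available as an ISO 8601 string (the dataset may store date and time differently):

```python
from datetime import datetime


def temporal_features(timestamp):
    """Extract hour of day and weekday (Monday = 0) from an ISO timestamp."""
    moment = datetime.fromisoformat(timestamp)
    return {"hour": moment.hour, "weekday": moment.weekday()}


features = temporal_features("2024-03-15T14:30:00")
```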
Categorical encoding for:
- browser
- operating system
- locale
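The source does not state which encoding was used. One common option is one-hot encoding, sketched below with scikit-learn's `OneHotEncoder`. Setting `handle_unknown="ignore"` is an assumption that lets unseen browsers or systems in verify.json encode as all zeros instead of raising an error.

```python
from sklearn.preprocessing import OneHotEncoder

# Columns: browser, operating system, locale (illustrative values)
train_rows = [
    ["Firefox", "Windows 10", "en-US"],
    ["Chrome", "Ubuntu", "ru-RU"],
    ["Chrome", "Windows 10", "en-US"],
]

encoder = OneHotEncoder(handle_unknown="ignore")
X_device = encoder.fit_transform(train_rows)  # sparse one-hot matrix

# Unseen browser and OS are ignored; only the known locale is set
unseen = encoder.transform([["Safari", "MacOS", "en-US"]])
```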
The final feature matrix combines:
- TF-IDF website features
- Device features
- Session statistics
- Temporal features
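One way to combine these blocks while keeping the matrix sparse is `scipy.sparse.hstack`. The small matrices below are placeholders standing in for the real feature blocks.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack

# Placeholder blocks for two sessions
X_sites = csr_matrix(np.array([[0.7, 0.0, 0.7], [0.0, 1.0, 0.0]]))  # TF-IDF
X_device = csr_matrix(np.array([[1.0, 0.0], [0.0, 1.0]]))           # one-hot
X_dense = np.array([
    [3, 480.0, 160.0, 14, 4],  # n_sites, total_time, avg_time, hour, weekday
    [1, 30.0, 30.0, 9, 0],
])

# Stack column-wise; dense statistics are converted to sparse first
X = hstack([X_sites, X_device, csr_matrix(X_dense)], format="csr")
```

In practice the dense columns would usually be scaled before stacking, since their ranges are much larger than TF-IDF or one-hot values.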
The baseline model uses Logistic Regression.
Model configuration:
LogisticRegression(
    class_weight="balanced",
    max_iter=5000
)

Reasons for choosing Logistic Regression:
- Performs well with sparse data
- Efficient for TF-IDF features
- Provides interpretable coefficients
Evaluation metrics:
- Precision
- Recall
- F1-score
- ROC-AUC
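These metrics can be computed with `sklearn.metrics`. The labels and scores below are invented for illustration and do not reflect the project's results. Note that ROC-AUC takes the predicted probabilities, while the other three take thresholded labels.

```python
import numpy as np
from sklearn.metrics import f1_score, precision_score, recall_score, roc_auc_score

y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1])
y_prob = np.array([0.10, 0.20, 0.05, 0.60, 0.30, 0.10, 0.90, 0.40])
y_pred = (y_prob >= 0.5).astype(int)  # default 0.5 cut-off

metrics = {
    "precision": precision_score(y_true, y_pred),
    "recall": recall_score(y_true, y_pred),
    "f1": f1_score(y_true, y_pred),
    "roc_auc": roc_auc_score(y_true, y_prob),
}
```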
Several improvements were explored to enhance model performance.
Because of the extreme class imbalance, the classification threshold was tuned to better balance precision and recall.
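The source does not say which criterion was used to pick the threshold. The sketch below assumes one common choice, maximizing F1 along the precision-recall curve, on invented validation scores.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1])
y_prob = np.array([0.10, 0.20, 0.05, 0.60, 0.30, 0.10, 0.90, 0.40])

precision, recall, thresholds = precision_recall_curve(y_true, y_prob)

# The final precision/recall pair has no matching threshold, so drop it
p, r = precision[:-1], recall[:-1]
f1 = 2 * p * r / np.clip(p + r, 1e-12, None)

best_threshold = thresholds[int(np.argmax(f1))]
y_pred = (y_prob >= best_threshold).astype(int)
```

On this toy data the tuned cut-off falls below 0.5, which recovers a Joe session that the default threshold would miss at the cost of one false positive.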
Grid search was used to tune the hyperparameters of the Logistic Regression model.
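A hedged sketch of such a search with `GridSearchCV` is shown below. The parameter grid, the scoring metric, and the synthetic data are all assumptions; the source does not list the values that were actually searched.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic imbalanced data standing in for the real feature matrix
X, y = make_classification(
    n_samples=600, n_features=20, n_informative=5,
    weights=[0.9, 0.1], random_state=0,
)

param_grid = {"C": [0.01, 0.1, 1.0, 10.0]}  # hypothetical grid

# Stratified folds keep the rare class present in every split
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)

search = GridSearchCV(
    LogisticRegression(class_weight="balanced", max_iter=5000),
    param_grid,
    scoring="average_precision",  # less misleading than accuracy under imbalance
    cv=cv,
)
search.fit(X, y)

best_C = search.best_params_["C"]
```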
An additional model (LightGBM) was trained for comparison.
- Logistic Regression achieved much higher recall for Joe sessions
- LightGBM missed many Joe sessions
Therefore, Logistic Regression was selected as the final model.
The final prediction pipeline:
Browser Session
↓
Feature Engineering
↓
TF-IDF Vectorization
↓
Device Encoding
↓
Session Statistics
↓
Logistic Regression
↓
Probability Prediction
↓
Threshold Decision
Output: Probability(session belongs to Joe)
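The pipeline above could be assembled with scikit-learn's `Pipeline` and `ColumnTransformer`, as sketched below. Column names, sample rows, the choice of `StandardScaler`, and the 0.5 threshold are all assumptions made for illustration; the tuned threshold from the project is not stated in the source.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

train = pd.DataFrame({
    "sites": [
        "mail.google.com slack.com youtube.com",
        "slack.com mail.google.com",
        "news.example.com shop.example.com",
        "shop.example.com video.example.com",
        "news.example.com video.example.com",
        "video.example.com shop.example.com news.example.com",
    ],
    "browser": ["Firefox", "Chrome", "Chrome", "Safari", "Firefox", "Chrome"],
    "os": ["Windows 10", "Ubuntu", "Windows 10", "MacOS", "Ubuntu", "Windows 10"],
    "locale": ["en-US", "en-US", "ru-RU", "en-US", "fr-FR", "ru-RU"],
    "n_sites": [3, 2, 2, 2, 2, 3],
    "total_time": [480, 200, 90, 150, 60, 300],
    "is_joe": [1, 1, 0, 0, 0, 0],
})

features = ColumnTransformer([
    ("sites", TfidfVectorizer(token_pattern=r"\S+"), "sites"),
    ("device", OneHotEncoder(handle_unknown="ignore"), ["browser", "os", "locale"]),
    ("stats", StandardScaler(), ["n_sites", "total_time"]),
])

model = Pipeline([
    ("features", features),
    ("clf", LogisticRegression(class_weight="balanced", max_iter=5000)),
])
model.fit(train.drop(columns="is_joe"), train["is_joe"])

THRESHOLD = 0.5  # placeholder; the tuned value is not given in the source


def predict_joe(sessions, threshold=THRESHOLD):
    """Return Joe probabilities and thresholded 0/1 decisions."""
    proba = model.predict_proba(sessions)[:, 1]
    return proba, (proba >= threshold).astype(int)


new_sessions = pd.DataFrame({
    "sites": ["slack.com mail.google.com youtube.com", "news.example.com video.example.com"],
    "browser": ["Firefox", "Chrome"],
    "os": ["Windows 10", "Windows 10"],
    "locale": ["en-US", "ru-RU"],
    "n_sites": [3, 2],
    "total_time": [400, 100],
})
proba, labels = predict_joe(new_sessions)
```

Wrapping every step in one pipeline means the same fitted vocabulary, encoder categories, and scaling are reused at prediction time.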
Certain websites frequently appeared in Joe’s sessions (e.g., mail.google.com, slack.com, youtube.com).
These recurring patterns acted as behavioral fingerprints.
Joe used multiple browsers and operating systems, so device features were useful only as supporting signals.
Joe’s sessions occurred at various times, making time-based features less significant.
Because the dataset contains sparse features, Logistic Regression performed better than tree-based models.
| Model | Observation |
|---|---|
| Logistic Regression | Very high recall for Joe sessions |
| LightGBM | Higher precision but missed many Joe sessions |
Since the objective is reliable Joe detection, Logistic Regression was selected as the final model.
Example session
mail.google.com → slack.com → youtube.com
Model output
Joe Probability = 0.87
Prediction = Joe
catching-joe/
│
├── data/
│ ├── dataset.json
│ └── verify.json
│
├── notebooks/
│ └── eda.ipynb
│
└── README.md
- Python
- Pandas
- NumPy
- Scikit-learn
- TF-IDF Vectorization
- Logistic Regression
- Jupyter Notebook
This project demonstrates that user browsing behavior can be used to identify individuals from browser sessions.
Using TF-IDF website features combined with Logistic Regression, the system successfully learns Joe’s browsing patterns and predicts the probability that a session belongs to Joe.
This approach shows how behavioral data can be leveraged for user identification and anomaly detection in large-scale browsing datasets.