I built this project to detect phishing emails from the text of each email using machine learning. The classifier is a logistic regression model.
- The training dataset contains 8,348 labeled examples
- The test dataset contains 1,000 unlabeled examples
Each email has the following columns:
- id
- subject line
- email (body text)
- values (the label, available only for the training set: 1 for a phishing email, 0 for a legitimate email)

EDA, Feature Engineering
- Extract features from the email text and visualize them.
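As a rough illustration of the feature extraction step, the sketch below turns an email's subject and body into a handful of numeric features. The function name `extract_features`, the specific features, and the `URGENT_WORDS` list are assumptions made for this example, not the features actually used in the project.

```python
import re

# Hypothetical word list for illustration; not taken from the project.
URGENT_WORDS = {"urgent", "verify", "account", "password", "click"}

def extract_features(subject, body):
    """Map one email (subject + body text) to a dict of numeric features."""
    text = f"{subject} {body}"
    words = re.findall(r"[A-Za-z]+", text)
    # Words written entirely in capitals, ignoring single letters.
    shouting = [w for w in words if len(w) > 1 and w.isupper()]
    return {
        "num_chars": len(body),
        "num_links": len(re.findall(r"https?://", body, flags=re.IGNORECASE)),
        "num_exclamations": text.count("!"),
        "has_html": int(bool(re.search(r"<[A-Za-z][^>]*>", body))),
        "upper_ratio": len(shouting) / len(words) if words else 0.0,
        "urgent_words": sum(w.lower() in URGENT_WORDS for w in words),
    }

features = extract_features(
    "URGENT: verify your account",
    "Click http://example.com now! <a href='x'>here</a>",
)
print(features)
```

Each dict can become one row of a feature matrix, and the same values can be plotted per class to see which features separate phishing from legitimate emails.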

Modeling
- Apply logistic regression for classification.
- Use cross-validation to estimate performance on unseen data and detect overfitting.
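The two modeling steps can be sketched without any dependencies as follows. In practice a library implementation would normally be used; this is only a minimal illustration of logistic regression trained by gradient descent and scored with k-fold cross-validation. The function names, the learning rate, the number of epochs, and the toy data are all assumptions for the example.

```python
import math
import random

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def fit_logistic(X, y, lr=0.5, epochs=300):
    """Fit weights and bias with batch gradient descent on the log loss."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def predict(w, b, X):
    """Predict 1 when the estimated probability is at least 0.5."""
    return [
        1 if sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) >= 0.5 else 0
        for xi in X
    ]

def cross_val_accuracy(X, y, k=5, seed=0):
    """Return the held-out accuracy of each of k folds."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    scores = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in idx if i not in held]
        w, b = fit_logistic([X[i] for i in train], [y[i] for i in train])
        preds = predict(w, b, [X[i] for i in held_out])
        correct = sum(p == y[i] for p, i in zip(preds, held_out))
        scores.append(correct / len(held_out))
    return scores

# Toy, linearly separable data standing in for real email features.
X = [[1.0, 0.0]] * 10 + [[0.0, 1.0]] * 10
y = [1] * 10 + [0] * 10
scores = cross_val_accuracy(X, y, k=5)
print(sum(scores) / len(scores))
```

A training accuracy that is much higher than the mean held-out accuracy across folds is the usual sign of overfitting.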

Evaluation
- Accuracy, precision, recall, F1 score.
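For reference, the four metrics can be computed from the confusion counts as in the sketch below, treating phishing (label 1) as the positive class. The function name `classification_metrics` and the example labels are made up for illustration.

```python
def classification_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall, and F1 for binary labels (1 = phishing)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    # Precision: of the emails flagged as phishing, how many really were.
    precision = tp / (tp + fp) if tp + fp else 0.0
    # Recall: of the real phishing emails, how many were flagged.
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1: harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Made-up labels and predictions for eight emails.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
print(classification_metrics(y_true, y_pred))
```

Recall matters most when a missed phishing email is costly, while precision matters most when legitimate emails must not be flagged by mistake.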

Test on unlabeled data
The dataset is from a Data Science class at UC Berkeley.