This project implements an end-to-end machine learning pipeline to detect fraudulent financial transactions.
It is designed as a case-study project focusing on data understanding, feature engineering, class imbalance handling, and model evaluation.
The dataset contains over 6.3 million transactions, making this a realistic, industry-scale fraud detection problem.
- Detect fraudulent transactions with high recall
- Handle extreme class imbalance correctly
- Build a model aligned with real-world fraud behavior
- Provide actionable insights for fraud prevention
- Total Records: 6,362,620
- Total Features: 10
- Target Variable: isFraud
- Fraud.csv: Transaction-level dataset
- Data Dictionary.txt: Description of each column
- Verified that the dataset contains no missing values
- Removed identifier columns (nameOrig, nameDest)
- Handled skewed transaction amounts using log transformation
- Reduced multicollinearity through feature engineering
- Prevented data leakage by excluding isFlaggedFraud
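The cleaning steps above can be sketched roughly as follows. This is a minimal illustration on a made-up three-row frame, not the project's notebook code. The column name amount and the helper clean_transactions are assumptions; nameOrig, nameDest, isFlaggedFraud and isFraud are the names used in this README.

```python
import numpy as np
import pandas as pd

# Tiny made-up sample; the real Fraud.csv has about 6.3 million rows.
raw = pd.DataFrame({
    "amount": [0.0, 99.0, 1_000_000.0],  # assumed column name
    "nameOrig": ["C1", "C2", "C3"],
    "nameDest": ["M1", "C9", "C8"],
    "isFlaggedFraud": [0, 0, 1],
    "isFraud": [0, 0, 1],
})

def clean_transactions(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper mirroring the cleaning steps listed above."""
    # Fail early if the no-missing-values check does not hold.
    if df.isna().any().any():
        raise ValueError("unexpected missing values")
    # Drop the identifier columns and the leakage-prone flag column.
    out = df.drop(columns=["nameOrig", "nameDest", "isFlaggedFraud"])
    # log1p compresses the right tail and is defined for zero amounts.
    out["log_amount"] = np.log1p(out["amount"])
    return out

cleaned = clean_transactions(raw)
```

Using log1p rather than a plain logarithm is one possible choice here; it keeps zero-amount rows finite.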
Key findings:
- Dataset is highly imbalanced (~0.13% fraud)
- Fraud occurs mainly in TRANSFER and CASH_OUT
- Fraudulent transactions exhibit abnormal balance behavior
- Transaction amounts are heavily right-skewed
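Findings such as the overall fraud rate and the per-type breakdown can be obtained with a short pandas summary. The sketch below runs on made-up rows that do not reproduce the real rates, and it assumes the transaction type is stored in a column named type.

```python
import pandas as pd

# Made-up rows for illustration only.
df = pd.DataFrame({
    "type": ["PAYMENT", "TRANSFER", "CASH_OUT", "PAYMENT", "TRANSFER", "CASH_IN"],
    "isFraud": [0, 1, 1, 0, 0, 0],
})

# Overall share of fraudulent rows (the mean of a 0/1 label).
overall_rate = df["isFraud"].mean()

# Fraud count and fraud rate per transaction type.
by_type = (
    df.groupby("type")["isFraud"]
      .agg(n_fraud="sum", fraud_rate="mean")
      .sort_values("n_fraud", ascending=False)
)

# Types in which at least one fraud was observed.
fraud_types = sorted(by_type.index[by_type["n_fraud"] > 0])
```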
The following behavior-based features were created:
- Sender balance difference
- Receiver balance difference
- Zero-balance sender indicator
- Large transaction indicator
- Log-transformed transaction amount
- One-hot encoded transaction types
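A rough sketch of how these features might be derived is shown below. The balance and amount column names (oldbalanceOrg, newbalanceOrig, oldbalanceDest, newbalanceDest, amount, type) are assumptions, as are the helper name, the sign conventions, and the quantile used for the large-transaction cutoff; the project's notebook may define them differently.

```python
import numpy as np
import pandas as pd

# Made-up rows; column names are assumed, not confirmed by this README.
df = pd.DataFrame({
    "type": ["TRANSFER", "PAYMENT", "CASH_OUT", "PAYMENT"],
    "amount": [5000.0, 20.0, 300.0, 45.0],
    "oldbalanceOrg": [5000.0, 100.0, 0.0, 60.0],
    "newbalanceOrig": [0.0, 80.0, 0.0, 15.0],
    "oldbalanceDest": [0.0, 0.0, 50.0, 0.0],
    "newbalanceDest": [0.0, 0.0, 350.0, 0.0],
})

def add_behavior_features(df: pd.DataFrame, large_quantile: float = 0.75) -> pd.DataFrame:
    """Hypothetical helper; sign conventions and cutoff are illustrative."""
    out = df.copy()
    # How much left the sender and how much arrived at the receiver.
    out["sender_balance_diff"] = out["oldbalanceOrg"] - out["newbalanceOrig"]
    out["receiver_balance_diff"] = out["newbalanceDest"] - out["oldbalanceDest"]
    # 1 when the sender had no funds before the transaction.
    out["sender_zero_balance"] = (out["oldbalanceOrg"] == 0).astype(int)
    # 1 when the amount exceeds the chosen quantile of all amounts.
    cutoff = out["amount"].quantile(large_quantile)
    out["is_large_amount"] = (out["amount"] > cutoff).astype(int)
    out["log_amount"] = np.log1p(out["amount"])
    # One indicator column per transaction type.
    return pd.get_dummies(out, columns=["type"], prefix="type", dtype=int)

features = add_behavior_features(df)
```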
- Used stratified train–test split
- Applied SMOTE only on training data
- Ensured no data leakage into the test set
- XGBoost Classifier
- Handles non-linear fraud patterns
- Performs well on imbalanced data
- Widely used in financial fraud detection
Model performance was evaluated using:
- Recall (Fraud class)
- Precision and F1-score
- ROC-AUC score
- ROC curve visualization
Accuracy was not used as a primary metric due to severe class imbalance.
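The sketch below shows, on made-up labels, why accuracy is misleading at this level of imbalance and how the listed metrics can be computed with scikit-learn. The predictions and scores are invented for illustration and are not results from this project.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score, f1_score, precision_score, recall_score, roc_auc_score,
)

# Made-up labels: 2 frauds among 1,000 transactions.
y_true = np.zeros(1000, dtype=int)
y_true[:2] = 1

# A model that never predicts fraud is almost always "right" yet catches nothing.
y_never = np.zeros(1000, dtype=int)
naive_accuracy = accuracy_score(y_true, y_never)
naive_recall = recall_score(y_true, y_never)

# A hypothetical model that catches both frauds and raises two false alarms.
y_pred = y_true.copy()
y_pred[2:4] = 1
scores = np.where(y_pred == 1, 0.9, 0.1)  # stand-in predicted probabilities

metrics = {
    "recall": recall_score(y_true, y_pred),
    "precision": precision_score(y_true, y_pred),
    "f1": f1_score(y_true, y_pred),
    "roc_auc": roc_auc_score(y_true, scores),
}
```

Here the never-fraud model reaches 99.8% accuracy with zero recall, while the second model has perfect recall at 50% precision.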
- Transaction type (TRANSFER and CASH_OUT)
- Sudden depletion of sender balance
- Inconsistent receiver balance changes
- Large transaction amounts
- Transactions from zero-balance accounts
- Real-time transaction monitoring
- Step-up authentication for high-risk transactions
- Adaptive transaction limits
- Periodic model retraining and threshold tuning
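Threshold tuning, the last recommendation above, could look roughly like the following: pick the highest score cutoff that still reaches a target recall on held-out data. The helper name, the target recall, and the scores are all hypothetical.

```python
import numpy as np

def threshold_for_recall(y_true, scores, target_recall):
    """Return the highest score threshold whose recall meets the target.

    Hypothetical helper: a transaction is flagged when score >= threshold.
    """
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    fraud_scores = np.sort(scores[y_true == 1])[::-1]  # descending order
    if fraud_scores.size == 0:
        raise ValueError("no fraud cases to tune against")
    # Number of frauds that must be caught to reach the target recall.
    needed = max(int(np.ceil(target_recall * fraud_scores.size)), 1)
    return float(fraud_scores[needed - 1])

# Made-up held-out labels and model scores.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.95, 0.80, 0.40, 0.20, 0.60, 0.30, 0.10, 0.05]

threshold = threshold_for_recall(y_true, scores, target_recall=0.75)
flagged = [s >= threshold for s in scores]
```

In this toy case the threshold lands at 0.40, which flags three of the four frauds and one legitimate transaction. Lowering the target recall would raise the threshold and reduce false alarms.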
- Stores all datasets used in the project
- Raw data is never modified to maintain data integrity
Subfolders:
- raw/
  - Original dataset (Fraud.csv)
  - Data dictionary file
- processed/
  - Cleaned and feature-engineered datasets (optional / future use)
- Contains all Jupyter notebooks for analysis and modeling
- Notebooks are numbered to reflect the logical execution order
Notebooks:
- 01_data_loading_and_overview.ipynb
  - Data loading and understanding
  - Review of data dictionary and business context
- 02_data_cleaning_eda.ipynb
  - Data cleaning
  - Exploratory Data Analysis (EDA)
  - Class imbalance analysis
- 03_feature_engineering.ipynb
  - Creation of behavior-based fraud features
  - Encoding categorical variables
  - Reducing multicollinearity
- 04_data_splitting_and_imbalance_handling.ipynb
  - Feature–target separation
  - Stratified train–test split
  - Handling class imbalance using SMOTE
- 05_model_training.ipynb
  - Model training using XGBoost
  - Model evaluation using recall, F1-score, and ROC-AUC
- Stores trained machine learning models
- Enables reuse or deployment without retraining
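Saving and reloading a fitted model might look like the sketch below, which uses joblib. A small logistic regression stands in for the trained classifier, and the file name is hypothetical; XGBoost models also offer their own save_model and load_model methods.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression

# Stand-in for the trained fraud classifier.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])
model = LogisticRegression().fit(X, y)

# Write the model to disk and read it back without retraining.
with tempfile.TemporaryDirectory() as tmp:
    model_path = Path(tmp) / "fraud_model.joblib"  # hypothetical file name
    joblib.dump(model, model_path)
    restored = joblib.load(model_path)

# The restored model should reproduce the original predictions.
same_predictions = bool((restored.predict(X) == model.predict(X)).all())
```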
- Stores outputs generated from the analysis
Subfolders:
- figures/
  - Performance visualizations (ROC curve, evaluation plots)
- Provides complete project documentation
- Explains the problem, approach, results, and insights
- Lists all Python dependencies required to run the project
- Ensures reproducibility of the environment
This project demonstrates a real-world, interview-ready approach to fraud detection by combining:
- Strong data understanding
- Behavior-driven feature engineering
- Proper handling of class imbalance
- Industry-standard machine learning models