End-to-end machine learning project for detecting fraudulent financial transactions

NarayanKabra21/fraud-detection-ml

💳 Fraud Detection System using Machine Learning

Project Overview

This project implements an end-to-end machine learning pipeline to detect fraudulent financial transactions.
It is designed as a case-study project focusing on data understanding, feature engineering, class imbalance handling, and model evaluation.

The dataset contains over 6.3 million transactions, making this a realistic, industry-scale fraud detection problem.


Business Objective

  • Detect fraudulent transactions with high recall
  • Handle extreme class imbalance correctly
  • Build a model aligned with real-world fraud behavior
  • Provide actionable insights for fraud prevention

Dataset Description

  • Total Records: 6,362,620
  • Total Features: 10
  • Target Variable: isFraud

Data Files

  • Fraud.csv – Transaction-level dataset
  • Data Dictionary.txt – Description of each column

Data Cleaning & Preparation

  • Verified that the dataset contains no missing values
  • Removed identifier columns (nameOrig, nameDest)
  • Handled skewed transaction amounts using log transformation
  • Reduced multicollinearity through feature engineering
  • Prevented data leakage by excluding isFlaggedFraud
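The cleaning steps above can be sketched in plain Python on a single record. This is a minimal illustration, not the notebook's actual code: nameOrig, nameDest, isFlaggedFraud, and isFraud come from this README, while the `amount` and `type` field names and the helper names are assumptions.

```python
import math

# Columns named in this README; every other field name below is an assumption.
IDENTIFIER_COLUMNS = ("nameOrig", "nameDest")
LEAKY_COLUMNS = ("isFlaggedFraud",)


def has_missing_values(rows):
    """Return True if any field in any row is None or an empty string."""
    return any(value is None or value == "" for row in rows for value in row.values())


def clean_row(row):
    """Drop identifier and leakage-prone columns, then add a log-scaled amount."""
    dropped = set(IDENTIFIER_COLUMNS) | set(LEAKY_COLUMNS)
    cleaned = {key: value for key, value in row.items() if key not in dropped}
    # log1p(x) = log(1 + x) stays defined for zero-amount transactions.
    cleaned["log_amount"] = math.log1p(float(row["amount"]))
    return cleaned


# Hypothetical record used only for illustration.
sample = {
    "type": "TRANSFER", "amount": 181.0,
    "nameOrig": "C1", "nameDest": "C2",
    "isFlaggedFraud": 0, "isFraud": 1,
}
cleaned_sample = clean_row(sample)
```

On the full 6.3 million rows, the same steps would normally be vectorized with a dataframe library rather than applied record by record.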

Exploratory Data Analysis (EDA)

Key findings:

  • Dataset is highly imbalanced (~0.13% fraud)
  • Fraud occurs mainly in TRANSFER and CASH_OUT
  • Fraudulent transactions exhibit abnormal balance behavior
  • Transaction amounts are heavily right-skewed

Feature Engineering

The following behavior-based features were created:

  • Sender balance difference
  • Receiver balance difference
  • Zero-balance sender indicator
  • Large transaction indicator
  • Log-transformed transaction amount
  • One-hot encoded transaction types
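As a rough sketch, the features listed above could be derived per transaction as follows. The balance column names (`oldbalanceOrg`, `newbalanceOrig`, `oldbalanceDest`, `newbalanceDest`), the full list of transaction types, the large-amount cutoff, and the reading of "zero-balance sender" as a zero balance before the transaction are all assumptions, not details confirmed by this README.

```python
import math

# Assumed category values; only TRANSFER and CASH_OUT are named in this README.
TRANSACTION_TYPES = ("CASH_IN", "CASH_OUT", "DEBIT", "PAYMENT", "TRANSFER")
LARGE_AMOUNT_THRESHOLD = 200_000.0  # hypothetical cutoff, tune on real data


def engineer_features(row):
    """Build behavior-based features from one raw transaction record."""
    amount = float(row["amount"])
    features = {
        # Positive when the sender's balance drops.
        "sender_balance_diff": row["oldbalanceOrg"] - row["newbalanceOrig"],
        # Positive when the receiver's balance grows.
        "receiver_balance_diff": row["newbalanceDest"] - row["oldbalanceDest"],
        # Assumption: "zero-balance" means the sender held nothing beforehand.
        "sender_zero_balance": int(row["oldbalanceOrg"] == 0),
        "is_large_amount": int(amount > LARGE_AMOUNT_THRESHOLD),
        "log_amount": math.log1p(amount),
    }
    # One-hot encode the transaction type.
    for tx_type in TRANSACTION_TYPES:
        features[f"type_{tx_type}"] = int(row["type"] == tx_type)
    return features


# Hypothetical transfer that empties the sender's account while the
# receiver's balance does not change.
example = engineer_features({
    "type": "TRANSFER", "amount": 250_000.0,
    "oldbalanceOrg": 250_000.0, "newbalanceOrig": 0.0,
    "oldbalanceDest": 0.0, "newbalanceDest": 0.0,
})
```

In the example, the sender difference equals the full amount while the receiver difference is zero, which is the kind of inconsistent balance behavior the difference features are meant to expose.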

Handling Class Imbalance

  • Used stratified train–test split
  • Applied SMOTE only on training data
  • Kept the test set at its original class distribution, so no synthetic samples leak into evaluation
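The order of operations matters here: split first, then oversample only the training portion. The sketch below illustrates that order with standard-library Python. The `smote_like_oversample` function is a simplified stand-in for SMOTE (interpolating between a minority point and one of its nearest minority neighbors), not the implementation of any particular library, and all function names and the toy data are hypothetical.

```python
import random


def stratified_split(labels, test_fraction=0.2, seed=42):
    """Return (train_idx, test_idx) so each class keeps its share in both parts."""
    rng = random.Random(seed)
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


def smote_like_oversample(minority, n_new, k=3, seed=42):
    """Create n_new synthetic points by interpolating toward a near neighbor."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # The k closest other minority points by squared Euclidean distance.
        neighbors = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )[:k]
        neighbor = rng.choice(neighbors)
        gap = rng.random()  # position along the segment from base to neighbor
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbor)])
    return synthetic


# Toy data: 10 fraud rows and 90 legitimate rows with two numeric features.
labels = [1] * 10 + [0] * 90
features = [[float(i), float(i % 7)] for i in range(100)]

train_idx, test_idx = stratified_split(labels)
train_minority = [features[i] for i in train_idx if labels[i] == 1]
n_train_majority = sum(1 for i in train_idx if labels[i] == 0)

# Oversample the training minority only; the test rows are never touched.
synthetic = smote_like_oversample(
    train_minority, n_train_majority - len(train_minority)
)
```

Because synthetic points are built only from `train_minority`, the test set contains real transactions exclusively and reflects the true fraud rate.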

Model Development

Model Used

  • XGBoost Classifier

Why XGBoost?

  • Handles non-linear fraud patterns
  • Copes well with imbalanced data when paired with resampling or class weighting (for example, the scale_pos_weight parameter)
  • Widely used in financial fraud detection

Model Evaluation

Model performance was evaluated using:

  • Recall (Fraud class)
  • Precision and F1-score
  • ROC-AUC score
  • ROC curve visualization

Accuracy was not used as a primary metric due to severe class imbalance.
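To make the point about accuracy concrete, the sketch below computes the listed metrics by hand on made-up labels with roughly the same fraud rate (13 frauds in 10,000 transactions). The function names and data are illustrative only; in practice a metrics library would typically be used.

```python
def classification_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall, and F1 for the positive (fraud) class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


def roc_auc(y_true, scores):
    """Probability that a random fraud scores above a random non-fraud (ties count half)."""
    positives = [s for t, s in zip(y_true, scores) if t == 1]
    negatives = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


# 13 frauds among 10,000 transactions, and a "model" that never flags anything.
y_true = [1] * 13 + [0] * 9987
always_legit = [0] * 10000
baseline = classification_metrics(y_true, always_legit)
# baseline["accuracy"] is 0.9987 while baseline["recall"] is 0.0
```

The do-nothing baseline reaches 99.87% accuracy without catching a single fraud, which is why recall, precision, F1, and ROC-AUC carry the evaluation instead. The pairwise `roc_auc` above is quadratic in the number of rows, so it only suits small samples.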


Key Fraud Drivers

  • Transaction type (TRANSFER and CASH_OUT)
  • Sudden depletion of sender balance
  • Inconsistent receiver balance changes
  • Large transaction amounts
  • Transactions from zero-balance accounts

Fraud Prevention Recommendations

  • Real-time transaction monitoring
  • Step-up authentication for high-risk transactions
  • Adaptive transaction limits
  • Periodic model retraining and threshold tuning
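Threshold tuning, the last recommendation above, can be illustrated with a small helper that picks the strictest score cutoff still meeting a recall target. This is a sketch under assumptions: the function name, the target values, and the example scores are hypothetical, and the README does not state how thresholds are chosen in this project.

```python
def pick_threshold(y_true, scores, target_recall=0.9):
    """Return the highest threshold whose recall on y_true meets target_recall.

    A transaction is flagged as fraud when its score is >= the threshold.
    """
    n_pos = sum(1 for t in y_true if t == 1)
    if n_pos == 0:
        raise ValueError("need at least one fraud label to measure recall")
    # Try candidate thresholds from strictest to most lenient.
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for t, s in zip(y_true, scores) if t == 1 and s >= threshold)
        if tp / n_pos >= target_recall:
            return threshold
    # Only reached if target_recall is above 1.0; fall back to flagging everything.
    return min(scores)


# Hypothetical validation labels and model scores.
val_labels = [0, 0, 1, 0, 1, 1]
val_scores = [0.05, 0.2, 0.3, 0.6, 0.7, 0.9]

strict = pick_threshold(val_labels, val_scores, target_recall=0.6)
catch_all = pick_threshold(val_labels, val_scores, target_recall=1.0)
```

Choosing the highest qualifying threshold keeps false alarms as low as the recall target allows; the threshold should be selected on validation data, not on the final test set.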

Project Structure

1️⃣ data/

  • Stores all datasets used in the project
  • Raw data is never modified to maintain data integrity

Subfolders:

  • raw/
    • Original dataset (Fraud.csv)
    • Data dictionary file
  • processed/
    • Cleaned and feature-engineered datasets (optional / future use)

2️⃣ notebooks/

  • Contains all Jupyter notebooks for analysis and modeling
  • Notebooks are numbered to reflect the logical execution order

Notebooks:

  • 01_data_loading_and_overview.ipynb

    • Data loading and understanding
    • Review of data dictionary and business context
  • 02_data_cleaning_eda.ipynb

    • Data cleaning
    • Exploratory Data Analysis (EDA)
    • Class imbalance analysis
  • 03_feature_engineering.ipynb

    • Creation of behavior-based fraud features
    • Encoding categorical variables
    • Reducing multicollinearity
  • 04_data_splitting_and_imbalance_handling.ipynb

    • Feature–target separation
    • Stratified train–test split
    • Handling class imbalance using SMOTE
  • 05_model_training.ipynb

    • Model training using XGBoost
    • Model evaluation using recall, F1-score, and ROC-AUC

3️⃣ models/

  • Stores trained machine learning models
  • Enables reuse or deployment without retraining

4️⃣ reports/

  • Stores outputs generated from the analysis

Subfolders:

  • figures/
    • Performance visualizations (ROC curve, evaluation plots)

5️⃣ README.md

  • Provides complete project documentation
  • Explains the problem, approach, results, and insights

6️⃣ requirements.txt

  • Lists all Python dependencies required to run the project
  • Ensures reproducibility of the environment

Conclusion

This project demonstrates a real-world, interview-ready approach to fraud detection by combining:

  • Strong data understanding
  • Behavior-driven feature engineering
  • Proper handling of class imbalance
  • Industry-standard machine learning models