Credit Card Fraud Detection

This repository contains a Python script that implements various machine learning models to detect credit card fraud based on a dataset of anonymized credit card transactions.

The dataset used in this project can be found on Kaggle: Credit Card Fraud Detection Dataset

Overview

The script follows these steps:

Import libraries
Load the dataset
Preprocess the data (handle missing values, duplicates, and outliers)
Standardize numerical features
Split the data into training and testing sets
Handle imbalanced data using SMOTE
Train multiple models (Logistic Regression, Random Forest, KNN, XGBoost, and LightGBM)
Evaluate model performance using various metrics
Perform hyperparameter tuning (optional)

Based on the evaluation metrics, the Random Forest model performs the best among the selected models.

Model Performance

The following table shows the performance of the models used in this project:

Model	AUPRC	Precision (class 1)	Recall (class 1)	F1-score (class 1)	False Positives	False Negatives
Logistic Regression	0.7322	0.80	0.64	0.71	16	28
Random Forest	0.8348	0.91	0.76	0.83	7	23
KNN	0.5461	0.24	0.83	0.37	209	13
XGBoost	0.7943	0.89	0.68	0.77	9	25
LightGBM	0.8159	0.88	0.73	0.80	10	21
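
As a quick sanity check, the F1-score column can be recomputed from the precision and recall columns, since F1 is their harmonic mean. The snippet below is a small sketch using only the values from the table; the helper name f1_from_precision_recall is made up for this example.

```python
# Precision, recall, and reported F1 for the fraud class, copied from the table above.
reported = {
    "Logistic Regression": (0.80, 0.64, 0.71),
    "Random Forest": (0.91, 0.76, 0.83),
    "KNN": (0.24, 0.83, 0.37),
    "XGBoost": (0.89, 0.68, 0.77),
    "LightGBM": (0.88, 0.73, 0.80),
}


def f1_from_precision_recall(precision, recall):
    """Return the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


for model, (precision, recall, f1_reported) in reported.items():
    f1 = f1_from_precision_recall(precision, recall)
    print(f"{model}: computed F1 = {f1:.2f}, reported F1 = {f1_reported:.2f}")
```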

Usage

To run the script, you will need Jupyter Notebook and the following libraries installed:

numpy
pandas
scikit-learn
imbalanced-learn
scipy
xgboost
lightgbm

Download the dataset from Kaggle and place it in the same directory as the script. Then run the script in Jupyter Notebook.
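
Before running, it can help to confirm that the libraries listed above are importable. The following is a small sketch using only the Python standard library; the helper name missing_packages is made up for this example. Note that scikit-learn and imbalanced-learn are imported under the names sklearn and imblearn.

```python
import importlib.util

# Import names for the libraries listed above.
REQUIRED = ["numpy", "pandas", "sklearn", "imblearn", "scipy", "xgboost", "lightgbm"]


def missing_packages(names):
    """Return the names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_packages(REQUIRED)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are available.")
```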

Conclusion

Based on the evaluation metrics, the Random Forest model performs the best among the selected models for this imbalanced dataset. The reasoning behind this conclusion is as follows:

AUPRC (Area Under the Precision-Recall Curve): Random Forest has the highest AUPRC (0.8348) among all models. AUPRC summarizes the trade-off between precision and recall across decision thresholds, which makes it well suited to comparing models on imbalanced data.
Precision, Recall, and F1-score: Random Forest has the highest precision (0.91) for the positive class (fraud), so a transaction it flags is more likely to be actual fraud than one flagged by any other model. Its recall (0.76) is second only to KNN (0.83), and KNN reaches that recall at a precision of just 0.24. Random Forest also has the highest F1-score (0.83), indicating the best balance between precision and recall.
Confusion Matrix: Random Forest produces the fewest false positives (7). Its 23 false negatives are more than KNN (13) and LightGBM (21) but fewer than XGBoost (25) and Logistic Regression (28).
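
This README does not state how the AUPRC values were computed. One common approach, assuming scikit-learn, is average_precision_score; a closely related alternative integrates the precision-recall curve with the trapezoidal rule via precision_recall_curve and auc. The sketch below shows both on a handful of made-up labels and scores, which are not taken from the dataset.

```python
from sklearn.metrics import auc, average_precision_score, precision_recall_curve

# Toy labels (1 = fraud) and predicted fraud probabilities, invented for illustration.
y_true = [0, 0, 0, 0, 0, 0, 1, 0, 1, 1]
y_score = [0.05, 0.10, 0.20, 0.30, 0.35, 0.40, 0.45, 0.60, 0.80, 0.90]

# Average precision: a step-wise summary of the precision-recall curve.
ap = average_precision_score(y_true, y_score)

# Trapezoidal area under the same curve; generally close to, but not equal to, AP.
precision, recall, _ = precision_recall_curve(y_true, y_score)
trapezoid_area = auc(recall, precision)

print(f"Average precision: {ap:.4f}")
print(f"Trapezoidal AUPRC: {trapezoid_area:.4f}")
```

Both functions take predicted scores or probabilities rather than hard class labels, because the curve is traced by sweeping the decision threshold.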

Although accuracy is high for all the models, it is not a reliable metric here because the dataset is highly imbalanced: a model can score well on accuracy while still missing most fraud cases. The metrics above give a better picture of each model's performance.
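
To make the point about accuracy concrete, consider a trivial model that labels every transaction as legitimate. The counts below are hypothetical and chosen only for illustration; they are not taken from the dataset.

```python
# Hypothetical test set: 10,000 transactions, of which 20 are fraud.
n_total = 10_000
n_fraud = 20

# A "model" that labels every transaction as legitimate.
true_positives = 0
false_negatives = n_fraud
true_negatives = n_total - n_fraud
false_positives = 0

accuracy = (true_positives + true_negatives) / n_total
recall = true_positives / (true_positives + false_negatives)

print(f"Accuracy: {accuracy:.4f}")       # prints "Accuracy: 0.9980"
print(f"Recall (fraud): {recall:.4f}")   # prints "Recall (fraud): 0.0000"
```

The model is 99.8% accurate yet detects no fraud at all, which is why precision, recall, F1-score, and AUPRC for the fraud class are used for the comparison instead.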

Considering all these factors, the Random Forest model seems to be the best performer among the selected models for this imbalanced dataset.
