---
title: "Credit Card Transaction Fraud Classifier"
author: "Mu Niu"
date: "2023-12-10"
output: pdf_document
---
```{r, echo = F, message = F, warning = F}
# load the packages
library(tidymodels)
library(xgboost)
library(tidyverse)
library(pROC)
library(ggplot2)
library(gridExtra)
library(DMwR)
library(caret)
library(e1071)
```
## Abstract
In this project, I used a credit card transaction dataset from
Kaggle to develop a fraud detection classifier. Exploratory Data
Analysis revealed an imbalanced distribution in the fraud response
variable and no missing values. To address the imbalance, I
applied SMOTE, improving the model's ability to classify the
minority class. The primary model was XGBoost, complemented
by an SVM trained on normalized data for comparative analysis. The
XGBoost model outperformed the SVM in AUC and also offered advantages
in interpretability, computational efficiency, and robustness, so the
conclusion favored XGBoost as the superior model for this scenario.
Further evaluation using the confusion matrix showed that adjusting
XGBoost's probability-to-class-label threshold could refine the model
further, particularly by reducing the false positive rate.
## Introduction
**(1) Problem**
- Problem Formulation:
In the rapidly evolving landscape of digital payments, detecting fraud
is an escalating challenge, with an enormous volume of card transactions
processed every day. This project tackles that challenge by framing
fraud detection as a classification problem on a dataset with 7 predictor
variables. I primarily use XGBoost for this task. Additionally,
to assess and validate the approach, I train a Support Vector
Machine as a secondary model and compare the performance of both models
using the Area Under the Curve (AUC) metric.
- Statistical Learning Algorithms Discussion:
The main machine learning algorithm we apply is XGBoost,
a sophisticated ensemble technique based on decision trees.
Operating within a gradient boosting framework, XGBoost
constructs a series of weak learners sequentially. Each model in the
series aims to correct the errors of its predecessors, yielding a
robust and accurate composite model.
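Schematically, the boosting update takes the form below (a generic gradient-boosting sketch with learning rate $\eta$, not XGBoost's exact regularized objective):
$$\hat{F}_m(x) = \hat{F}_{m-1}(x) + \eta \, h_m(x),$$
where $h_m$ is a small decision tree fit to the errors of the current ensemble $\hat{F}_{m-1}$.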
For comparative analysis, we also train a Support Vector
Machine. The SVM is a powerful supervised learning
algorithm that seeks an optimal separating hyperplane in an N-dimensional
space (where N is the number of features). This hyperplane
categorizes data points into distinct classes while maximizing the
margin between the data points and the hyperplane, thereby enhancing
classification accuracy. By comparing the performance
of XGBoost and the SVM, particularly through the lens of AUC, I aim to find
the most effective model for fraud detection in our dataset.
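For reference, the soft-margin SVM solves
$$\min_{w,\,b,\,\xi} \ \frac{1}{2}\lVert w \rVert^2 + C \sum_{i=1}^{n} \xi_i \quad \text{subject to} \quad y_i\,(w^\top x_i + b) \ge 1 - \xi_i, \ \ \xi_i \ge 0,$$
where the cost parameter $C$ is exactly the `cost` value we cross-validate over in the SVM section.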
**(2) Data Set**
- Source(link): [Credit Card
Fraud](https://www.kaggle.com/datasets/dhanushnarayananr/credit-card-fraud/data)
- Feature Explanation:
| Variable Name | Feature Explanation | Type |
|:------------------------:|:--------------------------:|:-----------------:|
| distance_from_home | Distance from home to where the transaction happened | Continuous |
| distance_from_last_transaction | Distance from the location of the last transaction | Continuous |
| ratio_to_median_purchase_price | Ratio of the transaction's purchase price to the median purchase price | Continuous |
| repeat_retailer | Whether the transaction happened at the same retailer | Discrete |
| used_chip | Whether the transaction used the credit card chip | Discrete |
| used_pin_number | Whether the transaction used a PIN number | Discrete |
| online_order | Whether the transaction was an online order | Discrete |
| fraud | Whether the transaction is fraudulent | Discrete |
- Exploratory Data Analysis and Visualizations
```{r echo=F, message=FALSE, warning = F}
# load the data
data <- read_csv("card_transdata.csv")
```
```{r, cache = T, echo = F}
# plot distribution of the response
plot1 <- ggplot(data, aes(x = ifelse(data$fraud == 0, 'Safe', 'Fraud'))) +
geom_bar(fill = c('cyan','cornflowerblue')) +
labs(title="Distribution of Fraud in the Dataset", x="Fraud Class", y="Count") +
theme(plot.title = element_text(hjust = 0.5)) +
geom_text(aes(label = after_stat(count)), stat = "count", vjust = 1.5, colour = "white")
table1 <- table(ifelse(data$fraud == 0, 'Safe', 'Fraud')) %>% rbind(c('8.74%','91.26%')) %>% as.data.frame() %>% tableGrob()
grid.arrange(arrangeGrob(plot1, table1, ncol=2))
```
```{r, echo = F, cache = T}
# plot distribution of the predictor variables
vars <- data %>% select(-fraud)
par(mfrow = c(3, 3))
for (col in colnames(vars)) {
hist(vars[[col]], main = col, xlab = '', col = "skyblue", border = "black", breaks = 10)}
mtext('Distribution of the Predictor Variables', side = 3, line = -1.2, outer = TRUE)
```
```{r, echo = F}
# dimension of the data set
c('Observation','Variable') %>% rbind(dim(data)) %>% knitr::kable()
```
```{r}
# Check missing values
data %>% is.na() %>% sum()
```
- Data Description
The "Credit Card Fraud" dataset, sourced from Kaggle and published by
Dhanush Narayanan R, comprises 1,000,000 observations and 8 variables.
It includes continuous variables like 'distance_from_home' and
'distance_from_last_transaction', as well as categorical variables such
as 'repeat_retailer', 'used_chip', and 'fraud', which are represented
numerically(1 means Yes/0 means No). Our Exploratory Data Analysis(EDA)
revealed no missing values, but highlighted an imbalanced response
variable. To address this and the varying scales of continuous
variables, we will apply over-sampling techniques like SMOTE and
normalization before training our models, ensuring robustness,
especially for scale-sensitive algorithms like SVM.
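Concretely, the normalization referred to here is min-max scaling, which maps each continuous feature to $[0, 1]$ (matching the normalization code in the next section):
$$x' = \frac{x - \min(x)}{\max(x) - \min(x)}$$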
## Data Preparation
The dataset was divided into training (60%), validation (20%), and
testing (20%) subsets, stratified on the 'fraud' variable so that each
subset preserves the class proportions of the original dataset. The
validation set is particularly crucial for our analysis using the
**xgb.train()** function in R, whose **watchlist** parameter takes
both the training and validation sets. This parameter allows us to
monitor log-loss on both sets during training,
aiding in the prevention of overfitting. In the data description
section, we noted the varying scales of the continuous variables. Given our
intention to use the SVM algorithm, which is sensitive to scale
differences, we normalize each dataset separately to enhance
model performance.
However, our Exploratory Data Analysis revealed an imbalance in the
'fraud' variable, which could hurt model performance on the
minority class. To address this, we apply SMOTE (Synthetic Minority
Over-sampling Technique) to the training set. SMOTE generates
additional synthetic observations for the minority class, alleviating the
imbalance. A downside of SMOTE is that it may create
instances in noisy regions, producing observations
that are not representative of the minority class and may degrade
classification performance. Hence, we aim for a ratio of roughly 2:1
between the majority and minority classes rather than a perfect 1:1
ratio, which would introduce too much synthetic data and potentially
harm classification performance; with DMwR's default under-sampling the
realized ratio works out to about 1.6:1, as shown below. This more
balanced training set should improve model
performance on the minority class.
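As a concrete check of this arithmetic (a sketch assuming DMwR's documented behavior: `perc.over` sets how many synthetic minority cases are generated per original case, and the default `perc.under = 200` keeps two majority cases per synthetic case):
```{r, eval = F}
# stratified 60% training split: 547,609 safe vs 52,391 fraud
minority <- 52391; majority <- 547609
# target: grow the minority class to half the majority count
perc.over <- round(((0.5 * majority - minority) / minority) * 100, 0)  # 423
synthetic <- (perc.over %/% 100) * minority  # 4 synthetic per case = 209,564
new.minority <- minority + synthetic         # 261,955 (38.46%)
kept.majority <- 2 * synthetic               # perc.under = 200 -> 419,128 (61.54%)
kept.majority / new.minority                 # ~1.6:1, 681,083 rows in total
```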
```{r, cache = T, echo = F}
# set seed for reproducibility
set.seed(131)
# split the data
data.split <- data %>%
initial_split(prop = 0.6, strata = fraud)
test.split <- initial_split(testing(data.split), prop = 0.5, strata = fraud)
data.train <- training(data.split)
data.val <- training(test.split)
data.test <- testing(test.split)
# convert discrete variables to factors (required by the SMOTE function)
smote.data <- data.train %>% select(1:3) %>%
cbind('repeat_retailer' = as.factor(data.train$repeat_retailer)) %>%
cbind('used_chip' = as.factor(data.train$used_chip)) %>%
cbind('used_pin_number' = as.factor(data.train$used_pin_number)) %>%
cbind('online_order' = as.factor(data.train$online_order)) %>%
cbind('fraud' = as.factor(data.train$fraud))
# compute perc.over so the synthetic minority reaches half the majority
# (realized ratio after DMwR's default under-sampling is about 1.6:1)
perc.over <- round(((0.5 * 547609 - 52391) / 52391) * 100, 0)
# apply SMOTE to the imbalanced training data
smote <- SMOTE(fraud ~ ., smote.data, perc.over = perc.over, k = 5)
# min-max normalize each data set for SVM
norm.train <- as.data.frame(lapply(smote, function(x) {
if (is.numeric(x)) (x - min(x)) / (max(x) - min(x))
else x
}))
norm.val <- as.data.frame(lapply(data.val, function(x) {
if (is.numeric(x)) (x - min(x)) / (max(x) - min(x))
else x
}))
norm.test <- as.data.frame(lapply(data.test, function(x) {
if (is.numeric(x)) (x - min(x)) / (max(x) - min(x))
else x
}))
```
```{r, echo = F}
# plot distribution of the smote response
plot2 <- ggplot(smote, aes(x = ifelse(smote$fraud == 0, 'Safe', 'Fraud'))) +
geom_bar(fill = c('cyan','cornflowerblue')) +
labs(title="Distribution of Fraud in SMOTE Training Dataset", x="Fraud Class", y="Count") +
theme(plot.title = element_text(hjust = 0.5)) +
geom_text(aes(label = after_stat(count)), stat = "count", vjust = 1.5, colour = "white")
table2 <- table(ifelse(smote$fraud == 0, 'Safe', 'Fraud')) %>% rbind(c('38.46%','61.54%')) %>% as.data.frame() %>% tableGrob()
grid.arrange(arrangeGrob(plot2, table2, ncol=2))
```
## Models Training and Outcomes
**Main Model: XGBoost**
- XGBoost Model Training and Parameter Tuning
In the model training process, we use the watchlist feature,
passing in both the training and validation datasets. This approach
enables continuous monitoring of the model's performance on both sets,
serving as a safeguard against overfitting. Additionally, we
set the early_stopping_rounds parameter to halt training if
there is no improvement in the model's validation performance for 50
consecutive rounds. This strategy is effective in determining the
optimal number of boosting rounds: in our case, the model identified 353
as the best number, indicating that further training beyond this point
led to overfitting. This early stopping mechanism is a critical component
of fine-tuning, ensuring that the model achieves robust performance
without overfitting to the training data.
```{r, echo = F, message = F}
# convert categorical variables to numeric (required by XGBoost)
xgb.train.data <- smote %>% select(1:3) %>%
cbind('repeat_retailer' = as.numeric(smote$repeat_retailer) - 1) %>%
cbind('used_chip' = as.numeric(smote$used_chip) - 1) %>%
cbind('used_pin_number' = as.numeric(smote$used_pin_number) - 1) %>%
cbind('online_order' = as.numeric(smote$online_order) - 1) %>%
cbind('fraud' = as.numeric(smote$fraud) - 1)
# define predictor and response variables in training set
# NOTE: XGBoost requires matrix input
xgb.train.x <- data.matrix(xgb.train.data %>% select(-fraud))
xgb.train.y <- xgb.train.data %>% pull(fraud)
# define predictor and response variables in validation set
xgb.val.x <- data.matrix(data.val %>% select(-fraud))
xgb.val.y <- data.val %>% pull(fraud)
# define predictor and response variables in testing set
xgb.test.x <- data.matrix(data.test %>% select(-fraud))
xgb.test.y <- data.test %>% pull(fraud)
# define xgb.DMatrix: a specialized data structure xgboost uses for efficiency
xgb.train <- xgb.DMatrix(data = xgb.train.x, label = xgb.train.y)
xgb.val <- xgb.DMatrix(data = xgb.val.x, label = xgb.val.y)
xgb.test <- xgb.DMatrix(data = xgb.test.x, label = xgb.test.y)
# define watchlist to monitor training process & prevent overfitting
watchlist = list(train = xgb.train, validation = xgb.val)
# define params
params <- list(
objective = "binary:logistic", # predict the probability of fraud
eta = 0.3 # learning rate: xgboost's default; smaller values can further curb overfitting
)
# fit the XGBoost model, monitoring train/validation log-loss at each round
set.seed(131)
model <- xgb.train(params = params,
data = xgb.train, # training data
max.depth = 3, # size of each tree: a shallow depth of 3 helps prevent overfitting
watchlist = watchlist, # track performance on train/validation
nrounds = 500, # boosting iterations: more observations warrant more rounds
early_stopping_rounds = 50, # stop if validation loss stalls for 50 rounds (~10% of nrounds)
verbose = 0) # suppress per-round output
```
- Final XGBoost Model Result and Presentation
Our final XGBoost model, fitted with the optimal number of boosting
rounds, offers insight into feature importance. As a tree-based
method, it highlights the three most influential features in fraud
detection. The most important is the ratio of the transaction's
purchase price to the median purchase price, suggesting that
transactions whose amounts deviate sharply from the median are
more likely to be fraudulent. Additionally, the distance from home and
the distance from the last transaction location
are two nearly equally important features. Transactions occurring at
locations far from the usual transaction points (home or the previous
transaction location) are also more likely to be identified as
fraudulent.
Evaluating the model's performance on the testing dataset using the
ROC curve and the confusion matrix, we observed an impressive AUC of
0.9891. This high score indicates the model's excellent capability in
classifying the testing data and its strong generalization to new,
unseen data. Building on these insights, our next step is to train
a Support Vector Machine (SVM) and compare its performance to that of
the XGBoost model, to further validate the effectiveness of our chosen
approach to fraud detection.
```{r, echo = F, message = F, warning = F}
set.seed(131)
# Define final model
# The argument verbose = 0 tells R not to display the training and testing error for each round.
final <- xgboost(params = params, data = xgb.train, max.depth = 3, nrounds = 353, verbose = 0)
### Feature Importance
importance <- xgb.importance(feature_names = colnames(xgb.train.x), model = final)
# plot 3 most important features
importance_df <- as.data.frame(importance)[1:3,]
ggplot(importance_df, aes(x = reorder(Feature, Gain), y = Gain)) +
geom_col() +
coord_flip() +
theme_minimal() +
theme(
# axis.text.y = element_text(size = 8, angle = 45, hjust = 1),
axis.title.y = element_blank(),
axis.title.x = element_text(size = 10)) +
labs(title = "Feature Importance", x = "Gain")
```
```{r, echo = F, message = F, warning = F}
# Use model to make predictions on test data
pred.y <- predict(final, xgb.test.x)
# Label test data according to the predicted probability
pred.label <- ifelse(pred.y > 0.5, 1, 0)
# Confusion Matrix
confusion.matrix <- table(Predicted = pred.label, Actual = xgb.test.y)
# AUC-ROC
roc <- roc(xgb.test.y, pred.label)
auc <- auc(roc)
# Visualization
ggroc(roc) +
labs(title = "ROC Curve: XGBoost", x = "False Positive Rate", y = "True Positive Rate") +
annotate("text", x = 0.2, y = 0.8, label = paste("AUC =", round(auc, 4)))
```
**Confusion Matrix**

| Predicted / Actual | 0 | 1 |
|:---------:|:-------------------------:|:-------------------------:|
| 0 | `r confusion.matrix[1,1]` | `r confusion.matrix[1,2]` |
| 1 | `r confusion.matrix[2,1]` | `r confusion.matrix[2,2]` |
**Comparison Model: Support Vector Machine**
- SVM Training and Parameter Tuning
In this section, I train a Support Vector Machine (SVM)
to classify fraud and measure its performance through AUC. Given the SVM's
sensitivity to data scaling, I use the normalized data to ensure each
feature contributes uniformly. I chose a radial (RBF) kernel
for the SVM since it fits the non-linear
relationship between the predictors and the response variable in our
dataset. Because the full normalized training set is large, with 681,083
observations, I train the SVM on a randomly sampled portion of it.
This strategy is efficient because SVMs rely primarily on support vectors,
a subset of the data, to determine the decision boundary. Such
sampling captures the essential model-building information
while significantly lowering the computational
demands, which allows me to train the model on my laptop. It also
accelerates hyperparameter tuning and model validation, allowing more
efficient experimentation.
To optimize SVM performance, I use the tune() function with
10-fold cross-validation to find the best cost and gamma
values. To be more specific, cost is the SVM's
regularization parameter: it controls the trade-off between
a low training error and a low testing error, so a small cost
can underfit the training data, while a large cost may overfit.
The gamma parameter defines how far the influence of a single
training example reaches. With a large gamma, the
decision boundary is shaped mostly by nearby support vectors,
leading to a more complex decision boundary that is prone to
overfitting; with a small gamma, the decision boundary is smoother,
which aids generalization but can also lead to underfitting.
The cross-validation process not only identifies these optimal
parameters but also yields a model
fine-tuned with them, as shown in the table below.
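For reference, the radial kernel underlying this non-linear boundary is
$$K(x, x') = \exp\!\left(-\gamma\, \lVert x - x' \rVert^2\right),$$
so a large $\gamma$ makes the kernel decay quickly with distance, which is exactly why each support vector's influence becomes more local and the boundary more complex.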
```{r echo=F, message=FALSE, cache = FALSE}
# Train SVM
set.seed(131)
svm.train <- norm.train %>% select(1:3) %>%
cbind('repeat_retailer' = as.numeric(smote$repeat_retailer) - 1) %>%
cbind('used_chip' = as.numeric(smote$used_chip) - 1) %>%
cbind('used_pin_number' = as.numeric(smote$used_pin_number) - 1) %>%
cbind('online_order' = as.numeric(smote$online_order) - 1) %>%
cbind('fraud' = smote$fraud)
# sample training data due to limited computational power
sample_indices <- sample(1:nrow(svm.train),
size = 0.01 * nrow(svm.train))
svm.sample <- svm.train[sample_indices, ]
# find optimal parameters
tune.out <- tune(svm, fraud ~., data = svm.sample,
kernel = 'radial', scale = FALSE,
ranges = list(cost=c(1,5,100,1000), gamma=c(0.1,1,5,10)))
# show best parameters
summary(tune.out)$"best.parameters" %>% knitr::kable()
# test data set for SVM
svm.test = data.frame(norm.test %>% select(-fraud),
fraud = norm.test %>% pull(fraud) %>% as.factor())
# predict on the test set
svm.pred = predict(tune.out$best.model, svm.test)
# confusion matrix
# table(Predicted = svm.pred, Actual = svm.test$fraud)
# ROC-AUC
roc2 <- roc(svm.test$fraud, svm.pred %>% as.numeric())
auc2 <- auc(roc2)
# Visualization
ggroc(roc2) +
labs(title = "ROC Curve: SVM", x = "False Positive Rate",
y = "True Positive Rate") +
annotate("text", x = 0.2, y = 0.8, label = paste("AUC =", round(auc2, 4)))
```
## Conclusions and Future Directions for Improvement
Reviewing the ROC curves for both the XGBoost and SVM models, it is
evident that XGBoost outperforms the SVM on this dataset. XGBoost's
advantages extend beyond accuracy. As a tree-based method, it
demonstrates robustness, showing less sensitivity to data scaling
than the SVM. It also excels in interpretability,
particularly in identifying the key features driving the model.
Additionally, XGBoost is computationally more efficient, training
faster than the SVM, which is a substantial advantage in practical
applications. These factors collectively affirm XGBoost as the better
model for this specific dataset.
However, there’s room for improvement in the XGBoost model, especially
regarding its false positive rate. The confusion matrix reveals that the
model occasionally misclassifies safe transactions as fraudulent. This
over-caution can disrupt customer experience in real-world applications,
such as shopping, which potentially harm customer trust. A promising
area for improvement is to adjust the threshold of mapping predicted
probability to the class label. The currently threshold is set at 50%,
where transactions with a fraud probability over this are marked as
fraud, and a lower threshold could reduce false positives. Gaining more
domain knowledge through research in related fields is a valuable
approach for further improvement. With this enhanced understanding, we
can more effectively adjust the threshold used to classify transactions
as fraudulent. Additionally, we can empirically test different
thresholds on the current model, comparing performances to identify the
most effective setting. This practical approach will help us minimize
the false positive rate and thereby improve the classification accuracy.
**Confusion Matrix**

| Predicted / Actual | 0 | 1 |
|:---------:|:-------------------------:|:-------------------------:|
| 0 | `r confusion.matrix[1,1]` | `r confusion.matrix[1,2]` |
| 1 | `r confusion.matrix[2,1]` | `r confusion.matrix[2,2]` |
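As a sketch of that empirical threshold comparison (reusing `pred.y` and `xgb.test.y` from the modeling section; the candidate grid is an illustrative assumption, not a result reported above):
```{r, eval = F}
# sweep candidate thresholds, tracking false positive rate and recall
thresholds <- seq(0.30, 0.90, by = 0.05)
sweep <- t(sapply(thresholds, function(th) {
  lab <- ifelse(pred.y > th, 1, 0)
  c(threshold = th,
    fpr    = sum(lab == 1 & xgb.test.y == 0) / sum(xgb.test.y == 0),
    recall = sum(lab == 1 & xgb.test.y == 1) / sum(xgb.test.y == 1))
}))
# inspect the trade-off and pick the threshold meeting a business cap on FPR
sweep
```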
## Appendix
```{r, eval = F}
# load the packages
library(tidymodels)
library(xgboost)
library(tidyverse)
library(pROC)
library(ggplot2)
library(gridExtra)
library(DMwR)
library(caret)
library(e1071)
# load the data
data <- read_csv("card_transdata.csv")
# plot distribution of the response
plot1 <- ggplot(data, aes(x = ifelse(data$fraud == 0, 'Safe', 'Fraud'))) +
geom_bar(fill = c('cyan','cornflowerblue')) +
labs(title="Distribution of Fraud in the Dataset",
x="Fraud Class", y="Count") +
theme(plot.title = element_text(hjust = 0.5)) +
geom_text(aes(label = after_stat(count)), stat = "count",
vjust = 1.5, colour = "white")
table1 <- table(ifelse(data$fraud == 0, 'Safe', 'Fraud')) %>%
rbind(c('8.74%','91.26%')) %>% as.data.frame() %>% tableGrob()
grid.arrange(arrangeGrob(plot1, table1, ncol=2))
# plot distribution of the predictor variables
vars <- data %>% select(-fraud)
par(mfrow = c(3, 3))
for (col in colnames(vars)) {
hist(vars[[col]], main = col, xlab = '', col = "skyblue",
border = "black", breaks = 10)}
mtext('Distribution of the Predictor Variables',
side = 3, line = -1.2, outer = TRUE)
# dimension of the data set
c('Observation','Variable') %>% rbind(dim(data)) %>% knitr::kable()
# Check missing values
data %>% is.na() %>% sum()
# set seed for reproducibility
set.seed(131)
# split the data
data.split <- data %>%
initial_split(prop = 0.6, strata = fraud)
test.split <- initial_split(testing(data.split), prop = 0.5, strata = fraud)
data.train <- training(data.split)
data.val <- training(test.split)
data.test <- testing(test.split)
# convert discrete variables to factors (required by the SMOTE function)
smote.data <- data.train %>% select(1:3) %>%
cbind('repeat_retailer' = as.factor(data.train$repeat_retailer)) %>%
cbind('used_chip' = as.factor(data.train$used_chip)) %>%
cbind('used_pin_number' = as.factor(data.train$used_pin_number)) %>%
cbind('online_order' = as.factor(data.train$online_order)) %>%
cbind('fraud' = as.factor(data.train$fraud))
# compute perc.over so the synthetic minority reaches half the majority
# (realized ratio after DMwR's default under-sampling is about 1.6:1)
perc.over <- round(((0.5 * 547609 - 52391) / 52391) * 100, 0)
# apply SMOTE to the imbalanced training data
smote <- SMOTE(fraud ~ ., smote.data, perc.over = perc.over, k = 5)
# normalize each data set for SVM
norm.train <- as.data.frame(lapply(smote, function(x) {
if (is.numeric(x)) (x - min(x)) / (max(x) - min(x))
else x}))
norm.val <- as.data.frame(lapply(data.val, function(x) {
if (is.numeric(x)) (x - min(x)) / (max(x) - min(x))
else x}))
norm.test <- as.data.frame(lapply(data.test, function(x) {
if (is.numeric(x)) (x - min(x)) / (max(x) - min(x))
else x}))
# plot distribution of the smote response
plot2 <- ggplot(smote, aes(x = ifelse(smote$fraud == 0, 'Safe', 'Fraud'))) +
geom_bar(fill = c('cyan','cornflowerblue')) +
labs(title="Distribution of Fraud in SMOTE Training Dataset",
x="Fraud Class", y="Count") +
theme(plot.title = element_text(hjust = 0.5)) +
geom_text(aes(label = after_stat(count)), stat = "count", vjust = 1.5,
colour = "white")
table2 <- table(ifelse(smote$fraud == 0, 'Safe', 'Fraud')) %>%
rbind(c('38.46%','61.54%')) %>% as.data.frame() %>% tableGrob()
grid.arrange(arrangeGrob(plot2, table2, ncol=2))
# convert categorical variables to numeric (required by XGBoost)
xgb.train.data <- smote %>% select(1:3) %>%
cbind('repeat_retailer' = as.numeric(smote$repeat_retailer) - 1) %>%
cbind('used_chip' = as.numeric(smote$used_chip) - 1) %>%
cbind('used_pin_number' = as.numeric(smote$used_pin_number) - 1) %>%
cbind('online_order' = as.numeric(smote$online_order) - 1) %>%
cbind('fraud' = as.numeric(smote$fraud) - 1)
# define predictor and response variables in training set
# NOTE: XGBoost requires matrix input
xgb.train.x <- data.matrix(xgb.train.data %>% select(-fraud))
xgb.train.y <- xgb.train.data %>% pull(fraud)
# define predictor and response variables in validation set
xgb.val.x <- data.matrix(data.val %>% select(-fraud))
xgb.val.y <- data.val %>% pull(fraud)
# define predictor and response variables in testing set
xgb.test.x <- data.matrix(data.test %>% select(-fraud))
xgb.test.y <- data.test %>% pull(fraud)
# define xgb.DMatrix: a specialized data structure xgboost uses for efficiency
xgb.train <- xgb.DMatrix(data = xgb.train.x, label = xgb.train.y)
xgb.val <- xgb.DMatrix(data = xgb.val.x, label = xgb.val.y)
xgb.test <- xgb.DMatrix(data = xgb.test.x, label = xgb.test.y)
# define watchlist to monitor training process & prevent overfitting
watchlist = list(train = xgb.train, validation = xgb.val)
# define params
params <- list(
objective = "binary:logistic", # predict the probability of fraud
eta = 0.3 # learning rate: xgboost's default; smaller values can further curb overfitting
)
# fit the XGBoost model, monitoring train/validation log-loss at each round
set.seed(131)
model <- xgb.train(params = params,
data = xgb.train, # training data
# size of each tree: a shallow depth of 3 helps prevent overfitting
max.depth = 3,
# track performance on train/validation
watchlist = watchlist,
# boosting iterations: more observations warrant more rounds
nrounds = 500,
# stop if validation loss stalls for 50 rounds (~10% of nrounds)
early_stopping_rounds = 50,
# suppress per-round output
verbose = 0)
set.seed(131)
# Define final model
# verbose = 0 tells R not to display the training and testing error for each round.
final <- xgboost(params = params, data = xgb.train,
max.depth = 3, nrounds = 353, verbose = 0)
### Feature Importance
importance <- xgb.importance(feature_names = colnames(xgb.train.x), model = final)
# plot 3 most important features
importance_df <- as.data.frame(importance)[1:3,]
feature.plot <- ggplot(importance_df, aes(x = reorder(Feature, Gain), y = Gain)) +
geom_col() +
coord_flip() +
theme_minimal() +
theme(axis.title.y = element_blank(),
axis.title.x = element_text(size = 10)) +
labs(title = "Feature Importance", x = "Gain")
# Use model to make predictions on test data
pred.y <- predict(final, xgb.test.x)
# Label test data according to the predicted probability
pred.label <- ifelse(pred.y > 0.5, 1, 0)
# Confusion Matrix
confusion.matrix <- table(Predicted = pred.label, Actual = xgb.test.y)
# AUC-ROC
roc <- roc(xgb.test.y, pred.label)
auc <- auc(roc)
# Visualization
ggroc(roc) +
labs(title = "ROC Curve: XGBoost", x = "False Positive Rate",
y = "True Positive Rate") +
annotate("text", x = 0.2, y = 0.8, label = paste("AUC =", round(auc, 4)))
# XGBoost Confusion Matrix
confusion.matrix
# scatter plots of each predictor against fraud to look for linear relationships
par(mfrow = c(3, 3))
for (col in colnames(vars)) {
plot(data$fraud, vars[[col]], main = col, xlab = '', ylab = '',
col = "skyblue") }
# Train SVM
set.seed(131)
svm.train <- norm.train %>% select(1:3) %>%
cbind('repeat_retailer' = as.numeric(smote$repeat_retailer) - 1) %>%
cbind('used_chip' = as.numeric(smote$used_chip) - 1) %>%
cbind('used_pin_number' = as.numeric(smote$used_pin_number) - 1) %>%
cbind('online_order' = as.numeric(smote$online_order) - 1) %>%
cbind('fraud' = smote$fraud)
# sample training data due to limited computation power
sample_indices <- sample(1:nrow(svm.train),
size = 0.01 * nrow(svm.train))
svm.sample <- svm.train[sample_indices, ]
# find optimal parameters
tune.out <- tune(svm, fraud ~., data = svm.sample,
kernel = 'radial', scale = FALSE,
ranges = list(cost=c(1,5,100,1000),gamma=c(0.1,1,5,10)))
# show best parameters
summary(tune.out)$"best.parameters" %>% knitr::kable()
# test data set for SVM
svm.test = data.frame(norm.test %>% select(-fraud),
fraud = norm.test %>% pull(fraud) %>% as.factor())
# predict on the test set
svm.pred = predict(tune.out$best.model, svm.test)
# confusion matrix
table(Predicted = svm.pred, Actual = svm.test$fraud)
# ROC-AUC
roc2 <- roc(svm.test$fraud, svm.pred %>% as.numeric())
auc2 <- auc(roc2)
# Visualization
ggroc(roc2) +
labs(title = "ROC Curve: SVM", x = "False Positive Rate",
y = "True Positive Rate") +
annotate("text", x = 0.2, y = 0.8, label = paste("AUC =", round(auc2, 4)))
```