Credit Card Fraud Detection

The dataset comprises transactions made by European credit card holders over a span of two days in September 2013. Due to confidentiality issues, Kaggle does not provide the original features or further background information about the data. It contains only numerical input variables, which have been anonymized by applying a PCA transformation.

Since V1 to V28 have already gone through PCA, we also scaled the Time and Amount features, which will be used during training. There are no missing values in the dataset, so we do not need to worry about missing-value imputation.
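A minimal sketch of this preprocessing step, assuming the standard Kaggle creditcard.csv file (the choice of RobustScaler is our assumption; the scaler actually used is not named above):

```python
import pandas as pd
from sklearn.preprocessing import RobustScaler  # assumption: any standard scaler would do

df = pd.read_csv("creditcard.csv")  # assumption: the Kaggle dataset file

# V1..V28 are already PCA components, so only Time and Amount need scaling.
scaler = RobustScaler()
df[["Time", "Amount"]] = scaler.fit_transform(df[["Time", "Amount"]])

# Sanity check: the dataset ships complete, so no imputation is needed.
assert df.isnull().sum().sum() == 0
```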

The given dataset is highly imbalanced: out of the 284,807 records, only 492 are fraudulent transactions. Training on such a highly imbalanced dataset is not advised, because ML models trained this way have a natural tendency to pick up the patterns of the majority class and ignore the minority one.

Since the majority class represents about 99.8% of the data, we cannot rely on accuracy as a performance metric. Even if our model gives us an accuracy of 99.9%, there is a chance that it simply classified all transactions as non-fraud.

So we focus on the confusion matrix for evaluation, which in turn gives us Precision, Recall and the F1 Score.

Confusion Matrix

True Positive (TP): Model correctly predicts the positive class.

True Negative (TN): Model correctly predicts the negative class.

False Positive (FP): Model incorrectly predicts the positive class.

False Negative (FN): Model incorrectly predicts the negative class.
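As a quick illustration of how these four counts turn into the metrics used below, here is a minimal sketch with toy labels (the label arrays are placeholders, not our data):

```python
from sklearn.metrics import confusion_matrix

# Toy labels for illustration; 1 = fraud (positive class), 0 = non-fraud.
y_true = [0, 0, 0, 1, 1, 0, 1, 0]
y_pred = [0, 0, 1, 1, 0, 0, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)  # of everything flagged as fraud, how much really was
recall = tp / (tp + fn)     # of all actual fraud, how much we caught
f1 = 2 * precision * recall / (precision + recall)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```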

Although it is not advisable, we trained a model on the unbalanced data just to see how it turns out, using a Decision Tree Classifier and Naïve Bayes.
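A rough sketch of that baseline, continuing from the preprocessing snippet above (GaussianNB stands in for "Naïve Bayes", and the 70:30 stratified split mirrors the later sections; both are our assumptions):

```python
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import classification_report

X = df.drop(columns=["Class"])
y = df["Class"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)

for model in (DecisionTreeClassifier(random_state=42), GaussianNB()):
    model.fit(X_train, y_train)
    print(type(model).__name__)
    print(classification_report(y_test, model.predict(X_test), digits=3))
```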

The results look pretty good if we evaluate the model on metrics alone. The reason is that the training features were able to separate the records distinctly in most cases.

Even though the results look good, this model is not reliable: when it comes across a record that is not clearly distinct, it will prefer the majority class in most cases.

In order to deal with data imbalance we tried two approaches:

1. Random Undersampling

2. SMOTE

We concatenate the fraud data with randomly undersampled non-fraud data, so that the ratio of the Fraud and Non-Fraud classes in the training set is 1:1, and split the result for training and testing.

After a 70:30 train:test split with 1:1 undersampling, our training set has 688 records.
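Continuing the sketch, the undersampled split can be built with plain pandas (the 984-row total comes from the dataset's 492 fraud records):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

fraud = df[df["Class"] == 1]  # all 492 fraud records
non_fraud = df[df["Class"] == 0].sample(n=len(fraud), random_state=42)

# Concatenate and shuffle so the classes are interleaved.
balanced = pd.concat([fraud, non_fraud]).sample(frac=1, random_state=42)
X_us = balanced.drop(columns=["Class"])
y_us = balanced["Class"]

# 984 rows total; a 70:30 split leaves ~688 rows for training.
X_train_us, X_test_us, y_train_us, y_test_us = train_test_split(
    X_us, y_us, test_size=0.3, stratify=y_us, random_state=42)
```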

The algorithms we tried for classification (sketched in code after the list):

1. Logistic Regression

2. Random Forest Classifier

3. SVM

4. KNN
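A sketch of fitting the four candidates on the undersampled set. Scoring them against the original imbalanced test split, rather than the tiny balanced one, is our assumption; it is the setup that would explain the very low precision reported below:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import precision_score, recall_score

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(random_state=42),
    "SVM": SVC(),
    "KNN": KNeighborsClassifier(),
}
for name, model in models.items():
    model.fit(X_train_us, y_train_us)
    pred = model.predict(X_test)  # X_test: imbalanced split from the baseline sketch
    print(name,
          "precision:", round(precision_score(y_test, pred), 3),
          "recall:", round(recall_score(y_test, pred), 3))
```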

Random Undersampling 1:1 ratio

The Random Forest Classifier gives us a very high recall of 0.96, but the precision is just 0.05, so the F1 score takes the hit.

The precision is so low because we used just 344 non-fraud records for training, so we can try increasing the number of non-fraud records. Since we also don't want to make the dataset highly imbalanced again, we can try a 1:2 ratio for the classes.

The precision improves, but it is still very low. Also, the data is no longer fully balanced.
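The 1:2 variant is easy to express with imbalanced-learn's RandomUnderSampler (our tooling assumption; plain pandas sampling as above works too):

```python
from imblearn.under_sampling import RandomUnderSampler

# sampling_strategy=0.5 keeps one fraud record for every two non-fraud records.
rus = RandomUnderSampler(sampling_strategy=0.5, random_state=42)
X_train_12, y_train_12 = rus.fit_resample(X_train, y_train)
print(y_train_12.value_counts())  # ~2 non-fraud rows per fraud row
```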

SMOTE stands for Synthetic Minority Oversampling Technique. It is a statistical technique for increasing the number of minority-class records in the dataset. SMOTE generates new instances of the minority class from the existing ones; it does not change the number of majority cases.

The new instances are not just copies of existing minority cases. For each minority sample, the algorithm picks one of its nearest minority-class neighbors and interpolates between the two feature vectors to generate a new synthetic example.

SMOTE takes the entire training dataset as input, but it increases the share of only the minority (fraud) cases. We have set the sampling-strategy parameter to 1, which makes the ratio of fraud to non-fraud data 1:1, i.e. our training set is 50% fraud and 50% non-fraud and hence fully balanced. After SMOTE our training set comprises roughly 398K records.

NOTE: We have applied SMOTE to the training set only.
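A sketch of that step with imbalanced-learn, applied only to the training split as the note stresses:

```python
from imblearn.over_sampling import SMOTE

# sampling_strategy=1 grows the fraud class until the two classes match 1:1.
smote = SMOTE(sampling_strategy=1.0, random_state=42)
X_train_sm, y_train_sm = smote.fit_resample(X_train, y_train)
print(len(X_train_sm))  # ~398K rows: ~199K per class after oversampling
```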

Once again we used the same algorithms:

1. Logistic Regression

2. Random Forest Classifier

3. SVM

4. KNN

Confusion Matrix for SMOTE algorithm

With an F1 Score of 0.84, it is clear that the Random Forest Classifier outperforms the other algorithms by a large margin.

Since our training dataset now has enough data from both classes, precision increases, but recall is lower than with the undersampling model. The undersampling model only achieves such a high recall because it misclassifies a lot of records as fraud, which inflates recall. So even though the recall is lower here, the SMOTE model is much more reliable due to its high precision.

To make sure the model is not overfitting, we performed K-fold cross-validation. The F1 score during cross-validation stays consistently above 0.80, with an average of 0.82. This consistency indicates that the model is not overfitted to the training data.
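A sketch of that check; wrapping SMOTE and the model in an imbalanced-learn Pipeline re-applies SMOTE inside each fold so the validation folds stay untouched (the 5-fold choice is our assumption):

```python
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

pipe = Pipeline([
    ("smote", SMOTE(sampling_strategy=1.0, random_state=42)),
    ("rf", RandomForestClassifier(random_state=42)),
])
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(pipe, X_train, y_train, scoring="f1", cv=cv)
print(scores.round(3), "mean:", scores.mean().round(3))
```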

With cyber crime increasing day by day, it is important for banks to quickly identify fraud and take the necessary steps to minimize the financial damage as soon as possible. A credit card fraud detection system using undersampling and SMOTE (Synthetic Minority Oversampling Technique) aims to identify the fraudulent transactions among those made by card holders.
