Feb to May 2025 · R

Credit Card Fraud
Detection & Behavioral Analysis

Analyzing 1.8 million credit card transactions to detect fraud through unsupervised behavioral clustering and supervised classification, combining K-means and model-based segmentation with Random Forest, Neural Networks, and Deep Learning.

Problem Context & Objective

The rapid digitization of financial services has created unprecedented convenience — and unprecedented fraud exposure. As customers rely on online and mobile payment systems, fraud tactics evolve rapidly, affecting millions of people and billions of dollars globally. Merchants worldwide now face a wider variety of fraud attacks than ever before, with first-party misuse, account takeovers, and triangulation schemes all on the rise.

This project investigates a central question: how do transactional and behavioral factors affect the likelihood of a person being a fraud victim? Rather than treating fraud detection as a black-box classification task, the goal was to understand why certain transactions are fraudulent — and whether unsupervised behavioral segmentation could improve detection beyond traditional feature-based models.

Three research questions guided the analysis: whether merchant type, city population size, and transaction hour each independently predict fraud likelihood, and how combining all three compares with using each in isolation.

Research Questions

| RQ | Question | Predictor (X) | Outcome (Y) |
|----|----------|---------------|-------------|
| RQ1 | Do certain merchant types have higher fraud incidence? | category | is_fraud |
| RQ2 | Are merchants in more populated cities more fraud-prone? | city_pop | is_fraud |
| RQ3 | Does transaction hour affect fraud probability? | transaction_hour | is_fraud |
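Each research question maps to a one-predictor logistic regression, and all three predictors combine into a single specification. A minimal sketch in R, run on synthetic stand-in data rather than the real table (column names follow the dataset; the late-night fraud signal is simulated so the snippet is self-contained):

```r
# Synthetic stand-in for the transaction table -- not the real dataset.
set.seed(42)
n <- 20000
txns <- data.frame(
  category         = sample(c("grocery_pos", "shopping_net", "gas_transport"),
                            n, replace = TRUE),
  city_pop         = sample(500:500000, n, replace = TRUE),
  transaction_hour = sample(0:23, n, replace = TRUE)
)
# simulate the late-night fraud pattern reported in the EDA
txns$is_fraud <- rbinom(n, 1, plogis(-5 + 2 * (txns$transaction_hour >= 22)))

# one model per research question, then the combined specification
rq1 <- glm(is_fraud ~ category,                  data = txns, family = binomial)
rq2 <- glm(is_fraud ~ log1p(city_pop),           data = txns, family = binomial)
rq3 <- glm(is_fraud ~ I(transaction_hour >= 22), data = txns, family = binomial)
combined <- glm(is_fraud ~ category + log1p(city_pop) +
                  I(transaction_hour >= 22), data = txns, family = binomial)
```

Comparing the deviance or AUC of `combined` against the single-predictor fits is one way to answer the "in combination vs. in isolation" part of the question.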

Data Overview

The dataset contains simulated credit card transactions spanning 2019–2020, drawn from a publicly available Kaggle source. It covers 693 distinct merchants and 999 unique cardholders across the United States, with over 1.8 million transaction records.

| Feature Group | Example Variables | Type |
|---------------|-------------------|------|
| Transaction Details | amt, trans_date_trans_time, trans_num | Numeric / DateTime |
| Merchant Info | merchant, category, merch_lat, merch_long | Categorical / Numeric |
| Cardholder Demographics | first, last, gender, dob, job | Categorical |
| Location | city, state, zip, lat, long, city_pop | Mixed |
| Engineered Features | transaction_hour, customer_age | Numeric |
| Target | is_fraud | Binary (0/1) |

The dataset is highly imbalanced — fraudulent transactions comprise approximately 0.5% of all records. This imbalance was a central consideration throughout model design and evaluation, with sensitivity (true positive rate) prioritized over overall accuracy.
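The arithmetic makes the metric choice concrete. On a hypothetical confusion matrix at roughly this prevalence (all counts invented for illustration, not project results):

```r
# Hypothetical confusion counts at ~0.5% fraud prevalence
# (20,000 transactions, 100 actual frauds) -- illustrative only.
tp <- 85; fn <- 15; fp <- 400; tn <- 19500

accuracy    <- (tp + tn) / (tp + tn + fp + fn)   # ~0.979
sensitivity <- tp / (tp + fn)                    # 0.85: share of fraud caught
precision   <- tp / (tp + fp)                    # ~0.175: share of alerts real

# a useless "always legit" model still scores ~99.5% accuracy
naive_accuracy <- (tn + fp) / (tp + tn + fp + fn)
```

The do-nothing baseline out-scores the working model on accuracy while catching zero fraud, which is why sensitivity anchors the evaluation here.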

Data Preparation & Feature Engineering

Two raw transaction datasets were merged and cleaned in R before any analysis. The preparation pipeline ensured consistency, reproducibility, and suitability for both exploratory analysis and modeling.

01
Merge & Deduplicate
Two raw CSV files were combined using bind_rows(). Unnecessary identifier columns were removed, and duplicate records were verified and dropped to retain only unique transactions.
02
Missing Value Imputation
Missing values — scattered across features like age and transaction_day — were imputed using the mice package with predictive mean matching to preserve distributional shape.
03
Type Standardization
Dates were converted from character to date format. Categorical variables were cast as factors. The "fraud_" prefix was removed from merchant names for clean labeling.
04
Feature Engineering
Transaction timestamps were parsed into granular temporal features: transaction_hour, day, month, and year. A customer_age field was derived from cardholder date of birth relative to each transaction date. ZIP codes were standardized with leading zeros.
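The four steps above can be sketched in a few lines. The project used dplyr::bind_rows() and mice (predictive mean matching) for steps 01–02; this base-R sketch substitutes dependency-free equivalents and a toy two-row input standing in for the raw CSVs, with imputation omitted:

```r
# Toy stand-ins for the two raw CSV files (identical rows on purpose,
# so deduplication has something to remove).
raw1 <- data.frame(trans_date_trans_time = "2019-03-02 22:15:00",
                   merchant = "fraud_Kub and Sons", amt = 42.10,
                   dob = "1985-06-01", zip = 501L, stringsAsFactors = FALSE)
raw2 <- raw1

# 01 merge & deduplicate (project: dplyr::bind_rows() + distinct())
txns <- unique(rbind(raw1, raw2))

# 03 type standardization
txns$trans_date_trans_time <- as.POSIXct(txns$trans_date_trans_time, tz = "UTC")
txns$merchant <- sub("^fraud_", "", txns$merchant)   # strip "fraud_" prefix
txns$zip <- sprintf("%05d", txns$zip)                # restore leading zeros

# 04 feature engineering
txns$transaction_hour <- as.integer(format(txns$trans_date_trans_time, "%H"))
txns$customer_age <- as.numeric(difftime(txns$trans_date_trans_time,
                                         as.POSIXct(txns$dob, tz = "UTC"),
                                         units = "days")) %/% 365.25
```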

Exploratory Data Analysis

Before modeling, exploratory analysis revealed consistent behavioral and structural patterns that distinguish fraudulent from legitimate transactions. These patterns directly informed feature selection and the design of clustering scenarios.

Key EDA Findings

| Dimension | Finding | Implication |
|-----------|---------|-------------|
| Class Balance | ~0.5% of transactions are fraudulent | Accuracy is misleading; sensitivity is the right metric |
| Merchant Category | grocery_pos, shopping_net, misc_net have highest fraud counts | Category is a meaningful discriminating feature |
| Transaction Amount | Fraud clusters around mid-range amounts, not extremes | Simple amount thresholds would miss most fraud |
| Time of Day | Fraud peaks sharply at hours 22–23; secondary spike at hours 0–2 | Late-night transactions are disproportionately risky |
| Geography | Some states and lower-population cities show elevated fraud rates | Location provides signal but is not sufficient alone |

The concentration of fraud at late-night hours (22–23) and in specific merchant categories like online shopping suggests that fraudsters exploit timing and digital channel vulnerability — patterns that are learnable by behavioral models.
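The hour-of-day pattern is straightforward to reproduce in R. The sketch below simulates data with the same shape (fraud probability spiked at hours 22–23; rates are illustrative, and the real EDA ran on the 1.8M-row table):

```r
# Fraud rate by transaction hour on simulated data mimicking
# the observed late-night spike.
set.seed(1)
n <- 50000
hour <- sample(0:23, n, replace = TRUE)
is_fraud <- rbinom(n, 1, ifelse(hour %in% c(22, 23), 0.03, 0.003))

rate_by_hour <- tapply(is_fraud, hour, mean)      # fraud rate per hour
peak_hour <- as.integer(names(which.max(rate_by_hour)))
```

The same `tapply` pattern (or a `dplyr` group-by) profiles fraud rate along any categorical dimension, which is how the merchant-category and geography findings were surfaced.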

Clustering Approach

Before supervised modeling, unsupervised clustering was applied to identify latent behavioral structures in the transaction data — without using the is_fraud label. The goal was to assess whether behavioral segmentation could separate high-risk transaction groups from low-risk ones.

Nine scenarios were designed: two clustering methods (K-means and model-based) crossed with four feature subsets, plus a no-clustering baseline. Each scenario tested how different variable combinations affect the ability to isolate fraud-prone clusters.

| Scenario | Features Used | Clustering | Key Observation |
|----------|---------------|------------|-----------------|
| 1 | All except city_pop, transaction_hour | K-means | Fraud spread evenly across clusters (<1.3%) |
| 3 | All except category, transaction_hour | K-means | Cluster 3: 35.49% fraud — strong isolation |
| 5 | All except category, city_pop | K-means | Cluster 3: 35.79% fraud — hour drives separation |
| 7 | All features | K-means | Strong overall performance, high precision |
| 8 | All features | Model-based | Best: highest sensitivity, second-lowest loss |
| 9 | All features | None | Strong baseline — clustering adds incremental value |

(Six representative scenarios of the nine are shown.)

Scenarios 3 and 5 demonstrated that even without merchant category, the remaining behavioral features (city population in Scenario 3, transaction hour in Scenario 5) can isolate a cluster with over a 35% fraud rate while keeping all other clusters below 0.5%. This is a striking signal of temporal and behavioral structure in fraud patterns.
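The clustering step itself follows a standard pattern: standardize the features, fit K-means, then profile fraud rate per cluster. A sketch on simulated data (k = 4, the two-feature subset, and the fraud rates are all illustrative choices, not the project's exact settings):

```r
# Toy data mimicking the hour-driven separation seen in Scenarios 3 and 5.
set.seed(7)
n <- 5000
hour  <- sample(0:23, n, replace = TRUE)
pop   <- exp(rnorm(n, 10, 1))                     # skewed city populations
fraud <- rbinom(n, 1, ifelse(hour >= 22, 0.05, 0.002))

X  <- scale(cbind(hour, log(pop)))                # z-score before K-means
km <- kmeans(X, centers = 4, nstart = 25)

# profile each cluster by its fraud concentration (label NOT used to fit)
fraud_rate <- tapply(fraud, km$cluster, mean)
```

Note that `is_fraud` never enters the clustering itself; it is only used afterwards to profile the clusters, which keeps the segmentation genuinely unsupervised.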

Predictive Modeling

Cluster assignments were used as engineered features in four supervised classification models. Each model was evaluated on accuracy, sensitivity, precision, AUC, and total financial loss — with sensitivity prioritized to reflect the real-world cost of missed fraud.

01
Baseline
Logistic Regression
Linear baseline providing coefficients and class probabilities. Useful for understanding linear relationships but insufficient for the nonlinear complexity of fraud patterns. Consistently underperformed on sensitivity.
02
Best
Random Forest
Ensemble of decision trees using randomForest. Captured nonlinear feature interactions, handled class imbalance effectively, and produced the best balance of high sensitivity and low total financial loss across all scenarios.
03
Neural Network (1 hidden layer)
Single hidden layer with 32 neurons using nnet. Captured nonlinear relationships with moderate performance. Sensitivity competitive with Random Forest in several scenarios, but less consistent.
04
Deep Learning (2 hidden layers)
Two-layer architecture: 32 neurons (layer 1), 16 neurons (layer 2). Delivered high accuracy and strong classification power, particularly in scenarios with full feature sets. Highest accuracy model but slightly lower precision than Random Forest.
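The pipeline pattern behind all four models is the same: fit the unsupervised clusters, append the cluster label as a feature, then train the classifier. The project's best learner was randomForest::randomForest(); the sketch below substitutes a single rpart tree (shipped with R) so it runs without CRAN installs, and the toy data uses a much denser fraud rate than the real 0.5% so the small tree has signal to learn:

```r
library(rpart)   # stand-in learner; the project used randomForest::randomForest()

set.seed(11)
n <- 4000
hour <- sample(0:23, n, replace = TRUE)
amt  <- rlnorm(n, meanlog = 4, sdlog = 1)
# toy labels: far denser fraud than the real 0.5%, for illustration only
is_fraud <- factor(rbinom(n, 1, ifelse(hour >= 22, 0.6, 0.02)))

# unsupervised step: cluster assignment becomes an engineered feature
cl  <- kmeans(scale(cbind(hour, log(amt))), centers = 3, nstart = 10)$cluster
dat <- data.frame(is_fraud, hour, amt, cluster = factor(cl))

fit  <- rpart(is_fraud ~ hour + amt + cluster, data = dat, method = "class")
pred <- predict(fit, dat, type = "class")

# sensitivity (true positive rate), the metric the project prioritizes
tp <- sum(pred == "1" & dat$is_fraud == "1")
fn <- sum(pred == "0" & dat$is_fraud == "1")
sensitivity <- tp / (tp + fn)
```

Swapping `rpart()` for `randomForest()` (or `nnet()`) leaves the surrounding pipeline unchanged, which is what made the nine-scenario comparison tractable.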

Results & Best Model

Sensitivity (true positive rate) across all nine scenarios, showing the best-performing model per scenario:

| Scenario | Sensitivity |
|----------|-------------|
| 1 | 0.9753 |
| 2 | 0.9584 |
| 3 | 0.9702 |
| 4 | 0.9692 |
| 5 | 0.9209 |
| 6 | 0.9820 |
| 7 | 0.9805 |
| 8 ★ | 0.9830 |
| 9 | 0.9820 |

Best scenario (8 of 9): Model-Based Clustering + Random Forest.
| Scenario | Clustering | Model | Sensitivity | Precision | Loss ($) |
|----------|------------|-------|-------------|-----------|----------|
| 1 | K-means | Deep Learning | 0.9753 | 0.0737 | 4,673 |
| 2 | Model-based | Random Forest | 0.9584 | 0.1295 | 3,588 |
| 5 | K-means | Random Forest | 0.9209 | 0.0927 | 14,905 |
| 7 | K-means | Random Forest | 0.9805 | 0.1837 | 2,933 |
| 8 ★ | Model-based | Random Forest | 0.9830 | 0.1734 | 1,638 |
| 9 | None | Deep Learning | 0.9820 | 0.1259 | 808 |

Scenario 8 achieves the best balance across all business-relevant metrics: highest sensitivity (0.9830), strong precision (0.1734), and the second-lowest financial loss ($1,638). While Scenario 9 technically shows a lower raw dollar loss, it does so with substantially lower precision — generating more false alarms that burden fraud investigation teams.

Key Learnings & Reflection

  • 01 Fraud exhibits clear behavioral and temporal structure — it is not random. Patterns in merchant type, transaction hour, and city population are learnable signals, not noise. This makes domain-informed feature engineering essential.
  • 02 Unsupervised clustering adds meaningful value when features with natural behavioral boundaries are included. Scenarios 3 and 5 showed that transaction hour and city population alone can isolate clusters with over 35% fraud concentration.
  • 03 Random Forest consistently outperformed simpler and more complex alternatives on the metrics that matter most for business use. Its ensemble structure handles class imbalance, feature interactions, and noisy data well — without requiring hyperparameter tuning to perform competitively.
  • 04 Model evaluation must be framed around business cost, not just accuracy. High sensitivity with moderate precision is the right trade-off here: missing fraud is far more costly than generating a false alert that gets reviewed and cleared.
  • 05 No hyperparameter tuning was performed due to computational constraints. This means the reported results are conservative — all models ran on default settings, and further optimization would likely improve precision without sacrificing sensitivity.

Recommendations & Next Steps

Production Deployment
Deploy the Scenario 8 Random Forest model as a real-time safeguard, requiring 2-Factor Authentication (2FA) for flagged transactions, particularly those in high-risk clusters, to reduce unauthorized access.
Model Improvement
Perform hyperparameter tuning on Random Forest (mtry, ntree) and explore gradient-boosted alternatives like XGBoost with SMOTE oversampling to improve precision without sacrificing sensitivity.
Feature Engineering
Investigate interaction terms — e.g. category × transaction_hour — and velocity features like transaction frequency per cardholder in rolling time windows.
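A velocity feature of this kind can be prototyped in a few lines of base R. The 24-hour window and the simplified column names (`cc_num`, `trans_time`; the dataset's timestamp column is trans_date_trans_time) are illustrative choices:

```r
# Rolling count of transactions per card in the trailing 24 hours.
txns <- data.frame(
  cc_num = c(1, 1, 1, 2),
  trans_time = as.POSIXct(c("2019-03-01 10:00", "2019-03-01 20:00",
                            "2019-03-03 09:00", "2019-03-01 11:00"),
                          tz = "UTC")
)
txns <- txns[order(txns$cc_num, txns$trans_time), ]

# for each transaction, count same-card transactions in [t - 24h, t]
txns$txn_last_24h <- unname(unlist(lapply(
  split(txns$trans_time, txns$cc_num),
  function(t) sapply(seq_along(t),
                     function(i) sum(t >= t[i] - 86400 & t <= t[i])))))
```

On the full 1.8M-row table a `data.table` rolling join would be the more scalable implementation, but the feature definition is the same.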
Model Maintenance
Fraud patterns shift over time. Build a continuous retraining pipeline with drift detection so the model adapts to evolving fraud tactics without full manual retraining.