Credit Card Fraud
Detection & Behavioral Analysis
Analyzing 1.8 million credit card transactions to detect fraud through unsupervised behavioral clustering and supervised classification, combining K-means and model-based segmentation with Random Forest, Neural Networks, and Deep Learning.
Problem Context & Objective
The rapid digitization of financial services has created unprecedented convenience — and unprecedented fraud exposure. As customers rely on online and mobile payment systems, fraud tactics evolve rapidly, affecting millions of people and billions of dollars globally. Merchants worldwide now face a wider variety of fraud attacks than ever before, with first-party misuse, account takeovers, and triangulation schemes all on the rise.
This project investigates a central question: how do transactional and behavioral factors affect the likelihood of a person being a fraud victim? Rather than treating fraud detection as a black-box classification task, the goal was to understand why certain transactions are fraudulent — and whether unsupervised behavioral segmentation could improve detection beyond traditional feature-based models.
Three research questions guided the analysis: whether merchant type, city population size, and transaction hour each independently predict fraud likelihood, and how combining all three compares with using each predictor in isolation.
Research Questions
| RQ | Question | Predictor (X) | Outcome (Y) |
|---|---|---|---|
| RQ1 | Do certain merchant types have higher fraud incidence? | category | is_fraud |
| RQ2 | Are merchants in more populated cities more fraud-prone? | city_pop | is_fraud |
| RQ3 | Does transaction hour affect fraud probability? | transaction_hour | is_fraud |
Data Overview
The dataset contains simulated credit card transactions spanning 2019–2020, drawn from a publicly available Kaggle source. It covers 693 distinct merchants and 999 unique cardholders across the United States, with over 1.8 million transaction records.
| Feature Group | Example Variables | Type |
|---|---|---|
| Transaction Details | amt, trans_date_trans_time, trans_num | Numeric / DateTime |
| Merchant Info | merchant, category, merch_lat, merch_long | Categorical / Numeric |
| Cardholder Demographics | first, last, gender, dob, job | Categorical |
| Location | city, state, zip, lat, long, city_pop | Mixed |
| Engineered Features | transaction_hour, customer_age | Numeric |
| Target | is_fraud | Binary (0/1) |
The dataset is highly imbalanced — fraudulent transactions comprise approximately 0.5% of all records. This imbalance was a central consideration throughout model design and evaluation, with sensitivity (true positive rate) prioritized over overall accuracy.
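The accuracy-versus-sensitivity point can be checked with a toy calculation (all counts below are hypothetical, not from the dataset):

```python
# Why accuracy misleads at a ~0.5% fraud base rate (hypothetical counts).
total = 200_000
fraud = 1_000  # ~0.5% of transactions

# A "detector" that flags nothing is 99.5% accurate but catches zero fraud.
accuracy_flag_nothing = (total - fraud) / total

# A model that catches 980 of 1,000 frauds at the cost of 4,000 false alarms:
tp, fn, fp = 980, 20, 4_000
tn = total - fraud - fp
accuracy = (tp + tn) / total
sensitivity = tp / (tp + fn)   # true positive rate
precision = tp / (tp + fp)

print(f"flag-nothing accuracy: {accuracy_flag_nothing:.4f}")  # 0.9950
print(f"model accuracy:        {accuracy:.4f}")               # 0.9799
print(f"model sensitivity:     {sensitivity:.4f}")            # 0.9800
print(f"model precision:       {precision:.4f}")              # 0.1968
```

The flag-nothing baseline "wins" on accuracy while missing every fraud, which is why sensitivity is the headline metric throughout.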
Data Preparation & Feature Engineering
Two raw transaction datasets were merged and cleaned in R before any analysis. The preparation pipeline ensured consistency, reproducibility, and suitability for both exploratory analysis and modeling.
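The pipeline itself was written in R (dplyr's bind_rows(), the mice package); a rough pandas analogue of the merge-and-dedup step, with hypothetical column values, might look like:

```python
import pandas as pd

# Two raw transaction extracts (hypothetical minimal columns).
part1 = pd.DataFrame({"trans_num": ["a1", "a2"], "amt": [12.5, 80.0]})
part2 = pd.DataFrame({"trans_num": ["a2", "a3"], "amt": [80.0, 5.0]})

# Row-bind the extracts (R: bind_rows), then drop duplicate transactions.
tx = pd.concat([part1, part2], ignore_index=True)
tx = tx.drop_duplicates(subset="trans_num").reset_index(drop=True)

# Strip the "fraud_" prefix the simulator prepends to merchant names,
# and zero-pad ZIP codes to five digits.
merchants = pd.Series(["fraud_Acme Corp", "Beta LLC"])
merchants = merchants.str.replace(r"^fraud_", "", regex=True)
zips = pd.Series([501, 90210]).astype(str).str.zfill(5)

print(len(tx))       # 3 unique transactions
print(merchants[0])  # "Acme Corp"
print(zips[0])       # "00501"
```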
- The two raw files were combined with bind_rows(). Unnecessary identifier columns were removed, and duplicate records were verified and dropped to retain only unique transactions.
- Missing values in age and transaction_day were imputed using the mice package with predictive mean matching to preserve distributional shape.
- The "fraud_" prefix was removed from merchant names for clean labeling.
- Timestamps were decomposed into transaction_hour, day, month, and year. A customer_age field was derived from cardholder date of birth relative to each transaction date. ZIP codes were standardized with leading zeros.

Exploratory Data Analysis
Before modeling, exploratory analysis revealed consistent behavioral and structural patterns that distinguish fraudulent from legitimate transactions. These patterns directly informed feature selection and the design of clustering scenarios.
Key EDA Findings
| Dimension | Finding | Implication |
|---|---|---|
| Class Balance | ~0.5% of transactions are fraudulent | Accuracy is misleading; sensitivity is the right metric |
| Merchant Category | grocery_pos, shopping_net, misc_net have highest fraud counts | Category is a meaningful discriminating feature |
| Transaction Amount | Fraud clusters around mid-range amounts, not extremes | Simple amount thresholds would miss most fraud |
| Time of Day | Fraud peaks sharply at hour 22–23; secondary spike at hours 0–2 | Late-night transactions are disproportionately risky |
| Geography | Some states and lower-population cities show elevated fraud rates | Location provides signal but is not sufficient alone |
The concentration of fraud at late-night hours (22–23) and in specific merchant categories like online shopping suggests that fraudsters exploit timing and digital channel vulnerability — patterns that are learnable by behavioral models.
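The hour-of-day pattern can be surfaced with a simple group-by; a sketch on synthetic rows, using the column names from the feature table above:

```python
import pandas as pd

# Synthetic transactions: hour of day plus fraud flag (illustrative only).
tx = pd.DataFrame({
    "transaction_hour": [22, 22, 23, 23, 10, 10, 10, 14, 1, 1],
    "is_fraud":         [ 1,  1,  0,  1,  0,  0,  0,  0, 1, 0],
})

# Fraud rate per hour: the mean of a 0/1 flag is the fraction fraudulent.
rate_by_hour = (
    tx.groupby("transaction_hour")["is_fraud"]
      .mean()
      .sort_values(ascending=False)
)
print(rate_by_hour)
# On the real data, hours 22-23 (and a secondary 0-2 spike) rank first.
```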
Clustering Approach
Before supervised modeling, unsupervised clustering was applied to identify latent behavioral structures in the transaction data — without using the is_fraud label. The goal was to assess whether behavioral segmentation could separate high-risk transaction groups from low-risk ones.
Nine scenarios were designed across two clustering methods and four feature subsets, plus a no-clustering baseline. Each scenario tested how different variable combinations affect the ability to isolate fraud-prone clusters.
| Scenario | Features Used | Clustering | Key Observation |
|---|---|---|---|
| 1 | All except city_pop, transaction_hour | K-means | Fraud spread evenly across clusters (<1.3%) |
| 3 | All except category, transaction_hour | K-means | Cluster 3: 35.49% fraud — strong isolation |
| 5 | All except category, city_pop | K-means | Cluster 3: 35.79% fraud — hour drives separation |
| 7 | All features | K-means | Strong overall performance, high precision |
| 8 | All features | Model-based | Best: highest sensitivity + lowest loss |
| 9 | All features | None | Strong baseline; clustering adds only incremental value |
Scenarios 3 and 5 demonstrated that transaction hour and city population alone — even without merchant category — can isolate a cluster with over 35% fraud rate while keeping all other clusters below 0.5%. This is a striking signal of temporal behavioral structure in fraud patterns.
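The scenario logic, clustering without the label and then profiling fraud rate per cluster, can be sketched with scikit-learn on synthetic data (the project itself ran K-means and model-based clustering in R; all values here are illustrative):

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic transactions: daytime legitimate traffic vs a late-night,
# low-population pocket where fraud concentrates (illustrative only).
legit = pd.DataFrame({
    "transaction_hour": rng.integers(8, 20, 500),
    "city_pop": rng.integers(50_000, 1_000_000, 500),
    "is_fraud": np.zeros(500, dtype=int),
})
fraud = pd.DataFrame({
    "transaction_hour": rng.integers(22, 24, 50),
    "city_pop": rng.integers(1_000, 20_000, 50),
    "is_fraud": np.ones(50, dtype=int),
})
tx = pd.concat([legit, fraud], ignore_index=True)

# Cluster on behavior only -- the is_fraud label is never shown to K-means.
X = StandardScaler().fit_transform(tx[["transaction_hour", "city_pop"]])
tx["cluster"] = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)

# Profile each cluster by its fraud rate, as in Scenarios 3 and 5.
profile = tx.groupby("cluster")["is_fraud"].mean()
print(profile)  # one cluster carries nearly all the fraud
```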
Predictive Modeling
Cluster assignments were used as engineered features in four supervised classification models. Each model was evaluated on accuracy, sensitivity, precision, AUC, and total financial loss — with sensitivity prioritized to reflect the real-world cost of missed fraud.
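The metric computation from a confusion matrix plus per-transaction amounts can be sketched as follows (hypothetical numbers, not the project's results):

```python
import numpy as np

# Hypothetical test-set outcomes: true labels, predictions, and amounts.
y_true = np.array([1, 1, 1, 0, 0, 0, 0, 1])
y_pred = np.array([1, 1, 0, 0, 1, 0, 0, 1])
amt    = np.array([120.0, 45.0, 300.0, 10.0, 10.0, 25.0, 60.0, 80.0])

tp = int(np.sum((y_true == 1) & (y_pred == 1)))
fn = int(np.sum((y_true == 1) & (y_pred == 0)))
fp = int(np.sum((y_true == 0) & (y_pred == 1)))

sensitivity = tp / (tp + fn)
precision = tp / (tp + fp)

# Financial loss: the dollar value of fraud the model failed to catch.
loss = float(amt[(y_true == 1) & (y_pred == 0)].sum())

print(f"sensitivity={sensitivity:.2f} precision={precision:.2f} loss=${loss:.0f}")
```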
Among the four models:

- Random Forest, implemented with randomForest (best overall): captured nonlinear feature interactions, handled class imbalance effectively, and produced the best balance of high sensitivity and low total financial loss across all scenarios.
- Neural Network, implemented with nnet: captured nonlinear relationships with moderate performance. Sensitivity was competitive with Random Forest in several scenarios, but less consistent.

Results & Best Model
Sensitivity (true positive rate) was compared across all nine scenarios; the table below lists the best-performing model for representative scenarios. The strongest overall configuration was model-based clustering paired with Random Forest (Scenario 8).
| Scenario | Clustering | Model | Sensitivity | Precision | Loss ($) |
|---|---|---|---|---|---|
| 1 | K-means | Deep Learning | 0.9753 | 0.0737 | $4,673 |
| 2 | Model-based | Random Forest | 0.9584 | 0.1295 | $3,588 |
| 5 | K-means | Random Forest | 0.9209 | 0.0927 | $14,905 |
| 7 | K-means | Random Forest | 0.9805 | 0.1837 | $2,933 |
| 8 ★ | Model-based | Random Forest | 0.9830 | 0.1734 | $1,638 |
| 9 | None | Deep Learning | 0.9820 | 0.1259 | $808 |
Scenario 8 achieves the best balance across all business-relevant metrics: highest sensitivity (0.9830), strong precision (0.1734), and the second-lowest financial loss ($1,638). While Scenario 9 technically shows a lower raw dollar loss, it does so with substantially lower precision — generating more false alarms that burden fraud investigation teams.
Key Learnings & Reflection
- 01 Fraud exhibits clear behavioral and temporal structure — it is not random. Patterns in merchant type, transaction hour, and city population are learnable signals, not noise. This makes domain-informed feature engineering essential.
- 02 Unsupervised clustering adds meaningful value when features with natural behavioral boundaries are included. Scenarios 3 and 5 showed that transaction hour and city population alone can isolate clusters with over 35% fraud concentration.
- 03 Random Forest consistently outperformed simpler and more complex alternatives on the metrics that matter most for business use. Its ensemble structure handles class imbalance, feature interactions, and noisy data well — without requiring hyperparameter tuning to perform competitively.
- 04 Model evaluation must be framed around business cost, not just accuracy. High sensitivity with moderate precision is the right trade-off here: missing fraud is far more costly than generating a false alert that gets reviewed and cleared.
- 05 No hyperparameter tuning was performed due to computational constraints. This means the reported results are conservative — all models ran on default settings, and further optimization would likely improve precision without sacrificing sensitivity.
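The cost asymmetry behind learning 04 can be made concrete with a toy expected-cost comparison (all unit costs and error counts are hypothetical assumptions):

```python
# Hypothetical unit costs: a missed fraud loses the transaction amount,
# a false alarm costs an analyst review (both figures assumed).
cost_missed_fraud = 250.0   # average fraudulent amount, assumed
cost_false_alarm = 5.0      # review cost per flagged transaction, assumed

def total_cost(frauds_missed: int, false_alarms: int) -> float:
    """Business cost of a model's errors under the assumed unit costs."""
    return frauds_missed * cost_missed_fraud + false_alarms * cost_false_alarm

# A high-sensitivity, moderate-precision model vs a "high-accuracy" one
# that quietly misses most fraud.
sensitive_model = total_cost(frauds_missed=20, false_alarms=4_000)
accurate_model = total_cost(frauds_missed=500, false_alarms=200)

print(sensitive_model)  # 25000.0
print(accurate_model)   # 126000.0
```

Even with thousands of extra false alarms, the sensitivity-first model is far cheaper once missed-fraud losses are priced in.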