Credit Card Fraud
Detection & Behavioral Analysis
Analyzing 1.8 million credit card transactions to detect fraud through unsupervised behavioral clustering and supervised classification, combining K-means and model-based segmentation with Random Forest, Neural Networks, and Deep Learning.
Problem Context & Objective
The rapid digitization of financial services has created unprecedented convenience — and unprecedented fraud exposure. As customers rely on online and mobile payment systems, fraud tactics evolve rapidly, affecting millions of people and billions of dollars globally. Merchants worldwide now face a wider variety of fraud attacks than ever before, with first-party misuse, account takeovers, and triangulation schemes all on the rise.
This project investigates a central question: how do transactional and behavioral factors affect the likelihood of a person being a fraud victim? Rather than treating fraud detection as a black-box classification task, the goal was to understand why certain transactions are fraudulent — and whether unsupervised behavioral segmentation could improve detection beyond traditional feature-based models.
Three research questions guided the analysis: whether merchant type, city population size, and transaction hour each independently predict fraud likelihood, and how combining all three compares with using each predictor in isolation.
Research Questions
| RQ | Question | Predictor (X) | Outcome (Y) |
|---|---|---|---|
| RQ1 | Do certain merchant types have higher fraud incidence? | category | is_fraud |
| RQ2 | Are merchants in more populated cities more fraud-prone? | city_pop | is_fraud |
| RQ3 | Does transaction hour affect fraud probability? | transaction_hour | is_fraud |
Data Overview
The dataset contains simulated credit card transactions spanning 2019–2020, drawn from a publicly available Kaggle source. It covers 693 distinct merchants and 999 unique cardholders across the United States, with over 1.8 million transaction records.
| Feature Group | Example Variables | Type |
|---|---|---|
| Transaction Details | amt, trans_date_trans_time, trans_num | Numeric / DateTime |
| Merchant Info | merchant, category, merch_lat, merch_long | Categorical / Numeric |
| Cardholder Demographics | first, last, gender, dob, job | Categorical |
| Location | city, state, zip, lat, long, city_pop | Mixed |
| Engineered Features | transaction_hour, customer_age | Numeric |
| Target | is_fraud | Binary (0/1) |
The dataset is highly imbalanced — fraudulent transactions comprise approximately 0.5% of all records. This imbalance was a central consideration throughout model design and evaluation, with sensitivity (true positive rate) prioritized over overall accuracy.
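The accuracy-versus-sensitivity point can be checked with a toy calculation (all counts below are hypothetical, not from the dataset):

```python
# Why accuracy misleads at a ~0.5% fraud base rate (hypothetical counts).
total = 200_000
fraud = 1_000  # ~0.5% of transactions

# A "detector" that flags nothing is 99.5% accurate but catches zero fraud.
accuracy_flag_nothing = (total - fraud) / total

# A model that catches 980 of 1,000 frauds at the cost of 4,000 false alarms:
tp, fn, fp = 980, 20, 4_000
tn = total - fraud - fp
accuracy = (tp + tn) / total
sensitivity = tp / (tp + fn)   # true positive rate
precision = tp / (tp + fp)

print(f"flag-nothing accuracy: {accuracy_flag_nothing:.4f}")  # 0.9950
print(f"model accuracy:        {accuracy:.4f}")               # 0.9799
print(f"model sensitivity:     {sensitivity:.4f}")            # 0.9800
print(f"model precision:       {precision:.4f}")              # 0.1968
```

The flag-nothing baseline "wins" on accuracy while missing every fraud, which is why sensitivity is the headline metric throughout.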
Data Preparation & Feature Engineering
Two raw transaction datasets were merged and cleaned in R before any analysis. The preparation pipeline ensured consistency, reproducibility, and suitability for both exploratory analysis and modeling.
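The pipeline itself was written in R (dplyr's bind_rows(), the mice package); a rough pandas analogue of the merge-and-dedup step, with hypothetical column values, might look like:

```python
import pandas as pd

# Two raw transaction extracts (hypothetical minimal columns).
part1 = pd.DataFrame({"trans_num": ["a1", "a2"], "amt": [12.5, 80.0]})
part2 = pd.DataFrame({"trans_num": ["a2", "a3"], "amt": [80.0, 5.0]})

# Row-bind the extracts (R: bind_rows), then drop duplicate transactions.
tx = pd.concat([part1, part2], ignore_index=True)
tx = tx.drop_duplicates(subset="trans_num").reset_index(drop=True)

# Strip the "fraud_" prefix the simulator prepends to merchant names,
# and zero-pad ZIP codes to five digits.
merchants = pd.Series(["fraud_Acme Corp", "Beta LLC"])
merchants = merchants.str.replace(r"^fraud_", "", regex=True)
zips = pd.Series([501, 90210]).astype(str).str.zfill(5)

print(len(tx))       # 3 unique transactions
print(merchants[0])  # "Acme Corp"
print(zips[0])       # "00501"
```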
- The two raw files were combined with bind_rows(). Unnecessary identifier columns were removed, and duplicate records were verified and dropped to retain only unique transactions.
- Missing values in age and transaction_day were imputed using the mice package with predictive mean matching to preserve distributional shape.
- The "fraud_" prefix was removed from merchant names for clean labeling.
- Timestamps were decomposed into transaction_hour, day, month, and year. A customer_age field was derived from cardholder date of birth relative to each transaction date. ZIP codes were standardized with leading zeros.

Exploratory Data Analysis
Before modeling, exploratory analysis revealed consistent behavioral and structural patterns that distinguish fraudulent from legitimate transactions. These patterns directly informed feature selection and the design of clustering scenarios.
Key EDA Findings
| Dimension | Finding | Implication |
|---|---|---|
| Class Balance | ~0.5% of transactions are fraudulent | Accuracy is misleading; sensitivity is the right metric |
| Merchant Category | grocery_pos, shopping_net, misc_net have highest fraud counts | Category is a meaningful discriminating feature |
| Transaction Amount | Fraud clusters around mid-range amounts, not extremes | Simple amount thresholds would miss most fraud |
| Time of Day | Fraud peaks sharply at hour 22–23; secondary spike at hours 0–2 | Late-night transactions are disproportionately risky |
| Geography | Some states and lower-population cities show elevated fraud rates | Location provides signal but is not sufficient alone |
The concentration of fraud at late-night hours (22–23) and in specific merchant categories like online shopping suggests that fraudsters exploit timing and digital channel vulnerability — patterns that are learnable by behavioral models.
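The hour-of-day pattern can be surfaced with a simple group-by; a sketch on synthetic rows, using the column names from the feature table above:

```python
import pandas as pd

# Synthetic transactions: hour of day plus fraud flag (illustrative only).
tx = pd.DataFrame({
    "transaction_hour": [22, 22, 23, 23, 10, 10, 10, 14, 1, 1],
    "is_fraud":         [ 1,  1,  0,  1,  0,  0,  0,  0, 1, 0],
})

# Fraud rate per hour: the mean of a 0/1 flag is the fraction fraudulent.
rate_by_hour = (
    tx.groupby("transaction_hour")["is_fraud"]
      .mean()
      .sort_values(ascending=False)
)
print(rate_by_hour)
# On the real data, hours 22-23 (and a secondary 0-2 spike) rank first.
```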
Clustering Approach
Before supervised modeling, unsupervised clustering was applied to identify latent behavioral structures in the transaction data — without using the is_fraud label. The goal was to assess whether behavioral segmentation could separate high-risk transaction groups from low-risk ones.
Nine scenarios were designed across two clustering methods and four feature subsets, plus a no-clustering baseline. Each scenario tested how different variable combinations affect the ability to isolate fraud-prone clusters.
| Scenario | Features Used | Clustering | Key Observation |
|---|---|---|---|
| 1 | All except city_pop, transaction_hour | K-means | Fraud spread evenly across clusters (<1.3%) |
| 3 | All except category, transaction_hour | K-means | Cluster 3: 35.49% fraud — strong isolation |
| 5 | All except category, city_pop | K-means | Cluster 3: 35.79% fraud — hour drives separation |
| 7 | All features | K-means | Strong overall performance, high precision |
| 8 | All features | Model-based | Best: highest sensitivity + lowest loss |
| 9 | All features | None | Strong baseline; clustering adds only incremental value |
Scenarios 3 and 5 demonstrated that transaction hour and city population alone — even without merchant category — can isolate a cluster with over 35% fraud rate while keeping all other clusters below 0.5%. This is a striking signal of temporal behavioral structure in fraud patterns.
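The scenario logic, clustering without the label and then profiling fraud rate per cluster, can be sketched with scikit-learn on synthetic data (the project itself ran K-means and model-based clustering in R; all values here are illustrative):

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic transactions: daytime legitimate traffic vs a late-night,
# low-population pocket where fraud concentrates (illustrative only).
legit = pd.DataFrame({
    "transaction_hour": rng.integers(8, 20, 500),
    "city_pop": rng.integers(50_000, 1_000_000, 500),
    "is_fraud": np.zeros(500, dtype=int),
})
fraud = pd.DataFrame({
    "transaction_hour": rng.integers(22, 24, 50),
    "city_pop": rng.integers(1_000, 20_000, 50),
    "is_fraud": np.ones(50, dtype=int),
})
tx = pd.concat([legit, fraud], ignore_index=True)

# Cluster on behavior only -- the is_fraud label is never shown to K-means.
X = StandardScaler().fit_transform(tx[["transaction_hour", "city_pop"]])
tx["cluster"] = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)

# Profile each cluster by its fraud rate, as in Scenarios 3 and 5.
profile = tx.groupby("cluster")["is_fraud"].mean()
print(profile)  # one cluster carries nearly all the fraud
```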
Predictive Modeling
Cluster assignments were used as engineered features in four supervised classification models. Each model was evaluated on accuracy, sensitivity, precision, AUC, and total financial loss — with sensitivity prioritized to reflect the real-world cost of missed fraud.
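The metric computation from a confusion matrix plus per-transaction amounts can be sketched as follows (hypothetical numbers, not the project's results):

```python
import numpy as np

# Hypothetical test-set outcomes: true labels, predictions, and amounts.
y_true = np.array([1, 1, 1, 0, 0, 0, 0, 1])
y_pred = np.array([1, 1, 0, 0, 1, 0, 0, 1])
amt    = np.array([120.0, 45.0, 300.0, 10.0, 10.0, 25.0, 60.0, 80.0])

tp = int(np.sum((y_true == 1) & (y_pred == 1)))
fn = int(np.sum((y_true == 1) & (y_pred == 0)))
fp = int(np.sum((y_true == 0) & (y_pred == 1)))

sensitivity = tp / (tp + fn)
precision = tp / (tp + fp)

# Financial loss: the dollar value of fraud the model failed to catch.
loss = float(amt[(y_true == 1) & (y_pred == 0)].sum())

print(f"sensitivity={sensitivity:.2f} precision={precision:.2f} loss=${loss:.0f}")
```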
Among the four models:

- Random Forest, implemented with randomForest (best overall): captured nonlinear feature interactions, handled class imbalance effectively, and produced the best balance of high sensitivity and low total financial loss across all scenarios.
- Neural Network, implemented with nnet: captured nonlinear relationships with moderate performance. Sensitivity was competitive with Random Forest in several scenarios, but less consistent.

Results & Best Model
Sensitivity (true positive rate) was compared across all nine scenarios; the table below lists the best-performing model for representative scenarios. The strongest overall configuration was model-based clustering paired with Random Forest (Scenario 8).
| Scenario | Clustering | Model | Sensitivity | Precision | Loss ($) |
|---|---|---|---|---|---|
| 1 | K-means | Deep Learning | 0.9753 | 0.0737 | $4,673 |
| 2 | Model-based | Random Forest | 0.9584 | 0.1295 | $3,588 |
| 5 | K-means | Random Forest | 0.9209 | 0.0927 | $14,905 |
| 7 | K-means | Random Forest | 0.9805 | 0.1837 | $2,933 |
| 8 ★ | Model-based | Random Forest | 0.9830 | 0.1734 | $1,638 |
| 9 | None | Deep Learning | 0.9820 | 0.1259 | $808 |
Scenario 8 achieves the best balance across all business-relevant metrics: highest sensitivity (0.9830), strong precision (0.1734), and the second-lowest financial loss ($1,638). While Scenario 9 technically shows a lower raw dollar loss, it does so with substantially lower precision — generating more false alarms that burden fraud investigation teams.
Key Learnings & Reflection
- 01 Fraud exhibits clear behavioral and temporal structure — it is not random. Patterns in merchant type, transaction hour, and city population are learnable signals, not noise. This makes domain-informed feature engineering essential.
- 02 Unsupervised clustering adds meaningful value when features with natural behavioral boundaries are included. Scenarios 3 and 5 showed that transaction hour and city population alone can isolate clusters with over 35% fraud concentration.
- 03 Random Forest consistently outperformed simpler and more complex alternatives on the metrics that matter most for business use. Its ensemble structure handles class imbalance, feature interactions, and noisy data well — without requiring hyperparameter tuning to perform competitively.
- 04 Model evaluation must be framed around business cost, not just accuracy. High sensitivity with moderate precision is the right trade-off here: missing fraud is far more costly than generating a false alert that gets reviewed and cleared.
- 05 No hyperparameter tuning was performed due to computational constraints. This means the reported results are conservative — all models ran on default settings, and further optimization would likely improve precision without sacrificing sensitivity.
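The cost asymmetry behind learning 04 can be made concrete with a toy expected-cost comparison (all unit costs and error counts are hypothetical assumptions):

```python
# Hypothetical unit costs: a missed fraud loses the transaction amount,
# a false alarm costs an analyst review (both figures assumed).
cost_missed_fraud = 250.0   # average fraudulent amount, assumed
cost_false_alarm = 5.0      # review cost per flagged transaction, assumed

def total_cost(frauds_missed: int, false_alarms: int) -> float:
    """Business cost of a model's errors under the assumed unit costs."""
    return frauds_missed * cost_missed_fraud + false_alarms * cost_false_alarm

# A high-sensitivity, moderate-precision model vs a "high-accuracy" one
# that quietly misses most fraud.
sensitive_model = total_cost(frauds_missed=20, false_alarms=4_000)
accurate_model = total_cost(frauds_missed=500, false_alarms=200)

print(sensitive_model)  # 25000.0
print(accurate_model)   # 126000.0
```

Even with thousands of extra false alarms, the sensitivity-first model is far cheaper once missed-fraud losses are priced in.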