Sep to Dec 2025 Python · Excel

Maximizing Lifetime Return
Through Data-Driven Insights

A full-stack analytics capstone for Vector Controls and Automation Group building a repeatable, sponsor-deployed system that tells the sales team who to prioritize, how to engage, and when to act, powered by behavioral clustering, churn modeling, and industry-level forecasting.

Project Context & Business Problem

Vector Controls and Automation Group (Vector CAG) is an industrial distributor across sectors including Chemical, Oil & Gas, Water & Waste Water, and Food & Beverage. Despite a growing revenue base, the sales team operated with limited analytical visibility, as they had no systematic way to know which customers were drifting toward churn, which industries were peaking or declining, or how to calibrate outreach across a portfolio of thousands of accounts.

The core business problem was twofold. First, revenue was heavily concentrated: a small loyal segment drove the majority of sales, while the bulk of customers purchased irregularly, with no early-warning system to flag deteriorating engagement. Second, sales effort was allocated intuitively rather than analytically, meaning high-value accounts at risk were often treated identically to stable low-value ones.

The overarching goal: build a repeatable, data-driven system that guides the Vector sales team on who to prioritize and when to engage.

Final Presentation Detailed Results

Full Project Workflow

The project ran across four months in two modeling tracks clustering and forecasting integrated into a single interactive deliverable. Below is the end-to-end pipeline from raw data to sponsor handoff.

Data Acquisition & Scoping Sep 2025

Received raw transactional data from Vector CAG covering 2019 2024. Held sponsor kickoff to align on business questions, define success criteria, and establish data use constraints. Scoped the two modeling tracks: behavioral clustering + churn modeling, and industry-level seasonality + customer growth forecasting.

Data Preparation & EDA Sep Oct 2025

Cleaned and merged transaction records. Engineered RFM-style features: recency (days since last order), frequency (order count), monetary value (total and average), tenure, and order volatility.

Customer & Industry Clustering My Lead

Led the full clustering workstream. Applied K-Means (K=4) to segment customers into behavioral archetypes and K-Means (K=6) to segment industries into structural market profiles. Validated cluster stability using elbow criterion and silhouette scoring. Produced interpretable, named cluster profiles mapped to engagement strategies.

Churn Modeling & Risk Tiering My Lead

Built a monthly panel dataset capturing recency, frequency, monetary, and volatility features per customer. Trained and compared Logistic Regression vs. HistGradientBoosting classifiers. Evaluated on ROC-AUC, precision, and recall. Scored all 2,943 customers with churn probabilities, then binned into P1 P4 risk tiers with expected loss (sales_win × churn_prob) quantified per band.

Three-Layer Playbook Design My Lead

Designed the playbook architecture that joins risk tier (Layer 1), customer cluster (Layer 2), and industry cluster (Layer 3) into a single per-account action template. Every customer in the portfolio receives a unique, structured playbook combining outreach intensity, communication style, and industry timing guidance.

Industry Forecasting Forecasting Track

Applied STL decomposition to assess trend, seasonality, and forecastability across 12 industries. Selected six priority sectors. Compared ETS vs. SARIMAX models using AIC and volatility ratio to produce 24-month forward revenue forecasts. Mapped industries onto a Growth vs. Stability matrix.

Customer-Level Growth Potential Forecasting Track

Forecasted individual customer revenue trajectories using ETS or SARIMAX depending on seasonality strength. Classified customers into a BCG-style matrix (Stars, Cash Cows, Question Marks, Dogs) based on forecasted CAGR and revenue. Ran Base / High / Low scenario simulations to quantify upside sensitivity by segment.

Dashboard Build & Sponsor Delivery Dec 2025

Integrated all model outputs into a single self-contained HTML dashboard deployable without a server. Includes interactive customer playbook lookup, KPI monitoring, risk band tables, segmentation overviews, and churn heatmaps. Delivered with full documentation and a sponsor-facing refresh guide.

Data & Exploratory Analysis

The dataset spans six years (2019 2024) of internal order records from Vector CAG, covering 9,000+ unique customers across 12 industry verticals. Each record contains customer ID, order date, order value, product category, and industry classification.

Feature Engineering

Raw transactional records were aggregated to customer-level behavioral features before any modeling. The key engineered dimensions:

Feature	Definition	Why It Matters
`recency`	Days since most recent order	Primary churn signal long gaps indicate disengagement
`frequency`	Total order count	Engagement depth; key driver of loyalty segmentation
`monetary`	Cumulative and average order value	Revenue importance; drives expected loss calculation
`tenure`	Days since first order	Relationship maturity; differentiates new vs. established accounts
`volatility`	Std dev of monthly order gaps	Purchasing stability; erratic patterns signal structural risk
`sales_win`	Rolling 6-month revenue	Used in expected loss = sales_win × churn_prob

Key EDA Findings

Exploratory analysis before any modeling surfaced three structural patterns that shaped every downstream decision:

Finding 01

Revenue Concentration

The top ~30% of customers by spend drove the vast majority of revenue. The remaining 70% showed low, irregular activity a long tail of infrequent buyers with no clear churn signal from revenue alone. This made RFM-based behavioral segmentation essential; revenue figures alone were too noisy to prioritize.

Finding 02

U-Shaped Churn Distribution

When churn probabilities were modeled, the distribution was strongly U-shaped: a large share of customers scored either very low (<0.2) or very high (>0.8) risk, with relatively few in the middle. This polarization confirmed that broad outreach campaigns would be wasteful the portfolio demanded segmented, tiered intervention rather than uniform treatment.

Finding 03

Revenue Trend is Stable, Not Declining

Monthly revenue showed a gradual upward trajectory with mild seasonality but no structural decline. This was a critical business context finding: the problem was not recovery from loss, but preventing future churn in a portfolio that was otherwise healthy. Retention priority should focus on protecting the upward trend, not reversing a downturn.

Clustering Customer & Industry Segmentation

I led the full clustering workstream, responsible for feature design, algorithm selection, K validation, cluster profiling, naming, and strategy mapping. The goal was not only classification, but also producing segments interpretable enough to inform daily sales decisions without data science mediation.

Customer Segmentation (K-Means, K=4)

Customer clustering was built on five standardized behavioral features: recency, frequency, total revenue, average order value, and tenure. I evaluated cluster solutions from K=2 through K=8 using both the elbow criterion (within-cluster sum of squares) and silhouette score to measure separation quality.

K=4 was selected over higher-K solutions despite marginal silhouette improvements at K=5 6, because interpretability and cluster stability were the primary constraints. Each of the four clusters needed to map cleanly to a recognizable customer relationship stage that sales managers could act on without a data dictionary.

The four clusters, profiled on median behavioral metrics and named for interpretability:

C0 · ~40% of customers

Large, Low-Value & Inactive

Med. Orders 3

Med. Revenue $5,619

Med. Recency 1,322 days

Med. Tenure 134 days

Largest group by headcount. Very high recency (long-dormant), very low order frequency. Candidates for low-cost reactivation campaigns win-back sequences, replenishment reminders, or structured check-ins to gauge whether accounts are salvageable.

C1 · ~30% of customers

High-Value Loyal

Med. Orders 26

Med. Revenue $66,639

Med. Recency 119 days

Med. Tenure 1,713 days

Core strategic accounts. High frequency, high spend, long tenure, low recency. Despite representing only ~30% of customers, this cluster drives the majority of revenue. Strategy: dedicated AM relationships, premium service, proactive renewal outreach, and early detection of any engagement softening.

C2 · ~25% of customers

Moderate-Value Newer

Med. Orders 1

Med. Revenue $1,010

Med. Recency 410 days

Med. Tenure 0 days

Newer or recently acquired accounts with limited purchasing history. Zero tenure indicates many are within their first order cycle. Growth opportunity if onboarded effectively. Strategy: usage tips, frequency cadence nudges, cross-sell starter bundles, and structured first-90-days engagement sequences.

C3 · ~2% of customers

High-Return / Unstable

Med. Orders 4

Med. Revenue $647

Med. Recency 1,264 days

Med. Tenure 105 days

Small but structurally unusual cluster showing high return rates relative to revenue and erratic ordering patterns. May reflect problematic accounts, product-fit issues, or procurement volatility. Strategy: root-cause review, SKU replacement discussion, credit/lead-time checks before investing further outreach.

A key design principle: the cluster labels (C0 C3) map to relationship lifecycle stages, not just statistical groupings. This made them immediately actionable sales managers could read "High-Value Loyal" and know exactly what playbook layer it triggered, without consulting the underlying data.

Customer Cluster Profile Z-scored feature means across K=4 clusters

Industry Segmentation (K-Means, K=6)

A parallel clustering exercise was conducted at the industry level. Rather than individual customers, features were aggregated by industry vertical median order count, revenue, average order value, recency, tenure, and order volatility capturing structural market-level purchasing patterns.

K=6 was selected after evaluating K=3 through K=8. The six-cluster solution produced clearly distinct market profiles that aligned with intuitive business categories, enabling sector-specific coverage and pricing strategies.

Cluster	Customers	Med. Revenue	Med. Orders	Tenure	Strategic Profile
I0 · Baseline Stable	583	$53,268	27	1,571 days	Standard cadence + EOQ reminders + standing orders
I1 · Project-Based	493	$130,189	42	1,810 days	Pre-bid/bid reminders + bulk pricing + delivery scheduling
I2 · Fragmented/Small	96	$54,551	21	1,410 days	Micro-offers + lower MOQs + simplified SKU guides
I3 · Unstable/Irregular	428	$1,748	2	0 days	Risk review + credit/lead-time checks + small test orders
I4 · Steady High-Volume	1,014	$66,954	27	1,746 days	Stock alerts + VMI/blanket PO + auto-replenishment
I5 · Emerging/Niche	329	$28,667	14	1,276 days	Trial bundles + sampling/case study + feedback loop

Revenue concentration was stark: over 75% of total industry revenue came from Baseline Stable (I0), Project-Based (I1), and Steady High-Volume (I4) sectors. The Unstable/Irregular cluster (I3) showed near-zero tenure, suggesting a large cohort of newly acquired or dormant industry accounts that had never developed a stable purchasing relationship.

Industry Cluster Profile Z-scored feature means across K=6 clusters

Churn Modeling

With behavioral segments established, the next step was predicting churn probability for each customer. Churn was defined behaviorally a customer was labeled churned in a given month if they placed no order for 90+ days following a period of active engagement. This definition was calibrated with the sponsor to reflect realistic sales cycle lengths.

Panel Build

Monthly Feature Panel

Constructed a customer-month panel with rolling features: days since last order, order count in trailing 3/6/12 months, revenue in trailing windows, order gap volatility (standard deviation of inter-order intervals), and cluster membership. This longitudinal structure allowed the model to learn time-varying churn signals rather than static snapshots.

Model Compare

Logistic Regression vs. HistGradientBoosting

Logistic Regression served as the interpretable baseline, providing clear coefficient-level insight into which features drove churn probability. HistGradientBoosting was the primary candidate handling class imbalance, nonlinear interactions, and missing values natively. Both were evaluated on held-out months using ROC-AUC as the primary metric.

Result

HistGradientBoosting Selected · AUC 0.84

HistGradientBoosting achieved ROC-AUC of 0.84, exceeding the target benchmark of ≥0.80, with meaningfully better precision at high-probability thresholds. The final model scored all 2,943 active customers with a continuous churn probability p, forming the foundation of the risk-tier architecture. Model coverage was 100% with no null predictions.

Avg predicted churn probability by customer cluster C1 (High-Value Loyal) shows the lowest risk at ~0.45

Avg predicted churn probability by industry cluster I3 (Project-Based) peaks at ~0.86

Churn Probability by Customer × Industry Cluster

Churn probability heatmap Customer × Industry cluster intersections. Highest risk: C2/C3 × I3/I4 at 0.87 0.95

Risk Tiers & Three-Layer Playbook

Churn probabilities were binned into four priority tiers based on the relative risk distribution. The tiering logic was deliberately percentile-based rather than fixed-threshold, ensuring that the P1 group always represented the top 10% of risk regardless of overall model calibration shifts.

Expected loss per customer was computed as sales_win × churn_prob where sales_win was the trailing 6-month revenue a business-relevant metric that weights churn risk by actual revenue at stake, not just probability.

P1 Top 10%

295 customers · $526K loss

High-touch: AM call now

P2 10 30%

588 customers · $3.8M loss

Mid-touch: targeted outreach

P3 30 60%

883 customers · $8.0M loss

Low-touch: automated nurture

P4 60 100%

1,177 customers · $8.0M loss

Monitor only

P1 and P2 together the top 30% of accounts by risk concentrate over 70% of modeled expected churn loss. Focusing sales attention here is not just analytically optimal; it is the only defensible allocation of finite outreach capacity.

The three-layer playbook was the core design innovation of this workstream. Rather than assigning a single action per customer, the playbook stacks three independent recommendations into a unified per-account card:

Layer	Source	Tells Sales Team	Example
Layer 1 · Risk	Churn probability tier (P1 P4)	Who to reach out to first and with what urgency	High-touch: AM call now + incentive/replenishment
Layer 2 · Customer	Customer cluster (C0 C3)	How to communicate based on relationship stage	Retain & grow: AM call + premium service + early reorder
Layer 3 · Industry	Industry cluster (I0 I5)	When to engage based on sector demand cycles	Steady high-volume: stock alerts + VMI/blanket PO

An example playbook card from the delivered dashboard:

Customer #51166

P3 30 60%

Churn Probability

61.4%

Sales Win (6m)

$1,647,293

Expected Loss

$1,012,145

Clusters

Cust 1 · Ind 0

Risk: Low-touch: automated nurture sequence
Customer: Retain & grow AM call + premium service + early reorder/stock hold
Industry: Baseline stable standard cadence + EOQ reminders + standing orders

Latest 3 Order Dates & Totals

2024-05-16$538,964

2024-05-13$1,108,329

2023-10-24$604,878

Interactive Dashboard Delivery

All clustering, churn, and playbook outputs were consolidated into a single self-contained HTML dashboard no server required, deployable locally by the Vector sales team. The dashboard includes seven sections: key KPIs, revenue and churn patterns, segmentation overviews, at-risk prioritization tables, a Customer × Industry churn heatmap, and a searchable playbook lookup tool where any account manager can type a customer ID and instantly see their three-layer playbook, current risk tier, cluster assignments, and last three order records.

Live Dashboard · Vector CAG Churn & Retention Scroll within to explore

Forecasting Industry & Customer Growth

The forecasting track ran in parallel to clustering and produced two complementary deliverables: industry-level seasonality forecasts and customer-level growth potential segmentation. Together, they translate historical patterns into forward-looking revenue signals that feed directly into the playbook's timing and prioritization logic.

Industry Seasonality

STL decomposition was applied to the top six industries to separate trend, seasonality, and residual noise. This diagnostics step determined which sectors are reliably forecastable and informed model selection ETS for industries with stable, repeatable seasonality (Water & Waste Water, Dealers/Resellers) and SARIMAX for sectors with irregular fluctuations (Chemical, Food & Beverage). Water & Waste Water emerged as the highest long-term growth opportunity, while Food & Beverage and Chemical warrant cautious resource allocation given post-2024 softening signals.

Seasonality & Trend Top 6 Industries (STL Overview)

STL decomposition across top 6 industries separating trend, seasonality, and residual noise to assess forecastability and guide model selection

Customer Growth Potential

At the customer level, each account was assigned ETS or SARIMAX based on its individual seasonality strength, then classified into a BCG-style matrix Stars, Cash Cows, Question Marks, and Dogs using forecasted revenue and CAGR. Scenario simulations (Base / High / Low) applied quadrant-specific uplift logic: +10% for Stars, +30% for Question Marks, damped trend for Cash Cows, and a modest decline for Dogs. Stars dominate total forecasted revenue (~$19 21M), while Question Marks show the highest relative sensitivity to execution, validating their role as the primary growth lever under high-performance conditions.

Scenario forecast by BCG quadrant (left) and sensitivity analysis High & Low vs Base (right). Stars drive absolute revenue; Question Marks carry the highest relative upside at +17.6%

The forecasting outputs fed directly into the industry cluster strategies (Layer 3 of the playbook) account managers engaging Chemical or Dealers/Resellers customers received timing guidance calibrated to those sectors' seasonal demand peaks, adding a temporal dimension the cluster-only playbook could not provide on its own.

Key Learnings & Reflection

01 Clustering for business use is fundamentally a translation problem. Statistical validity (silhouette score, inertia) gets you to a good solution, but the real work is naming, profiling, and stress-testing clusters against human intuition until they feel like things a sales manager already knows just with clearer boundaries and better labels.
02 The three-layer playbook architecture was more valuable than any individual model. Churn probability alone tells you who is at risk; cluster membership tells you how they behave and what they respond to; industry cluster tells you when to act. Combining all three into a single per-account card removed the need for sales managers to reason across three separate outputs.
03 Expected loss (sales_win × churn_prob) was a more actionable risk metric than churn probability alone. A customer with 99% churn probability but $200 in trailing revenue warrants less urgency than one at 60% probability with $1.6M at stake. Ranking the at-risk list by expected loss focused the P1 table on accounts where intervention had genuine financial consequence.
04 Sponsor deployment constraints shaped technical decisions throughout. The requirement to deliver a single HTML file no server, no dependencies, no login meant every model output had to be pre-computed and embedded. This was a useful forcing function: it prevented over-engineering and kept the focus on what decision the sales team actually needed to make each week.
05 The U-shaped churn distribution was the most important EDA finding of the project. It told us the portfolio was already polarized most customers were either reliably loyal or reliably dormant, with relatively few in the middle. This justified the tiered intervention model over broad campaigns, and shaped how we framed the business case to the sponsor.

Next Steps & Scalability Path

CRM Integration

Embed cluster labels, churn scores, and playbook assignments into Salesforce or the sponsor's CRM for daily visibility without manual dashboard refresh.

Automated Refresh

Build a monthly pipeline that reruns feature engineering, re-scores churn probabilities, and regenerates the dashboard HTML executable by a non-technical user from a documented script.

Early Warning Triggers

Develop threshold-based alerts for sudden churn probability spikes or cluster migrations e.g., a C1 High-Value customer moving toward C0 Inactive behavior as automated email flags to account managers.

Model Enhancement

Incorporate price elasticity signals, product-level purchase patterns, and macroeconomic indicators to improve churn model precision and reduce false positive outreach costs.

Maximizing Lifetime ReturnThrough Data-Driven Insights

Project Context & Business Problem

Full Project Workflow

Data & Exploratory Analysis

Feature Engineering

Key EDA Findings

Clustering Customer & Industry Segmentation

Customer Segmentation (K-Means, K=4)

Industry Segmentation (K-Means, K=6)

Churn Modeling

Risk Tiers & Three-Layer Playbook

Interactive Dashboard Delivery

Forecasting Industry & Customer Growth

Industry Seasonality

Customer Growth Potential

Key Learnings & Reflection

Next Steps & Scalability Path

Maximizing Lifetime Return
Through Data-Driven Insights