Maximizing Lifetime Return
Through Data-Driven Insights
A full-stack analytics capstone for Vector Controls and Automation Group building a repeatable, sponsor-deployed system that tells the sales team who to prioritize, how to engage, and when to act, powered by behavioral clustering, churn modeling, and industry-level forecasting.
Project Context & Business Problem
Vector Controls and Automation Group (Vector CAG) is an industrial distributor across sectors including Chemical, Oil & Gas, Water & Waste Water, and Food & Beverage. Despite a growing revenue base, the sales team operated with limited analytical visibility, as they had no systematic way to know which customers were drifting toward churn, which industries were peaking or declining, or how to calibrate outreach across a portfolio of thousands of accounts.
The core business problem was twofold. First, revenue was heavily concentrated: a small loyal segment drove the majority of sales, while the bulk of customers purchased irregularly, with no early-warning system to flag deteriorating engagement. Second, sales effort was allocated intuitively rather than analytically, meaning high-value accounts at risk were often treated identically to stable low-value ones.
The overarching goal: build a repeatable, data-driven system that guides the Vector sales team on who to prioritize and when to engage.
Full Project Workflow
The project ran across four months in two modeling tracks, clustering and forecasting, integrated into a single interactive deliverable. Below is the end-to-end pipeline from raw data to sponsor handoff.
Data & Exploratory Analysis
The dataset spans six years (2019–2024) of internal order records from Vector CAG, covering 9,000+ unique customers across 12 industry verticals. Each record contains customer ID, order date, order value, product category, and industry classification.
Feature Engineering
Raw transactional records were aggregated to customer-level behavioral features before any modeling. The key engineered dimensions:
| Feature | Definition | Why It Matters |
|---|---|---|
| recency | Days since most recent order | Primary churn signal; long gaps indicate disengagement |
| frequency | Total order count | Engagement depth; key driver of loyalty segmentation |
| monetary | Cumulative and average order value | Revenue importance; drives expected loss calculation |
| tenure | Days since first order | Relationship maturity; differentiates new vs. established accounts |
| volatility | Std. dev. of monthly order gaps | Purchasing stability; erratic patterns signal structural risk |
| sales_win | Rolling 6-month revenue | Used in expected loss = sales_win × churn_prob |
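The aggregation from raw order rows to these customer-level features can be sketched in pandas. The column names (`customer_id`, `order_date`, `order_value`) are assumptions about the raw order schema, not the sponsor's actual field names:

```python
import pandas as pd

def build_customer_features(orders: pd.DataFrame, snapshot: pd.Timestamp) -> pd.DataFrame:
    """Aggregate raw order rows into per-customer behavioral features.

    Assumes columns: customer_id, order_date (datetime), order_value.
    """
    g = orders.groupby("customer_id")
    feats = pd.DataFrame({
        "recency": (snapshot - g["order_date"].max()).dt.days,    # days since last order
        "frequency": g.size(),                                    # total order count
        "monetary_total": g["order_value"].sum(),                 # cumulative revenue
        "monetary_avg": g["order_value"].mean(),                  # average order value
        "tenure": (snapshot - g["order_date"].min()).dt.days,     # days since first order
    })
    # Trailing 6-month revenue, used later as sales_win in the expected loss formula
    recent = orders[orders["order_date"] >= snapshot - pd.DateOffset(months=6)]
    feats["sales_win"] = (recent.groupby("customer_id")["order_value"].sum()
                          .reindex(feats.index, fill_value=0.0))
    return feats
```

Volatility (std. dev. of monthly order gaps) would be computed similarly from per-customer order-date differences; it is omitted here to keep the sketch short.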
Key EDA Findings
Exploratory analysis before any modeling surfaced three structural patterns that shaped every downstream decision.
Clustering: Customer & Industry Segmentation
I led the full clustering workstream, responsible for feature design, algorithm selection, K validation, cluster profiling, naming, and strategy mapping. The goal was not just classification but segments interpretable enough to inform daily sales decisions without data-science mediation.
Customer Segmentation (K-Means, K=4)
Customer clustering was built on five standardized behavioral features: recency, frequency, total revenue, average order value, and tenure. I evaluated cluster solutions from K=2 through K=8 using both the elbow criterion (within-cluster sum of squares) and silhouette score to measure separation quality.
K=4 was selected over higher-K solutions despite marginal silhouette improvements at K=5–6, because interpretability and cluster stability were the primary constraints. Each of the four clusters needed to map cleanly to a recognizable customer relationship stage that sales managers could act on without a data dictionary.
The four clusters, profiled on median behavioral metrics and named for interpretability:
A key design principle: the cluster labels (C0–C3) map to relationship lifecycle stages, not just statistical groupings. This made them immediately actionable: sales managers could read "High-Value Loyal" and know exactly what playbook layer it triggered, without consulting the underlying data.
Industry Segmentation (K-Means, K=6)
A parallel clustering exercise was conducted at the industry level. Rather than individual customers, features were aggregated by industry vertical (median order count, revenue, average order value, recency, tenure, and order volatility), capturing structural market-level purchasing patterns.
K=6 was selected after evaluating K=3 through K=8. The six-cluster solution produced clearly distinct market profiles that aligned with intuitive business categories, enabling sector-specific coverage and pricing strategies.
| Cluster | Customers | Med. Revenue | Med. Orders | Tenure | Strategic Profile |
|---|---|---|---|---|---|
| I0 · Baseline Stable | 583 | $53,268 | 27 | 1,571 days | Standard cadence + EOQ reminders + standing orders |
| I1 · Project-Based | 493 | $130,189 | 42 | 1,810 days | Pre-bid/bid reminders + bulk pricing + delivery scheduling |
| I2 · Fragmented/Small | 96 | $54,551 | 21 | 1,410 days | Micro-offers + lower MOQs + simplified SKU guides |
| I3 · Unstable/Irregular | 428 | $1,748 | 2 | 0 days | Risk review + credit/lead-time checks + small test orders |
| I4 · Steady High-Volume | 1,014 | $66,954 | 27 | 1,746 days | Stock alerts + VMI/blanket PO + auto-replenishment |
| I5 · Emerging/Niche | 329 | $28,667 | 14 | 1,276 days | Trial bundles + sampling/case study + feedback loop |
Revenue concentration was stark: over 75% of total industry revenue came from Baseline Stable (I0), Project-Based (I1), and Steady High-Volume (I4) sectors. The Unstable/Irregular cluster (I3) showed near-zero tenure, suggesting a large cohort of newly acquired or dormant industry accounts that had never developed a stable purchasing relationship.
Churn Modeling
With behavioral segments established, the next step was predicting churn probability for each customer. Churn was defined behaviorally: a customer was labeled churned in a given month if they placed no order for 90+ days following a period of active engagement. This definition was calibrated with the sponsor to reflect realistic sales cycle lengths.
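A minimal sketch of the behavioral label, simplified to a single snapshot date rather than the month-by-month labeling described above:

```python
import pandas as pd

CHURN_GAP_DAYS = 90  # calibrated with the sponsor to match realistic sales cycles

def is_churned(order_dates: pd.Series, snapshot: pd.Timestamp) -> bool:
    """Behavioral churn label at a snapshot date: the customer has ordered
    before (a period of active engagement) but has placed no order in the
    last 90+ days."""
    past = order_dates[order_dates <= snapshot].sort_values()
    if past.empty:
        return False  # never engaged: not churned, just never acquired
    return (snapshot - past.iloc[-1]).days >= CHURN_GAP_DAYS
```

Applying this per month over the six-year history yields the training labels; the snapshot form above is the inner step of that loop.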
Each customer received a churn probability, forming the foundation of the risk-tier architecture. Model coverage was 100%, with no null predictions.

Risk Tiers & Three-Layer Playbook
Churn probabilities were binned into four priority tiers based on the relative risk distribution. The tiering logic was deliberately percentile-based rather than fixed-threshold, ensuring that the P1 group always represented the top 10% of risk regardless of overall model calibration shifts.
Expected loss per customer was computed as sales_win × churn_prob, where sales_win is the trailing 6-month revenue, a business-relevant metric that weights churn risk by actual revenue at stake, not just probability.
P1 and P2 together (the top 30% of accounts by risk) concentrate over 70% of modeled expected churn loss. Focusing sales attention here is not just analytically optimal; it is the only defensible allocation of finite outreach capacity.
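The percentile tiering and expected-loss ranking can be sketched as follows. The top-10% P1 boundary and top-30% P1+P2 boundary come from the text; the 50th/70th-percentile cut-offs for P3/P4 are illustrative assumptions:

```python
import pandas as pd

def assign_risk_tiers(df: pd.DataFrame) -> pd.DataFrame:
    """Bin churn probabilities into P1-P4 by percentile (so P1 is always the
    top 10% of risk regardless of calibration shifts), then rank accounts by
    expected loss = sales_win * churn_prob."""
    out = df.copy()
    pct = out["churn_prob"].rank(pct=True)  # 0..1 percentile of risk
    out["tier"] = pd.cut(pct, bins=[0, 0.50, 0.70, 0.90, 1.0],
                         labels=["P4", "P3", "P2", "P1"])
    out["expected_loss"] = out["sales_win"] * out["churn_prob"]
    # Sort so the P1 table leads with the accounts where intervention
    # has the largest financial consequence
    return out.sort_values("expected_loss", ascending=False)
```

Because the bins are percentile-based, the tier populations stay fixed in relative size even if the model's probability scale drifts between retrainings.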
The three-layer playbook was the core design innovation of this workstream. Rather than assigning a single action per customer, the playbook stacks three independent recommendations into a unified per-account card:
| Layer | Source | Tells Sales Team | Example |
|---|---|---|---|
| Layer 1 · Risk | Churn probability tier (P1–P4) | Who to reach out to first and with what urgency | High-touch: AM call now + incentive/replenishment |
| Layer 2 · Customer | Customer cluster (C0–C3) | How to communicate based on relationship stage | Retain & grow: AM call + premium service + early reorder |
| Layer 3 · Industry | Industry cluster (I0–I5) | When to engage based on sector demand cycles | Steady high-volume: stock alerts + VMI/blanket PO |
An example playbook card from the delivered dashboard:
- Risk: Low-touch: automated nurture sequence
- Customer: Retain & grow: AM call + premium service + early reorder/stock hold
- Industry: Baseline stable: standard cadence + EOQ reminders + standing orders
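Assembling a card is then three dictionary lookups. The action strings below are taken from the tables above, but the specific cluster-to-action mapping (e.g. C0 to "Retain & grow") is a hypothetical illustration:

```python
# Hypothetical action lookups, populated from the strategy tables above
RISK_ACTIONS = {
    "P1": "High-touch: AM call now + incentive/replenishment",
    "P4": "Low-touch: automated nurture sequence",
}
CUSTOMER_ACTIONS = {
    "C0": "Retain & grow: AM call + premium service + early reorder/stock hold",
}
INDUSTRY_ACTIONS = {
    "I0": "Baseline stable: standard cadence + EOQ reminders + standing orders",
}

def playbook_card(customer_id: str, tier: str, c_cluster: str, i_cluster: str) -> dict:
    """Stack the three independent recommendation layers into one per-account card."""
    return {
        "customer_id": customer_id,
        "risk": RISK_ACTIONS[tier],               # Layer 1: who / urgency
        "customer": CUSTOMER_ACTIONS[c_cluster],  # Layer 2: how
        "industry": INDUSTRY_ACTIONS[i_cluster],  # Layer 3: when
    }
```

Because the three layers are independent lookups, any layer can be retrained or re-tiered without touching the other two.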
Interactive Dashboard Delivery
All clustering, churn, and playbook outputs were consolidated into a single self-contained HTML dashboard, with no server required, deployable locally by the Vector sales team. The dashboard spans seven sections, including key KPIs, revenue and churn patterns, segmentation overviews, at-risk prioritization tables, a Customer × Industry churn heatmap, and a searchable playbook lookup tool where any account manager can type a customer ID and instantly see their three-layer playbook, current risk tier, cluster assignments, and last three order records.
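One way to achieve a no-server deliverable is to inline every pre-computed output as a JSON payload inside the HTML template. The `/*__DATA__*/` placeholder convention here is an assumption for illustration, not the project's actual template:

```python
import json

def build_dashboard(playbook: dict, template: str) -> str:
    """Embed pre-computed model outputs as a JS constant so the HTML file is
    fully self-contained: no server, no dependencies, no external requests."""
    payload = json.dumps(playbook)
    return template.replace("/*__DATA__*/", f"const PLAYBOOK = {payload};")
```

The page's own JavaScript can then drive the lookup tool, tables, and heatmap entirely from the embedded `PLAYBOOK` object.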
Forecasting: Industry & Customer Growth
The forecasting track ran in parallel to clustering and produced two complementary deliverables: industry-level seasonality forecasts and customer-level growth potential segmentation. Together, they translate historical patterns into forward-looking revenue signals that feed directly into the playbook's timing and prioritization logic.
Industry Seasonality
STL decomposition was applied to the top six industries to separate trend, seasonality, and residual noise. This diagnostic step determined which sectors are reliably forecastable and informed model selection: ETS for industries with stable, repeatable seasonality (Water & Waste Water, Dealers/Resellers) and SARIMAX for sectors with irregular fluctuations (Chemical, Food & Beverage). Water & Waste Water emerged as the highest long-term growth opportunity, while Food & Beverage and Chemical warrant cautious resource allocation given post-2024 softening signals.
Customer Growth Potential
At the customer level, each account was assigned ETS or SARIMAX based on its individual seasonality strength, then classified into a BCG-style matrix (Stars, Cash Cows, Question Marks, and Dogs) using forecasted revenue and CAGR. Scenario simulations (Base / High / Low) applied quadrant-specific uplift logic: +10% for Stars, +30% for Question Marks, damped trend for Cash Cows, and a modest decline for Dogs. Stars dominate total forecasted revenue (~$19–21M), while Question Marks show the highest relative sensitivity to execution, validating their role as the primary growth lever under high-performance conditions.
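The quadrant assignment and High-scenario uplifts can be sketched as below. The revenue and growth thresholds are assumptions; the Star (+10%) and Question Mark (+30%) factors come from the project, while the Cash Cow ("damped trend") and Dog ("modest decline") factors are placeholders:

```python
def bcg_quadrant(forecast_revenue: float, cagr: float,
                 rev_threshold: float, growth_threshold: float) -> str:
    """Classify an account into a BCG-style quadrant from forecasted
    revenue (size) and CAGR (growth)."""
    high_rev = forecast_revenue >= rev_threshold
    high_growth = cagr >= growth_threshold
    if high_rev and high_growth:
        return "Star"
    if high_rev:
        return "Cash Cow"
    if high_growth:
        return "Question Mark"
    return "Dog"

# High-scenario uplift factors; Cash Cow and Dog values are placeholders
HIGH_UPLIFT = {"Star": 1.10, "Question Mark": 1.30, "Cash Cow": 1.00, "Dog": 0.95}

def high_scenario(forecast_revenue: float, quadrant: str) -> float:
    """Apply the quadrant-specific uplift for the High scenario."""
    return forecast_revenue * HIGH_UPLIFT[quadrant]
```

The Question Marks' +30% factor is why they show the highest relative sensitivity to execution: their High-scenario gap over Base is the largest of any quadrant.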
The forecasting outputs fed directly into the industry cluster strategies (Layer 3 of the playbook): account managers engaging Chemical or Dealers/Resellers customers received timing guidance calibrated to those sectors' seasonal demand peaks, adding a temporal dimension the cluster-only playbook could not provide on its own.
Key Learnings & Reflection
- 01 Clustering for business use is fundamentally a translation problem. Statistical validity (silhouette score, inertia) gets you to a good solution, but the real work is naming, profiling, and stress-testing clusters against human intuition until they feel like things a sales manager already knows, just with clearer boundaries and better labels.
- 02 The three-layer playbook architecture was more valuable than any individual model. Churn probability alone tells you who is at risk; cluster membership tells you how they behave and what they respond to; industry cluster tells you when to act. Combining all three into a single per-account card removed the need for sales managers to reason across three separate outputs.
- 03 Expected loss (sales_win × churn_prob) was a more actionable risk metric than churn probability alone. A customer with 99% churn probability but $200 in trailing revenue warrants less urgency than one at 60% probability with $1.6M at stake. Ranking the at-risk list by expected loss focused the P1 table on accounts where intervention had genuine financial consequence.
- 04 Sponsor deployment constraints shaped technical decisions throughout. The requirement to deliver a single HTML file (no server, no dependencies, no login) meant every model output had to be pre-computed and embedded. This was a useful forcing function: it prevented over-engineering and kept the focus on what decision the sales team actually needed to make each week.
- 05 The U-shaped churn distribution was the most important EDA finding of the project. It told us the portfolio was already polarized: most customers were either reliably loyal or reliably dormant, with relatively few in the middle. This justified the tiered intervention model over broad campaigns, and shaped how we framed the business case to the sponsor.