Sep to Dec 2025 Python · Excel

Maximizing Lifetime Return
Through Data-Driven Insights

A full-stack analytics capstone for Vector Controls and Automation Group building a repeatable, sponsor-deployed system that tells the sales team who to prioritize, how to engage, and when to act, powered by behavioral clustering, churn modeling, and industry-level forecasting.

Project Context & Business Problem

Vector Controls and Automation Group (Vector CAG) is an industrial distributor across sectors including Chemical, Oil & Gas, Water & Waste Water, and Food & Beverage. Despite a growing revenue base, the sales team operated with limited analytical visibility, as they had no systematic way to know which customers were drifting toward churn, which industries were peaking or declining, or how to calibrate outreach across a portfolio of thousands of accounts.

The core business problem was twofold. First, revenue was heavily concentrated: a small loyal segment drove the majority of sales, while the bulk of customers purchased irregularly, with no early-warning system to flag deteriorating engagement. Second, sales effort was allocated intuitively rather than analytically, meaning high-value accounts at risk were often treated identically to stable low-value ones.

The overarching goal: build a repeatable, data-driven system that guides the Vector sales team on who to prioritize and when to engage.

Full Project Workflow

The project ran across four months in two modeling tracks clustering and forecasting integrated into a single interactive deliverable. Below is the end-to-end pipeline from raw data to sponsor handoff.

01
Data Acquisition & Scoping Sep 2025
Received raw transactional data from Vector CAG covering 2019 2024. Held sponsor kickoff to align on business questions, define success criteria, and establish data use constraints. Scoped the two modeling tracks: behavioral clustering + churn modeling, and industry-level seasonality + customer growth forecasting.
02
Data Preparation & EDA Sep Oct 2025
Cleaned and merged transaction records. Engineered RFM-style features: recency (days since last order), frequency (order count), monetary value (total and average), tenure, and order volatility.
03
Customer & Industry Clustering My Lead
Led the full clustering workstream. Applied K-Means (K=4) to segment customers into behavioral archetypes and K-Means (K=6) to segment industries into structural market profiles. Validated cluster stability using elbow criterion and silhouette scoring. Produced interpretable, named cluster profiles mapped to engagement strategies.
04
Churn Modeling & Risk Tiering My Lead
Built a monthly panel dataset capturing recency, frequency, monetary, and volatility features per customer. Trained and compared Logistic Regression vs. HistGradientBoosting classifiers. Evaluated on ROC-AUC, precision, and recall. Scored all 2,943 customers with churn probabilities, then binned into P1 P4 risk tiers with expected loss (sales_win × churn_prob) quantified per band.
05
Three-Layer Playbook Design My Lead
Designed the playbook architecture that joins risk tier (Layer 1), customer cluster (Layer 2), and industry cluster (Layer 3) into a single per-account action template. Every customer in the portfolio receives a unique, structured playbook combining outreach intensity, communication style, and industry timing guidance.
06
Industry Forecasting Forecasting Track
Applied STL decomposition to assess trend, seasonality, and forecastability across 12 industries. Selected six priority sectors. Compared ETS vs. SARIMAX models using AIC and volatility ratio to produce 24-month forward revenue forecasts. Mapped industries onto a Growth vs. Stability matrix.
07
Customer-Level Growth Potential Forecasting Track
Forecasted individual customer revenue trajectories using ETS or SARIMAX depending on seasonality strength. Classified customers into a BCG-style matrix (Stars, Cash Cows, Question Marks, Dogs) based on forecasted CAGR and revenue. Ran Base / High / Low scenario simulations to quantify upside sensitivity by segment.
08
Dashboard Build & Sponsor Delivery Dec 2025
Integrated all model outputs into a single self-contained HTML dashboard deployable without a server. Includes interactive customer playbook lookup, KPI monitoring, risk band tables, segmentation overviews, and churn heatmaps. Delivered with full documentation and a sponsor-facing refresh guide.

Data & Exploratory Analysis

The dataset spans six years (2019 2024) of internal order records from Vector CAG, covering 9,000+ unique customers across 12 industry verticals. Each record contains customer ID, order date, order value, product category, and industry classification.

Feature Engineering

Raw transactional records were aggregated to customer-level behavioral features before any modeling. The key engineered dimensions:

FeatureDefinitionWhy It Matters
recency Days since most recent order Primary churn signal long gaps indicate disengagement
frequency Total order count Engagement depth; key driver of loyalty segmentation
monetary Cumulative and average order value Revenue importance; drives expected loss calculation
tenure Days since first order Relationship maturity; differentiates new vs. established accounts
volatility Std dev of monthly order gaps Purchasing stability; erratic patterns signal structural risk
sales_win Rolling 6-month revenue Used in expected loss = sales_win × churn_prob

Key EDA Findings

Exploratory analysis before any modeling surfaced three structural patterns that shaped every downstream decision:

Finding 01
Revenue Concentration
The top ~30% of customers by spend drove the vast majority of revenue. The remaining 70% showed low, irregular activity a long tail of infrequent buyers with no clear churn signal from revenue alone. This made RFM-based behavioral segmentation essential; revenue figures alone were too noisy to prioritize.
Finding 02
U-Shaped Churn Distribution
When churn probabilities were modeled, the distribution was strongly U-shaped: a large share of customers scored either very low (<0.2) or very high (>0.8) risk, with relatively few in the middle. This polarization confirmed that broad outreach campaigns would be wasteful the portfolio demanded segmented, tiered intervention rather than uniform treatment.
Finding 03
Revenue Trend is Stable, Not Declining
Monthly revenue showed a gradual upward trajectory with mild seasonality but no structural decline. This was a critical business context finding: the problem was not recovery from loss, but preventing future churn in a portfolio that was otherwise healthy. Retention priority should focus on protecting the upward trend, not reversing a downturn.

Clustering Customer & Industry Segmentation

I led the full clustering workstream, responsible for feature design, algorithm selection, K validation, cluster profiling, naming, and strategy mapping. The goal was not only classification, but also producing segments interpretable enough to inform daily sales decisions without data science mediation.

Customer Segmentation (K-Means, K=4)

Customer clustering was built on five standardized behavioral features: recency, frequency, total revenue, average order value, and tenure. I evaluated cluster solutions from K=2 through K=8 using both the elbow criterion (within-cluster sum of squares) and silhouette score to measure separation quality.

K=4 was selected over higher-K solutions despite marginal silhouette improvements at K=5 6, because interpretability and cluster stability were the primary constraints. Each of the four clusters needed to map cleanly to a recognizable customer relationship stage that sales managers could act on without a data dictionary.

The four clusters, profiled on median behavioral metrics and named for interpretability:

C0 · ~40% of customers
Large, Low-Value & Inactive
Med. Orders 3
Med. Revenue $5,619
Med. Recency 1,322 days
Med. Tenure 134 days
Largest group by headcount. Very high recency (long-dormant), very low order frequency. Candidates for low-cost reactivation campaigns win-back sequences, replenishment reminders, or structured check-ins to gauge whether accounts are salvageable.
C2 · ~25% of customers
Moderate-Value Newer
Med. Orders 1
Med. Revenue $1,010
Med. Recency 410 days
Med. Tenure 0 days
Newer or recently acquired accounts with limited purchasing history. Zero tenure indicates many are within their first order cycle. Growth opportunity if onboarded effectively. Strategy: usage tips, frequency cadence nudges, cross-sell starter bundles, and structured first-90-days engagement sequences.
C3 · ~2% of customers
High-Return / Unstable
Med. Orders 4
Med. Revenue $647
Med. Recency 1,264 days
Med. Tenure 105 days
Small but structurally unusual cluster showing high return rates relative to revenue and erratic ordering patterns. May reflect problematic accounts, product-fit issues, or procurement volatility. Strategy: root-cause review, SKU replacement discussion, credit/lead-time checks before investing further outreach.

A key design principle: the cluster labels (C0 C3) map to relationship lifecycle stages, not just statistical groupings. This made them immediately actionable sales managers could read "High-Value Loyal" and know exactly what playbook layer it triggered, without consulting the underlying data.

Customer Cluster Profile Heatmap (Z-scored means, K=4)
Customer Cluster Profile Z-scored feature means across K=4 clusters

Industry Segmentation (K-Means, K=6)

A parallel clustering exercise was conducted at the industry level. Rather than individual customers, features were aggregated by industry vertical median order count, revenue, average order value, recency, tenure, and order volatility capturing structural market-level purchasing patterns.

K=6 was selected after evaluating K=3 through K=8. The six-cluster solution produced clearly distinct market profiles that aligned with intuitive business categories, enabling sector-specific coverage and pricing strategies.

ClusterCustomersMed. RevenueMed. OrdersTenureStrategic Profile
I0 · Baseline Stable 583 $53,268 27 1,571 days Standard cadence + EOQ reminders + standing orders
I1 · Project-Based 493 $130,189 42 1,810 days Pre-bid/bid reminders + bulk pricing + delivery scheduling
I2 · Fragmented/Small 96 $54,551 21 1,410 days Micro-offers + lower MOQs + simplified SKU guides
I3 · Unstable/Irregular 428 $1,748 2 0 days Risk review + credit/lead-time checks + small test orders
I4 · Steady High-Volume 1,014 $66,954 27 1,746 days Stock alerts + VMI/blanket PO + auto-replenishment
I5 · Emerging/Niche 329 $28,667 14 1,276 days Trial bundles + sampling/case study + feedback loop

Revenue concentration was stark: over 75% of total industry revenue came from Baseline Stable (I0), Project-Based (I1), and Steady High-Volume (I4) sectors. The Unstable/Irregular cluster (I3) showed near-zero tenure, suggesting a large cohort of newly acquired or dormant industry accounts that had never developed a stable purchasing relationship.

Industry Cluster Profile (Z-scored Values)
Industry Cluster Profile Z-scored feature means across K=6 clusters

Churn Modeling

With behavioral segments established, the next step was predicting churn probability for each customer. Churn was defined behaviorally a customer was labeled churned in a given month if they placed no order for 90+ days following a period of active engagement. This definition was calibrated with the sponsor to reflect realistic sales cycle lengths.

Panel Build
Monthly Feature Panel
Constructed a customer-month panel with rolling features: days since last order, order count in trailing 3/6/12 months, revenue in trailing windows, order gap volatility (standard deviation of inter-order intervals), and cluster membership. This longitudinal structure allowed the model to learn time-varying churn signals rather than static snapshots.
Model Compare
Logistic Regression vs. HistGradientBoosting
Logistic Regression served as the interpretable baseline, providing clear coefficient-level insight into which features drove churn probability. HistGradientBoosting was the primary candidate handling class imbalance, nonlinear interactions, and missing values natively. Both were evaluated on held-out months using ROC-AUC as the primary metric.
Result
HistGradientBoosting Selected · AUC 0.84
HistGradientBoosting achieved ROC-AUC of 0.84, exceeding the target benchmark of ≥0.80, with meaningfully better precision at high-probability thresholds. The final model scored all 2,943 active customers with a continuous churn probability p, forming the foundation of the risk-tier architecture. Model coverage was 100% with no null predictions.
Avg Predicted Churn Probability by Customer Cluster
Avg predicted churn probability by customer cluster C1 (High-Value Loyal) shows the lowest risk at ~0.45
Avg Predicted Churn Probability by Industry Cluster
Avg predicted churn probability by industry cluster I3 (Project-Based) peaks at ~0.86
Churn Probability by Customer × Industry Cluster
Churn probability heatmap Customer × Industry cluster intersections. Highest risk: C2/C3 × I3/I4 at 0.87 0.95

Risk Tiers & Three-Layer Playbook

Churn probabilities were binned into four priority tiers based on the relative risk distribution. The tiering logic was deliberately percentile-based rather than fixed-threshold, ensuring that the P1 group always represented the top 10% of risk regardless of overall model calibration shifts.

Expected loss per customer was computed as sales_win × churn_prob where sales_win was the trailing 6-month revenue a business-relevant metric that weights churn risk by actual revenue at stake, not just probability.

P1 Top 10%
295 customers · $526K loss
High-touch: AM call now
P2 10 30%
588 customers · $3.8M loss
Mid-touch: targeted outreach
P3 30 60%
883 customers · $8.0M loss
Low-touch: automated nurture
P4 60 100%
1,177 customers · $8.0M loss
Monitor only

P1 and P2 together the top 30% of accounts by risk concentrate over 70% of modeled expected churn loss. Focusing sales attention here is not just analytically optimal; it is the only defensible allocation of finite outreach capacity.

The three-layer playbook was the core design innovation of this workstream. Rather than assigning a single action per customer, the playbook stacks three independent recommendations into a unified per-account card:

LayerSourceTells Sales TeamExample
Layer 1 · Risk Churn probability tier (P1 P4) Who to reach out to first and with what urgency High-touch: AM call now + incentive/replenishment
Layer 2 · Customer Customer cluster (C0 C3) How to communicate based on relationship stage Retain & grow: AM call + premium service + early reorder
Layer 3 · Industry Industry cluster (I0 I5) When to engage based on sector demand cycles Steady high-volume: stock alerts + VMI/blanket PO

An example playbook card from the delivered dashboard:

Customer #51166
P3 30 60%
Churn Probability
61.4%
Sales Win (6m)
$1,647,293
Expected Loss
$1,012,145
Clusters
Cust 1 · Ind 0
  • Risk: Low-touch: automated nurture sequence
  • Customer: Retain & grow AM call + premium service + early reorder/stock hold
  • Industry: Baseline stable standard cadence + EOQ reminders + standing orders
Latest 3 Order Dates & Totals
2024-05-16$538,964
2024-05-13$1,108,329
2023-10-24$604,878

Interactive Dashboard Delivery

All clustering, churn, and playbook outputs were consolidated into a single self-contained HTML dashboard no server required, deployable locally by the Vector sales team. The dashboard includes seven sections: key KPIs, revenue and churn patterns, segmentation overviews, at-risk prioritization tables, a Customer × Industry churn heatmap, and a searchable playbook lookup tool where any account manager can type a customer ID and instantly see their three-layer playbook, current risk tier, cluster assignments, and last three order records.

Live Dashboard · Vector CAG Churn & Retention Scroll within to explore

Forecasting Industry & Customer Growth

The forecasting track ran in parallel to clustering and produced two complementary deliverables: industry-level seasonality forecasts and customer-level growth potential segmentation. Together, they translate historical patterns into forward-looking revenue signals that feed directly into the playbook's timing and prioritization logic.

Industry Seasonality

STL decomposition was applied to the top six industries to separate trend, seasonality, and residual noise. This diagnostics step determined which sectors are reliably forecastable and informed model selection ETS for industries with stable, repeatable seasonality (Water & Waste Water, Dealers/Resellers) and SARIMAX for sectors with irregular fluctuations (Chemical, Food & Beverage). Water & Waste Water emerged as the highest long-term growth opportunity, while Food & Beverage and Chemical warrant cautious resource allocation given post-2024 softening signals.

Seasonality & Trend Top 6 Industries (STL Overview)
STL decomposition across top 6 industries separating trend, seasonality, and residual noise to assess forecastability and guide model selection

Customer Growth Potential

At the customer level, each account was assigned ETS or SARIMAX based on its individual seasonality strength, then classified into a BCG-style matrix Stars, Cash Cows, Question Marks, and Dogs using forecasted revenue and CAGR. Scenario simulations (Base / High / Low) applied quadrant-specific uplift logic: +10% for Stars, +30% for Question Marks, damped trend for Cash Cows, and a modest decline for Dogs. Stars dominate total forecasted revenue (~$19 21M), while Question Marks show the highest relative sensitivity to execution, validating their role as the primary growth lever under high-performance conditions.

Scenario Forecast by Segment
Scenario forecast by BCG quadrant (left) and sensitivity analysis High & Low vs Base (right). Stars drive absolute revenue; Question Marks carry the highest relative upside at +17.6%

The forecasting outputs fed directly into the industry cluster strategies (Layer 3 of the playbook) account managers engaging Chemical or Dealers/Resellers customers received timing guidance calibrated to those sectors' seasonal demand peaks, adding a temporal dimension the cluster-only playbook could not provide on its own.

Key Learnings & Reflection

  • 01 Clustering for business use is fundamentally a translation problem. Statistical validity (silhouette score, inertia) gets you to a good solution, but the real work is naming, profiling, and stress-testing clusters against human intuition until they feel like things a sales manager already knows just with clearer boundaries and better labels.
  • 02 The three-layer playbook architecture was more valuable than any individual model. Churn probability alone tells you who is at risk; cluster membership tells you how they behave and what they respond to; industry cluster tells you when to act. Combining all three into a single per-account card removed the need for sales managers to reason across three separate outputs.
  • 03 Expected loss (sales_win × churn_prob) was a more actionable risk metric than churn probability alone. A customer with 99% churn probability but $200 in trailing revenue warrants less urgency than one at 60% probability with $1.6M at stake. Ranking the at-risk list by expected loss focused the P1 table on accounts where intervention had genuine financial consequence.
  • 04 Sponsor deployment constraints shaped technical decisions throughout. The requirement to deliver a single HTML file no server, no dependencies, no login meant every model output had to be pre-computed and embedded. This was a useful forcing function: it prevented over-engineering and kept the focus on what decision the sales team actually needed to make each week.
  • 05 The U-shaped churn distribution was the most important EDA finding of the project. It told us the portfolio was already polarized most customers were either reliably loyal or reliably dormant, with relatively few in the middle. This justified the tiered intervention model over broad campaigns, and shaped how we framed the business case to the sponsor.

Next Steps & Scalability Path

CRM Integration
Embed cluster labels, churn scores, and playbook assignments into Salesforce or the sponsor's CRM for daily visibility without manual dashboard refresh.
Automated Refresh
Build a monthly pipeline that reruns feature engineering, re-scores churn probabilities, and regenerates the dashboard HTML executable by a non-technical user from a documented script.
Early Warning Triggers
Develop threshold-based alerts for sudden churn probability spikes or cluster migrations e.g., a C1 High-Value customer moving toward C0 Inactive behavior as automated email flags to account managers.
Model Enhancement
Incorporate price elasticity signals, product-level purchase patterns, and macroeconomic indicators to improve churn model precision and reduce false positive outreach costs.