Sep to Nov 2024 · R

Ad Click-Through Rate Prediction

A Kaggle-style competition predicting display advertisement CTR, starting from linear regression baselines and iterating up to Generalized Additive Models, with model performance evaluated using Root Mean Squared Error.

Initial Model · Linear Regression
0.16284
Public RMSE  ·  Private: 0.14759
Final Model · GAM ↓ 51.7%
0.07864
Public RMSE  ·  Private: 0.05916

Problem Context & Objective

Display advertising performance is commonly evaluated using Click-Through Rate (CTR), a metric reflecting how effectively an ad attracts user engagement. This project was a Kaggle-style predictive modeling competition with the goal of predicting CTR from features describing ad quality, relevance, placement, audience, and content.

Model performance was evaluated using Root Mean Squared Error (RMSE), with lower values indicating better predictive accuracy. The goal was not only to minimize prediction error, but to understand how different modeling approaches perform on real-world advertising data.

The modeling strategy was explicitly iterative: each model was expected to build on the weaknesses identified in the previous one, with RMSE as the guiding metric at every stage.
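Since RMSE drove every iteration, it helps to pin down the metric concretely; a minimal sketch (the helper name is illustrative, not from the project code):

```r
# RMSE: root of the mean squared difference between actual and predicted values
rmse <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2, na.rm = TRUE))
}

rmse(c(0.10, 0.20, 0.30), c(0.12, 0.18, 0.33))  # ≈ 0.0238
```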

Data Overview

The dataset consists of display advertisement records with a unique ad_id and 27 mixed-type features spanning numeric, binary, and categorical domains. Features capture multiple dimensions of ad performance.

Theme                  Example Variables                             Type
Targeting & Relevance  targeting_score, contextual_relevance         Numeric
Creative Quality       visual_appeal, headline_length, cta_strength  Numeric
Placement & Delivery   position_on_page, ad_format, time_of_day      Categorical
Audience & Context     device_type, market_saturation, ad_frequency  Mixed
Sentiment              headline_sentiment, body_sentiment            Numeric
Target                 CTR                                           Continuous
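Before modeling, a quick tally of column classes confirms this numeric/categorical mix; sketched here on a tiny synthetic stand-in for the real 27-feature dataset:

```r
# Tiny stand-in frame mirroring the feature themes above (synthetic values)
toy <- data.frame(
  targeting_score  = c(0.82, 0.64, 0.91),       # numeric
  position_on_page = c('top', 'side', 'top'),   # categorical
  CTR              = c(0.12, 0.07, 0.15)        # continuous target
)

# Count columns by storage class
table(sapply(toy, class))
```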

Baseline Modeling: Linear Regression

To establish a reference point, I began with simple and multiple linear regression models. These baseline models validated data integrity and provided early insight into linear relationships between ad features and CTR.

Significance Breakdown

Regression results revealed varying levels of predictor significance, informing which features to include in subsequent model iterations:

Significance Code            Variable Count  Example
Highly significant (***)     6               targeting_score, visual_appeal
Moderately significant (**)  1               cta_strength
Mildly significant (*)       3               position_on_page
Weak significance (.)        6               headline_sentiment
# Fit full linear model as baseline
data_CTR <- read.csv('analysis_data.csv')
modelall <- lm(CTR ~ ., data = data_CTR, na.action = na.exclude)
summary(modelall)

# Compute RMSE on training set
pred_lm <- predict(modelall, data_CTR)
rmse_lm <- sqrt(mean((data_CTR$CTR - pred_lm)^2, na.rm = TRUE))
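The significance breakdown above can be read directly off summary(lm) output by bucketing coefficient p-values; a sketch on the built-in mtcars data (the project model used the 27 ad features instead):

```r
# Fit a small toy model and extract coefficient p-values
fit <- lm(mpg ~ wt + hp + qsec, data = mtcars)
p   <- summary(fit)$coefficients[-1, 'Pr(>|t|)']   # intercept dropped

# Bucket p-values into the familiar significance codes
cut(p, breaks = c(0, 0.001, 0.01, 0.05, 0.1, 1),
    labels = c('***', '**', '*', '.', 'n.s.'))
```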

All four linear model iterations produced RMSE > 0.1, clearly insufficient for the competition target. This motivated a move to nonlinear approaches.

Handling Missing Data & Model Consistency

Before advancing to more complex models, I built a consistent preprocessing pipeline to ensure fair comparisons across model families.

Numeric variables were imputed with the median, categorical variables with the mode, and the ad ID was excluded from modeling. The same pipeline was applied identically to both training and scoring datasets to prevent any leakage or mismatch, a step that turned out to matter more than expected when comparing models later.
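The pipeline can be sketched as a single helper whose fill values always come from the training data, so training and scoring sets receive the identical transformation (function and variable names here are illustrative, not the original code):

```r
# Median/mode imputation; fill values are computed from `ref` (the training set)
# so exactly the same fills can be applied to the scoring set.
impute_like_train <- function(df, ref) {
  for (col in names(df)) {
    fill <- if (is.numeric(ref[[col]])) {
      median(ref[[col]], na.rm = TRUE)
    } else {
      tab <- table(ref[[col]])
      names(tab)[which.max(tab)]        # mode of the categorical column
    }
    df[[col]][is.na(df[[col]])] <- fill
  }
  df
}

# train_clean   <- impute_like_train(train_raw,   ref = train_raw)
# scoring_clean <- impute_like_train(scoring_raw, ref = train_raw)  # same fills
```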

Model Iteration: Random Forest

After establishing linear baselines, I implemented a Random Forest using the ranger package, training a forest with 1,000 trees for efficiency and stability.

The model used a core set of predictors spanning ad quality, content, and delivery context: targeting_score, visual_appeal, cta_strength, position_on_page, ad_format, device_type, and ad_frequency.

# Train Random Forest with ranger
library(ranger)
library(dplyr)

forest_ranger <- ranger(
  CTR ~ targeting_score + visual_appeal + cta_strength +
        position_on_page + ad_format + device_type + ad_frequency,
  data = train_clean,
  num.trees = 1000
)

# Generate predictions on scoring data
pred_rf <- predict(forest_ranger, data = scoring_clean)$predictions

# Build submission, replacing any remaining NA predictions with the median
submission <- scoring_data |>
  mutate(CTR = pred_rf) |>
  mutate(CTR = ifelse(is.na(CTR), median(CTR, na.rm = TRUE), CTR))

A public RMSE of roughly 0.10 was a meaningful step forward, but the nonlinear structure of the CTR data suggested further gains were possible by modeling smooth predictor relationships explicitly.

Final Model: Generalized Additive Models

The key insight motivating GAM was that several predictors, particularly targeting_score and visual_appeal, displayed diminishing returns and nonlinear patterns that linear or tree-based models couldn't fully exploit.

I trained a GAM via mgcv using smoothing splines for continuous predictors with nonlinear relationships, while keeping categorical variables in linear form. The REML method was used for smoothing parameter estimation.

# Fit Generalized Additive Model
library(mgcv)

gam_model <- gam(
  CTR ~ s(targeting_score) + s(visual_appeal) + contextual_relevance +
        s(headline_length) + s(cta_strength) +
        position_on_page + ad_format + device_type,
  data   = data_clean,
  method = 'REML'
)

# Apply the shared preprocessing recipe, then predict on the scoring dataset
scoring_clean <- bake(data_recipe, new_data = scoring_data)
pred2 <- predict(gam_model, newdata = scoring_clean)

# Build submission file
submission_file <- data.frame(id = scoring_data$id, CTR = pred2)
write.csv(submission_file, 'sample_submission.csv', row.names = FALSE)

Why GAM outperformed Random Forest

Smoothing splines in GAM let each predictor "speak for itself": they can model plateaus, accelerating returns, or threshold effects. For advertising CTR, where increasing targeting_score beyond a point may have diminishing returns, this flexibility is directly relevant. Random forests can approximate such shapes, but GAMs make the structure explicit and interpretable.
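A small synthetic check illustrates the point: on a concave, diminishing-returns response, a spline smooth captures curvature that a straight line cannot (data and variable names here are illustrative; requires mgcv):

```r
library(mgcv)

set.seed(1)
x <- runif(300)
y <- sqrt(x) + rnorm(300, sd = 0.05)   # concave "diminishing returns" curve

fit_gam <- gam(y ~ s(x), method = 'REML')   # spline smooth
fit_lm  <- lm(y ~ x)                        # straight line

# In-sample RMSE: the smooth tracks the curvature, the line cannot
c(gam = sqrt(mean(residuals(fit_gam)^2)),
  lm  = sqrt(mean(residuals(fit_lm)^2)))
```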

Results & Ranking

Linear (baseline): 0.163  ·  Random Forest: ~0.100  ·  GAM (final): 0.079  ·  Private RMSE: 0.059

Final rank: 18 / 387 · Top 6% of participants

Final Leaderboard
Model              Public RMSE  Private RMSE  Notes
Linear Regression  0.16284      0.14759       Baseline; all predictors
Random Forest      ~0.10        n/a           ranger, 1000 trees
GAM (final)        0.07864      0.05916       mgcv, REML, smooth splines

Key Learnings & Reflection

  • 01 Linear models are valuable for establishing baselines, but are often insufficient for capturing real-world behavioral patterns like nonlinear responses and diminishing returns in ad performance.
  • 02 Missing data handling can quietly disrupt feature selection and model compatibility if preprocessing is not standardized identically across training and scoring datasets.
  • 03 More complex models do not automatically lead to better performance. Meaningful gains require disciplined iteration, diagnostic visualization, and alignment between model assumptions and data structure.
  • 04 GAMs provide an underutilized middle ground: more interpretable than black-box ensembles, while capturing nonlinear patterns that linear models miss entirely.

Next Steps

Exploration
Invest more time in EDA and feature engineering, particularly interaction terms between placement and device type.
Modeling
Explore gradient-boosted models (XGBoost, LightGBM) to better balance bias and variance while maintaining generalization.