Sep to Nov 2024 · R

Ad Click-Through Rate Prediction

A Kaggle-style competition predicting display advertisement CTR, starting from linear regression baselines and iterating up to Generalized Additive Models, with model performance evaluated using Root Mean Squared Error.

Initial Model · Linear Regression
0.16284
Public RMSE  ·  Private: 0.14759
Final Model · GAM ↓ 51.7%
0.07864
Public RMSE  ·  Private: 0.05916

Problem Context & Objective

Display advertising performance is commonly evaluated using Click-Through Rate (CTR), a metric reflecting how effectively an ad attracts user engagement. This project was a Kaggle-style predictive modeling competition with the goal of predicting CTR from features describing ad quality, relevance, placement, audience, and content.

Model performance was evaluated using Root Mean Squared Error (RMSE), with lower values indicating better predictive accuracy. The goal was not only to minimize prediction error, but to understand how different modeling approaches perform on real-world advertising data.

The modeling strategy was explicitly iterative: each model was expected to build on the weaknesses identified in the previous one, with RMSE as the guiding metric at every stage.
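Since RMSE drove every iteration, it helps to pin down the metric concretely; a minimal sketch (the helper name is illustrative, not from the project code):

```r
# RMSE: root of the mean squared difference between actual and predicted values
rmse <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2, na.rm = TRUE))
}

rmse(c(0.10, 0.20, 0.30), c(0.12, 0.18, 0.33))  # ≈ 0.0238
```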

Data Overview

The dataset consists of display advertisement records with a unique ad_id and 27 mixed-type features spanning numeric, binary, and categorical domains. Features capture multiple dimensions of ad performance.

Theme                  Example Variables                             Type
Targeting & Relevance  targeting_score, contextual_relevance         Numeric
Creative Quality       visual_appeal, headline_length, cta_strength  Numeric
Placement & Delivery   position_on_page, ad_format, time_of_day      Categorical
Audience & Context     device_type, market_saturation, ad_frequency  Mixed
Sentiment              headline_sentiment, body_sentiment            Numeric
Target                 CTR                                           Continuous
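Before modeling, a quick tally of column classes confirms this numeric/categorical mix; sketched here on a tiny synthetic stand-in for the real 27-feature dataset:

```r
# Tiny stand-in frame mirroring the feature themes above (synthetic values)
toy <- data.frame(
  targeting_score  = c(0.82, 0.64, 0.91),       # numeric
  position_on_page = c('top', 'side', 'top'),   # categorical
  CTR              = c(0.12, 0.07, 0.15)        # continuous target
)

# Count columns by storage class
table(sapply(toy, class))
```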

Baseline Modeling: Linear Regression

To establish a reference point, I began with simple and multiple linear regression models. These baseline models validated data integrity and provided early insight into linear relationships between ad features and CTR.

Significance Breakdown

Regression results revealed varying levels of predictor significance, informing which features to include in subsequent model iterations:

Significance Code            Variable Count  Example
Highly significant (***)     6               targeting_score, visual_appeal
Moderately significant (**)  1               cta_strength
Mildly significant (*)       3               position_on_page
Weak significance (.)        6               headline_sentiment
# Fit full linear model as baseline
data_CTR <- read.csv('analysis_data.csv')
modelall <- lm(CTR ~ ., data = data_CTR, na.action = na.exclude)
summary(modelall)

# Compute RMSE on training set
pred_lm <- predict(modelall, data_CTR)
rmse_lm <- sqrt(mean((data_CTR$CTR - pred_lm)^2, na.rm = TRUE))
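The significance breakdown above can be read directly off summary(lm) output by bucketing coefficient p-values; a sketch on the built-in mtcars data (the project model used the 27 ad features instead):

```r
# Fit a small toy model and extract coefficient p-values
fit <- lm(mpg ~ wt + hp + qsec, data = mtcars)
p   <- summary(fit)$coefficients[-1, 'Pr(>|t|)']   # intercept dropped

# Bucket p-values into the familiar significance codes
cut(p, breaks = c(0, 0.001, 0.01, 0.05, 0.1, 1),
    labels = c('***', '**', '*', '.', 'n.s.'))
```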

All four linear model iterations produced RMSE > 0.1, clearly insufficient for the competition target. This motivated a move to nonlinear approaches.

Handling Missing Data & Model Consistency

Before advancing to more complex models, I built a consistent preprocessing pipeline to ensure fair comparisons across model families.

Numeric variables were imputed with the median, categorical variables with the mode, and the ad ID was excluded from modeling. The same pipeline was applied identically to both training and scoring datasets to prevent any leakage or mismatch, a step that turned out to matter more than expected when comparing models later.
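The pipeline can be sketched as a single helper whose fill values always come from the training data, so training and scoring sets receive the identical transformation (function and variable names here are illustrative, not the original code):

```r
# Median/mode imputation; fill values are computed from `ref` (the training set)
# so exactly the same fills can be applied to the scoring set.
impute_like_train <- function(df, ref) {
  for (col in names(df)) {
    fill <- if (is.numeric(ref[[col]])) {
      median(ref[[col]], na.rm = TRUE)
    } else {
      tab <- table(ref[[col]])
      names(tab)[which.max(tab)]        # mode of the categorical column
    }
    df[[col]][is.na(df[[col]])] <- fill
  }
  df
}

# train_clean   <- impute_like_train(train_raw,   ref = train_raw)
# scoring_clean <- impute_like_train(scoring_raw, ref = train_raw)  # same fills
```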

Model Iteration: Random Forest

After establishing linear baselines, I implemented a Random Forest using the ranger package, training a forest with 1,000 trees for efficiency and stability.

The model used a core set of predictors spanning ad quality, content, and delivery context: targeting_score, visual_appeal, cta_strength, position_on_page, ad_format, device_type, and ad_frequency.

# Train Random Forest with ranger
library(ranger)
library(dplyr)

forest_ranger <- ranger(
  CTR ~ targeting_score + visual_appeal + cta_strength +
        position_on_page + ad_format + device_type + ad_frequency,
  data = train_clean,
  num.trees = 1000
)

# Generate predictions on scoring data
pred_rf <- predict(forest_ranger, data = scoring_clean)$predictions

# Build submission, replacing any remaining NA predictions with the median
submission <- scoring_data |>
  mutate(CTR = pred_rf) |>
  mutate(CTR = ifelse(is.na(CTR), median(CTR, na.rm = TRUE), CTR))

A public RMSE of roughly 0.10 was a meaningful step forward, but the nonlinear structure of the CTR data suggested further gains were possible by modeling smooth predictor relationships explicitly.

Final Model: Generalized Additive Models

The key insight motivating GAM was that several predictors, particularly targeting_score and visual_appeal, displayed diminishing returns and nonlinear patterns that linear or tree-based models couldn't fully exploit.

I trained a GAM via mgcv using smoothing splines for continuous predictors with nonlinear relationships, while keeping categorical variables in linear form. The REML method was used for smoothing parameter estimation.

# Fit Generalized Additive Model
library(mgcv)

gam_model <- gam(
  CTR ~ s(targeting_score) + s(visual_appeal) + contextual_relevance +
        s(headline_length) + s(cta_strength) +
        position_on_page + ad_format + device_type,
  data   = data_clean,
  method = 'REML'
)

# Apply the shared preprocessing recipe, then predict on the scoring dataset
scoring_clean <- bake(data_recipe, new_data = scoring_data)
pred2 <- predict(gam_model, newdata = scoring_clean)

# Build submission file
submission_file <- data.frame(id = scoring_data$id, CTR = pred2)
write.csv(submission_file, 'sample_submission.csv', row.names = FALSE)

Why GAM outperformed Random Forest

Smoothing splines in GAM let each predictor "speak for itself": they can model plateaus, accelerating returns, or threshold effects. For advertising CTR, where increasing targeting_score beyond a point may have diminishing returns, this flexibility is directly relevant. Random forests can approximate such shapes, but GAMs make the structure explicit and interpretable.
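A small synthetic check illustrates the point: on a concave, diminishing-returns response, a spline smooth captures curvature that a straight line cannot (data and variable names here are illustrative; requires mgcv):

```r
library(mgcv)

set.seed(1)
x <- runif(300)
y <- sqrt(x) + rnorm(300, sd = 0.05)   # concave "diminishing returns" curve

fit_gam <- gam(y ~ s(x), method = 'REML')   # spline smooth
fit_lm  <- lm(y ~ x)                        # straight line

# In-sample RMSE: the smooth tracks the curvature, the line cannot
c(gam = sqrt(mean(residuals(fit_gam)^2)),
  lm  = sqrt(mean(residuals(fit_lm)^2)))
```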

Results & Ranking

Linear (baseline): 0.163  ·  Random Forest: ~0.100  ·  GAM (final): 0.079  ·  Private RMSE: 0.059

Final rank: 18 / 387 · Top 6% of participants

Final Leaderboard
Model              Public RMSE  Private RMSE  Notes
Linear Regression  0.16284      0.14759       Baseline; all predictors
Random Forest      ~0.10        n/a           ranger, 1000 trees
GAM (final)        0.07864      0.05916       mgcv, REML, smooth splines

Key Learnings & Reflection

  • 01 Linear models are valuable for establishing baselines, but are often insufficient for capturing real-world behavioral patterns like nonlinear responses and diminishing returns in ad performance.
  • 02 Missing data handling can quietly disrupt feature selection and model compatibility if preprocessing is not standardized identically across training and scoring datasets.
  • 03 More complex models do not automatically lead to better performance. Meaningful gains require disciplined iteration, diagnostic visualization, and alignment between model assumptions and data structure.
  • 04 GAMs provide an underutilized middle ground: more interpretable than black-box ensembles, while capturing nonlinear patterns that linear models miss entirely.

Next Steps

Exploration
Invest more time in EDA and feature engineering, particularly interaction terms between placement and device type.
Modeling
Explore gradient-boosted models (XGBoost, LightGBM) to better balance bias and variance while maintaining generalization.