Ad Click-Through Rate Prediction
A Kaggle-style competition predicting display advertisement CTR, starting from linear regression baselines and iterating through to Generalized Additive Models, with model performance evaluated using Root Mean Squared Error.
Problem Context & Objective
Display advertising performance is commonly evaluated using Click-Through Rate (CTR), a metric reflecting how effectively an ad attracts user engagement. This project was a Kaggle-style predictive modeling competition with the goal of predicting CTR based on features describing ad quality, relevance, placement, audience, and content.
Model performance was evaluated using Root Mean Squared Error (RMSE), with lower values indicating better predictive accuracy. The goal was not only to minimize prediction error, but to understand how different modeling approaches perform on real-world advertising data.
The modeling strategy was explicitly iterative: each model was expected to build on the weaknesses identified in the previous one, with RMSE as the guiding metric at every stage.
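For reference, the evaluation metric is simple to express as a small helper function (a sketch; the competition computed RMSE server-side):

```r
# RMSE = sqrt(mean((actual - predicted)^2)); lower is better.
rmse <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2, na.rm = TRUE))
}

# Perfect predictions give 0; a constant error of 0.1 gives RMSE 0.1
rmse(c(0.05, 0.10, 0.15), c(0.05, 0.10, 0.15))  # 0
rmse(c(0, 0), c(0.1, 0.1))                      # 0.1
```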
Data Overview
The dataset consists of display advertisement records with a unique ad_id and 27 mixed-type features spanning numeric, binary, and categorical domains. Features capture multiple dimensions of ad performance.
| Theme | Example Variables | Type |
|---|---|---|
| Targeting & Relevance | targeting_score, contextual_relevance | Numeric |
| Creative Quality | visual_appeal, headline_length, cta_strength | Numeric |
| Placement & Delivery | position_on_page, ad_format, time_of_day | Categorical |
| Audience & Context | device_type, market_saturation, ad_frequency | Mixed |
| Sentiment | headline_sentiment, body_sentiment | Numeric |
| Target | CTR | Continuous |
Baseline Modeling: Linear Regression
To establish a reference point, I began with simple and multiple linear regression models. These baseline models validated data integrity and provided early insight into linear relationships between ad features and CTR.
Significance Breakdown
Regression results revealed varying levels of predictor significance, informing which features to include in subsequent model iterations:
| Significance | Code | Variable Count | Example |
|---|---|---|---|
| Highly significant | *** | 6 | targeting_score, visual_appeal |
| Moderately significant | ** | 1 | cta_strength |
| Mildly significant | * | 3 | position_on_page |
| Weak significance | . | 6 | headline_sentiment |
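A tally like this can be reproduced directly from a fitted model's coefficient table. A sketch using a built-in dataset in place of the competition data (the breakpoints mirror R's standard significance codes):

```r
# Toy model standing in for the real baseline fit ('modelall' below)
fit <- lm(mpg ~ wt + hp + qsec, data = mtcars)
coefs <- summary(fit)$coefficients
p <- coefs[rownames(coefs) != "(Intercept)", "Pr(>|t|)"]

# Standard R bands: *** < 0.001, ** < 0.01, * < 0.05, . < 0.1
sig_code <- cut(p,
                breaks = c(0, 0.001, 0.01, 0.05, 0.1, 1),
                labels = c("***", "**", "*", ".", " "),
                include.lowest = TRUE)
table(sig_code)
```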
```r
# Fit full linear model as baseline
data_CTR <- read.csv('analysis_data.csv')
modelall <- lm(CTR ~ ., data = data_CTR, na.action = na.exclude)
summary(modelall)

# Compute RMSE on training set
pred_lm <- predict(modelall, data_CTR)
rmse_lm <- sqrt(mean((data_CTR$CTR - pred_lm)^2, na.rm = TRUE))
```
All four linear model iterations produced RMSE > 0.1, clearly insufficient for the competition target. This motivated a move to nonlinear approaches.
Handling Missing Data & Model Consistency
Before advancing to more complex models, I built a consistent preprocessing pipeline to ensure fair comparisons across model families.
Numeric variables were imputed with the median, categorical variables with the mode, and the ad ID was excluded from modeling. The same pipeline was applied identically to both training and scoring datasets to prevent any leakage or mismatch, a step that turned out to matter more than expected when comparing models later.
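The imputation logic can be sketched as a single function applied to both datasets (`impute_consistent` is an illustrative name, not the project's actual code):

```r
# Sketch: numeric -> median, categorical -> mode, applied identically
# to training and scoring data so the two stay compatible.
impute_consistent <- function(df) {
  mode_of <- function(x) names(which.max(table(x)))
  for (col in names(df)) {
    if (is.numeric(df[[col]])) {
      df[[col]][is.na(df[[col]])] <- median(df[[col]], na.rm = TRUE)
    } else {
      df[[col]][is.na(df[[col]])] <- mode_of(df[[col]])
    }
  }
  df
}

# Example: NA in x becomes the median (2); NA in g becomes the mode ("a")
train_clean <- impute_consistent(data.frame(x = c(1, NA, 3), g = c("a", "a", NA)))
```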
Model Iteration: Random Forest
After establishing linear baselines, I implemented a Random Forest using the ranger package, training a forest with 1,000 trees for efficiency and stability.
The model used a core set of predictors spanning ad quality, content, and delivery context: targeting_score, visual_appeal, cta_strength, position_on_page, ad_format, device_type, and ad_frequency.
```r
library(ranger)
library(dplyr)

# Train Random Forest with ranger
forest_ranger <- ranger(
  CTR ~ targeting_score + visual_appeal + cta_strength +
    position_on_page + ad_format + device_type + ad_frequency,
  data = train_clean,
  num.trees = 1000
)

# Generate predictions on scoring data
pred_rf <- predict(forest_ranger, data = scoring_clean)$predictions

# Attach predictions; replace any remaining NAs with the median
submission <- scoring_data |>
  mutate(CTR = pred_rf) |>
  mutate(CTR = ifelse(is.na(CTR), median(CTR, na.rm = TRUE), CTR))
```
An RMSE of ~0.10 was a meaningful step forward, but the nonlinear structure of CTR data suggested further gains were possible by modeling smooth predictor relationships explicitly.
Final Model: Generalized Additive Models
The key insight motivating GAM was that several predictors, particularly targeting_score and visual_appeal, displayed diminishing returns and nonlinear patterns that linear or tree-based models couldn't fully exploit.
I trained a GAM via mgcv using smoothing splines for continuous predictors with nonlinear relationships, while keeping categorical variables in linear form. The REML method was used for smoothing parameter estimation.
```r
library(mgcv)
library(recipes)

# Fit Generalized Additive Model: smooths for nonlinear numeric
# predictors, linear terms for categorical variables
gam_model <- gam(
  CTR ~ s(targeting_score) + s(visual_appeal) + contextual_relevance +
    s(headline_length) + s(cta_strength) +
    position_on_page + ad_format + device_type,
  data = data_clean,
  method = 'REML'
)

# Predict on scoring dataset
scoring_clean <- bake(data_recipe, new_data = scoring_data)
pred2 <- predict(gam_model, newdata = scoring_clean)

# Build submission file
submission_file <- data.frame(id = scoring_data$id, CTR = pred2)
write.csv(submission_file, 'sample_submission.csv', row.names = FALSE)
```
Why GAM outperformed Random Forest
Smoothing splines in GAM let each predictor "speak for itself": they can model plateaus, accelerating returns, or threshold effects. For advertising CTR, where increasing targeting_score beyond a point may have diminishing returns, this flexibility is directly relevant. Random forests can approximate this, but GAMs make the structure explicit and interpretable.
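A quick sanity check on synthetic data (not the competition data) illustrates the point: a GAM smooth recovers a diminishing-returns, log-shaped response that a straight line cannot follow.

```r
library(mgcv)
set.seed(42)

# Simulated predictor with a saturating (diminishing-returns) response
x <- runif(500, 0, 10)
y <- log1p(x) + rnorm(500, sd = 0.1)
dat <- data.frame(x = x, y = y)

fit_lin <- lm(y ~ x, data = dat)
fit_gam <- gam(y ~ s(x), data = dat, method = "REML")

rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))
rmse(dat$y, predict(fit_lin, dat))  # larger: straight line misses the plateau
rmse(dat$y, predict(fit_gam, dat))  # smaller: the smooth bends with the data
```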
Results & Ranking
Final Leaderboard
| Model | Public RMSE | Private RMSE | Notes |
|---|---|---|---|
| Linear Regression | 0.16284 | 0.14759 | Baseline; all predictors |
| Random Forest | ~0.10 | — | ranger, 1000 trees |
| GAM (final) | 0.07864 | 0.05916 | mgcv, REML, smooth splines |
Key Learnings & Reflection
- 01 Linear models are valuable for establishing baselines, but are often insufficient for capturing real-world behavioral patterns like nonlinear responses and diminishing returns in ad performance.
- 02 Missing data handling can quietly disrupt feature selection and model compatibility if preprocessing is not standardized identically across training and scoring datasets.
- 03 More complex models do not automatically lead to better performance. Meaningful gains require disciplined iteration, diagnostic visualization, and alignment between model assumptions and data structure.
- 04 GAMs provide an underutilized middle ground more interpretable than black-box ensembles while capturing nonlinear patterns that linear models miss entirely.