PAC Report - Jiarong Guo

Simple and Multiple Regression

In the initial submissions, I tried several linear regression models. To find the best-performing predictors, I first fitted modelall, a multiple linear regression with CTR as the target variable and all other variables in the dataset as predictors. The results showed varying levels of significance among the coefficients: 6 were highly significant (***), 1 was moderately significant (**), 3 were mildly significant (*), and 6 were weakly significant (.). These findings provided the basis for four models with different sets of predictors, with the goal of lowering RMSE with each submission. All of these models resulted in an RMSE above 0.1, which is not ideal. Therefore, I decided to use other models to improve prediction accuracy.

data_CTR = read.csv('analysis_data.csv')
#summary(data_CTR)
#names(data_CTR)

modelall <- lm(CTR ~ ., data = data_CTR, na.action = na.exclude)
summary(modelall)

## 
## Call:
## lm(formula = CTR ~ ., data = data_CTR, na.action = na.exclude)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.31752 -0.07831 -0.01120  0.05641  1.76157 
## 
## Coefficients:
##                               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                 -3.130e-02  2.179e-02  -1.437 0.150952    
## id                          -1.361e-06  9.565e-07  -1.423 0.154768    
## targeting_score              2.723e-02  8.823e-04  30.864  < 2e-16 ***
## visual_appeal                2.450e-02  5.898e-04  41.531  < 2e-16 ***
## contextual_relevance         1.472e-02  4.931e-03   2.986 0.002849 ** 
## headline_length             -1.208e-03  1.511e-04  -7.992 1.86e-15 ***
## cta_strength                 1.074e-02  8.889e-04  12.081  < 2e-16 ***
## position_on_pageSide Banner -6.199e-03  6.116e-03  -1.013 0.310922    
## position_on_pageTop Banner   1.082e-02  6.087e-03   1.778 0.075558 .  
## ad_formatText                2.705e-02  1.301e-02   2.080 0.037585 *  
## ad_formatVideo               7.652e-02  7.537e-03  10.154  < 2e-16 ***
## age_group25-34               1.545e-03  6.626e-03   0.233 0.815652    
## age_group35-44              -5.574e-03  7.332e-03  -0.760 0.447175    
## age_group45-54              -4.200e-03  8.159e-03  -0.515 0.606724    
## age_group55-64              -2.588e-02  1.598e-02  -1.620 0.105338    
## age_group65-74               8.585e-03  1.989e-02   0.432 0.666012    
## age_group75-84              -3.506e-02  2.212e-02  -1.585 0.113084    
## age_group85+                 8.346e-02  9.716e-02   0.859 0.390421    
## genderMale                  -3.351e-03  4.979e-03  -0.673 0.501019    
## genderOther                  9.063e-03  1.711e-02   0.530 0.596457    
## locationNortheast           -1.376e-02  7.411e-03  -1.856 0.063546 .  
## locationSouth               -1.369e-02  6.481e-03  -2.112 0.034733 *  
## locationWest                -9.941e-03  7.461e-03  -1.332 0.182845    
## time_of_dayEvening          -7.762e-03  6.691e-03  -1.160 0.246117    
## time_of_dayMorning           1.112e-02  5.974e-03   1.861 0.062816 .  
## time_of_dayNight            -1.168e-02  8.741e-03  -1.336 0.181642    
## day_of_weekMonday            1.756e-02  8.471e-03   2.073 0.038260 *  
## day_of_weekSaturday          7.510e-03  9.702e-03   0.774 0.438956    
## day_of_weekSunday            1.627e-02  9.305e-03   1.748 0.080536 .  
## day_of_weekThursday          9.945e-03  8.545e-03   1.164 0.244533    
## day_of_weekTuesday          -5.276e-03  8.396e-03  -0.628 0.529778    
## day_of_weekWednesday         9.014e-03  8.375e-03   1.076 0.281875    
## brand_familiarity            1.541e-03  8.259e-04   1.865 0.062239 .  
## device_typeMobile            2.006e-02  5.506e-03   3.643 0.000274 ***
## device_typeTablet            1.515e-02  1.820e-02   0.832 0.405356    
## ad_frequency                 1.270e-03  8.635e-04   1.470 0.141532    
## market_saturation           -1.515e-03  8.797e-04  -1.722 0.085163 .  
## seasonality                  1.769e-03  4.923e-03   0.359 0.719301    
## headline_sentiment          -8.609e-04  1.215e-03  -0.709 0.478520    
## headline_word_count         -1.045e-03  1.073e-03  -0.973 0.330382    
## headline_power_words        -3.103e-03  4.920e-03  -0.631 0.528345    
## body_text_length             5.479e-06  6.456e-05   0.085 0.932371    
## body_word_count             -1.004e-04  3.185e-04  -0.315 0.752582    
## body_sentiment               8.921e-04  1.236e-03   0.722 0.470533    
## headline_question            3.300e-03  4.927e-03   0.670 0.503026    
## headline_numbers            -4.313e-03  4.935e-03  -0.874 0.382112    
## body_keyword_density        -1.251e-02  9.618e-02  -0.130 0.896507    
## body_readability_score      -1.405e-05  1.715e-04  -0.082 0.934729    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1365 on 3066 degrees of freedom
##   (886 observations deleted due to missingness)
## Multiple R-squared:  0.5127, Adjusted R-squared:  0.5052 
## F-statistic: 68.63 on 47 and 3066 DF,  p-value: < 2.2e-16
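
The significance tallies quoted in the introduction can be reproduced from the coefficient table instead of being counted by eye. The following is a minimal sketch on simulated data: the data frame `sim` and its columns are hypothetical stand-ins, not the competition data. With the real fit, `summary(modelall)$coefficients` would take the place of `coefs`.

```r
# Simulated stand-in data; names and effect sizes are illustrative assumptions.
set.seed(1)
n <- 500
sim <- data.frame(
  x1 = rnorm(n),
  x2 = rnorm(n),
  x3 = rnorm(n)
)
sim$y <- 0.5 * sim$x1 + 0.1 * sim$x2 + rnorm(n, sd = 0.5)

fit <- lm(y ~ ., data = sim)

# Coefficient matrix with columns Estimate, Std. Error, t value, Pr(>|t|)
coefs <- summary(fit)$coefficients
p_values <- coefs[-1, "Pr(>|t|)"]  # drop the intercept row

# Bin the p-values with the cut points listed in the "Signif. codes" legend
sig_level <- cut(
  p_values,
  breaks = c(0, 0.001, 0.01, 0.05, 0.1, 1),
  labels = c("***", "**", "*", ".", " "),
  include.lowest = TRUE
)
sig_table <- table(sig_level)
print(sig_table)
```

Note that the table counts coefficients, so a factor such as ad_format contributes one entry per non-reference level.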

Model 1

Model 1 focuses on the six predictors that have at least one coefficient significant at the 0.001 level (***) in modelall: targeting_score, visual_appeal, headline_length, cta_strength, ad_format, and device_type. These variables are also theoretically important drivers of user engagement. This model serves as a simple baseline for understanding how these core predictors explain CTR.

model1 <- lm(CTR ~ targeting_score + visual_appeal + headline_length + cta_strength + ad_format + device_type, data = data_CTR)
#summary(model1)

scoring_data = read.csv('scoring_data.csv')
regress1_Pred <- predict(model1, newdata = scoring_data)
#regress1_Pred

submission_file_1 = data.frame(id = scoring_data$id, CTR = regress1_Pred)  # Construct submission from predictions
write.csv(submission_file_1, 'sample_submission_model1.csv', row.names = FALSE)
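
The report compares models by RMSE, but the code above only writes a submission file. One way to estimate RMSE locally before submitting is a simple holdout split. The following is a minimal sketch on simulated data: the data frame `sim`, its columns, and the 70/30 split are assumptions for illustration, not the method used for the reported scores.

```r
# Simulated stand-in data; names and effect sizes are illustrative assumptions.
set.seed(42)
n <- 1000
sim <- data.frame(
  targeting = runif(n, 1, 10),
  appeal    = runif(n, 1, 10)
)
sim$ctr <- 0.02 * sim$targeting + 0.02 * sim$appeal + rnorm(n, sd = 0.1)

# Hold out 30% of the rows for validation
train_idx <- sample(seq_len(n), size = floor(0.7 * n))
train <- sim[train_idx, ]
valid <- sim[-train_idx, ]

fit <- lm(ctr ~ targeting + appeal, data = train)
pred <- predict(fit, newdata = valid)

# Root mean squared error on the held-out rows
rmse <- sqrt(mean((valid$ctr - pred)^2))
print(rmse)
```

With the real data, the same split could be applied to data_CTR so that each candidate formula is scored on rows it was not fitted to.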

Model 2

Model 2 builds on Model 1 by incorporating contextual_relevance, a variable that captures whether the ad aligns with the content it is displayed alongside. Adding this predictor acknowledges the potential influence of context on user engagement. The addition aims to improve the explanatory power of the model, offering a more nuanced understanding of the factors impacting CTR.

model2 <- lm(CTR ~ targeting_score + visual_appeal + headline_length + cta_strength + ad_format + device_type + contextual_relevance, data = data_CTR)
#summary(model2)

scoring_data = read.csv('scoring_data.csv')
regress2_Pred <- predict(model2, newdata = scoring_data)
# regress2_Pred

submission_file_2 = data.frame(id = scoring_data$id, CTR = regress2_Pred)  # Construct submission from predictions
write.csv(submission_file_2, 'sample_submission_model2.csv', row.names = FALSE)
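
Whether one added predictor improves explanatory power can be checked with an F-test between the nested models and with adjusted R-squared. The following is a minimal sketch on simulated data: the data frame `sim` and its columns are hypothetical. Both models must be fitted to the same rows, so with real data containing missing values the rows would need to be aligned first (for example with na.omit).

```r
# Simulated stand-in data; names and effect sizes are illustrative assumptions.
set.seed(7)
n <- 400
sim <- data.frame(
  base_score = rnorm(n),
  context    = rbinom(n, 1, 0.5)
)
sim$ctr <- 0.05 * sim$base_score + 0.05 * sim$context + rnorm(n, sd = 0.1)

small <- lm(ctr ~ base_score, data = sim)
large <- lm(ctr ~ base_score + context, data = sim)

# F-test for the single added term
comparison <- anova(small, large)
print(comparison)

# Adjusted R-squared penalizes the extra parameter
adj_r2 <- c(
  small = summary(small)$adj.r.squared,
  large = summary(large)$adj.r.squared
)
print(adj_r2)
```

A small p-value in the second row of the anova table suggests the added term explains variance beyond what the smaller model captures. It does not by itself guarantee a lower RMSE on new data.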

Model 3

Model 3 extends Model 2 by including regional (location) and temporal (day_of_week) factors. These variables were introduced to capture variations in CTR based on regional and time-based patterns. This model aims to address the potential influence of external factors on user behavior, making it more comprehensive. Results indicated improvements in model performance and predictive accuracy for validation data.

model3 <- lm(CTR ~ targeting_score + visual_appeal + headline_length + cta_strength + ad_format + device_type + contextual_relevance + location + day_of_week, data = data_CTR)
#summary(model3)

scoring_data = read.csv('scoring_data.csv')
regress3_Pred <- predict(model3, newdata = scoring_data)
# regress3_Pred

submission_file_3 = data.frame(id = scoring_data$id, CTR = regress3_Pred) # Construct submission from predictions
write.csv(submission_file_3, 'sample_submission_model3.csv', row.names = FALSE)

Model 4

Model 4 further expands on previous models by adding predictors like position_on_page, time_of_day, brand_familiarity, and market_saturation. These variables were chosen to capture additional dimensions of user interaction, such as the impact of ad placement, familiarity with the brand, and competition. This model attempts to incorporate a broader set of features to explain CTR variability, offering a detailed perspective on user engagement.

model4 <- lm(CTR ~ targeting_score + visual_appeal + headline_length + cta_strength + ad_format + device_type + contextual_relevance + location + day_of_week + position_on_page + time_of_day + brand_familiarity + market_saturation, data = data_CTR)
#summary(model4)

scoring_data = read.csv('scoring_data.csv')
regress4_Pred <- predict(model4, newdata = scoring_data)
# regress4_Pred

submission_file_4 = data.frame(id = scoring_data$id, CTR = regress4_Pred) # Construct submission from predictions
write.csv(submission_file_4, 'sample_submission_model4.csv', row.names = FALSE)
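
Models 1 through 4 repeat the same three steps: fit, predict, and write a submission file. These steps could be wrapped in a small helper. The following is a minimal sketch: the function name `fit_and_export` and the simulated `train` and `scoring` data frames are hypothetical, and the output goes to a temporary file instead of the real submission path.

```r
# Hypothetical helper: fit a linear model, predict on the scoring data,
# and write an id/CTR submission file.
fit_and_export <- function(formula, train, scoring, path) {
  fit <- lm(formula, data = train)
  pred <- predict(fit, newdata = scoring)
  submission <- data.frame(id = scoring$id, CTR = pred)
  write.csv(submission, path, row.names = FALSE)
  invisible(fit)  # return the fitted model without printing it
}

# Simulated stand-ins for the analysis and scoring data
set.seed(3)
train <- data.frame(id = 1:200, a = rnorm(200), b = rnorm(200))
train$CTR <- 0.1 + 0.05 * train$a + rnorm(200, sd = 0.05)
scoring <- data.frame(id = 201:250, a = rnorm(50), b = rnorm(50))

out_path <- tempfile(fileext = ".csv")
fit <- fit_and_export(CTR ~ a + b, train, scoring, out_path)

# Read the file back to confirm its shape
written <- read.csv(out_path)
print(head(written))
```

With such a helper, each model would reduce to one call that differs only in its formula and output file name.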

Model 5: Stepwise Regression

Model 5 uses stepwise regression to automate feature selection. Starting from the full model, step() with direction = "both" drops or re-adds one term at a time, keeping the change that lowers AIC the most, and stops when no single change lowers it further. Rows with missing values are removed first with na.omit() because step() requires every candidate model to be fitted to the same rows. This approach balances model simplicity with predictive performance. The final stepwise model provided a streamlined predictor set, making it a practical choice for CTR prediction.
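
The search described above can be sketched on simulated data. The data frame `sim` and its columns are hypothetical. Setting `trace = 0` is an optional choice that suppresses the per-step tables, and `formula()` retrieves the selected predictor set.

```r
# Simulated stand-in data: one informative predictor and two pure-noise ones.
set.seed(11)
n <- 300
sim <- data.frame(
  signal = rnorm(n),
  noise1 = rnorm(n),
  noise2 = rnorm(n)
)
sim$y <- 0.8 * sim$signal + rnorm(n, sd = 0.5)

full <- lm(y ~ ., data = sim)

# Bidirectional AIC-based search, with the step-by-step printout suppressed
selected <- step(full, direction = "both", trace = 0)

# Selected formula and the AIC improvement over the full model
print(formula(selected))
print(AIC(full) - AIC(selected))
```

Because the search only accepts changes that lower AIC, the selected model never has a higher AIC than the starting model, although it may keep a noise variable by chance.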

data_CTR_na <- na.omit(data_CTR)
modelall <- lm(CTR ~ ., data = data_CTR_na)
Step5 <- step(modelall, direction = "both")

## Start:  AIC=-12357.18
## CTR ~ id + targeting_score + visual_appeal + contextual_relevance + 
##     headline_length + cta_strength + position_on_page + ad_format + 
##     age_group + gender + location + time_of_day + day_of_week + 
##     brand_familiarity + device_type + ad_frequency + market_saturation + 
##     seasonality + headline_sentiment + headline_word_count + 
##     headline_power_words + body_text_length + body_word_count + 
##     body_sentiment + headline_question + headline_numbers + body_keyword_density + 
##     body_readability_score
## 
##                          Df Sum of Sq    RSS    AIC
## - age_group               7     0.137 57.224 -12364
## - gender                  2     0.016 57.102 -12360
## - body_readability_score  1     0.000 57.087 -12359
## - body_text_length        1     0.000 57.087 -12359
## - body_keyword_density    1     0.000 57.087 -12359
## - body_word_count         1     0.002 57.088 -12359
## - day_of_week             6     0.186 57.272 -12359
## - seasonality             1     0.002 57.089 -12359
## - headline_power_words    1     0.007 57.094 -12359
## - headline_question       1     0.008 57.095 -12359
## - headline_sentiment      1     0.009 57.096 -12359
## - body_sentiment          1     0.010 57.096 -12359
## - headline_numbers        1     0.014 57.101 -12358
## - headline_word_count     1     0.018 57.104 -12358
## - location                3     0.098 57.184 -12358
## <none>                                57.087 -12357
## - id                      1     0.038 57.124 -12357
## - ad_frequency            1     0.040 57.127 -12357
## - market_saturation       1     0.055 57.142 -12356
## - brand_familiarity       1     0.065 57.151 -12356
## - time_of_day             3     0.191 57.278 -12353
## - position_on_page        2     0.157 57.243 -12353
## - contextual_relevance    1     0.166 57.253 -12350
## - device_type             2     0.247 57.334 -12348
## - headline_length         1     1.189 58.276 -12295
## - cta_strength            1     2.718 59.804 -12214
## - ad_format               2     2.891 59.977 -12207
## - targeting_score         1    17.737 74.823 -11517
## - visual_appeal           1    32.115 89.201 -10969
## 
## Step:  AIC=-12363.71
## CTR ~ id + targeting_score + visual_appeal + contextual_relevance + 
##     headline_length + cta_strength + position_on_page + ad_format + 
##     gender + location + time_of_day + day_of_week + brand_familiarity + 
##     device_type + ad_frequency + market_saturation + seasonality + 
##     headline_sentiment + headline_word_count + headline_power_words + 
##     body_text_length + body_word_count + body_sentiment + headline_question + 
##     headline_numbers + body_keyword_density + body_readability_score
## 
##                          Df Sum of Sq    RSS    AIC
## - gender                  2     0.016 57.239 -12367
## - day_of_week             6     0.181 57.404 -12366
## - body_text_length        1     0.000 57.224 -12366
## - body_keyword_density    1     0.000 57.224 -12366
## - body_readability_score  1     0.000 57.224 -12366
## - body_word_count         1     0.002 57.226 -12366
## - seasonality             1     0.003 57.227 -12366
## - headline_question       1     0.006 57.230 -12365
## - headline_power_words    1     0.006 57.230 -12365
## - body_sentiment          1     0.010 57.233 -12365
## - headline_sentiment      1     0.011 57.234 -12365
## - headline_word_count     1     0.017 57.240 -12365
## - headline_numbers        1     0.017 57.241 -12365
## - location                3     0.095 57.319 -12365
## - id                      1     0.036 57.259 -12364
## <none>                                57.224 -12364
## - ad_frequency            1     0.045 57.269 -12363
## - market_saturation       1     0.060 57.283 -12362
## - brand_familiarity       1     0.065 57.289 -12362
## - time_of_day             3     0.198 57.422 -12359
## - position_on_page        2     0.162 57.386 -12359
## + age_group               7     0.137 57.087 -12357
## - contextual_relevance    1     0.165 57.389 -12357
## - device_type             2     0.247 57.471 -12354
## - headline_length         1     1.196 58.420 -12301
## - cta_strength            1     2.778 60.002 -12218
## - ad_format               2     2.857 60.080 -12216
## - targeting_score         1    17.707 74.931 -11526
## - visual_appeal           1    32.213 89.437 -10975
## 
## Step:  AIC=-12366.86
## CTR ~ id + targeting_score + visual_appeal + contextual_relevance + 
##     headline_length + cta_strength + position_on_page + ad_format + 
##     location + time_of_day + day_of_week + brand_familiarity + 
##     device_type + ad_frequency + market_saturation + seasonality + 
##     headline_sentiment + headline_word_count + headline_power_words + 
##     body_text_length + body_word_count + body_sentiment + headline_question + 
##     headline_numbers + body_keyword_density + body_readability_score
## 
##                          Df Sum of Sq    RSS    AIC
## - day_of_week             6     0.183 57.422 -12369
## - body_text_length        1     0.000 57.239 -12369
## - body_keyword_density    1     0.000 57.239 -12369
## - body_readability_score  1     0.000 57.239 -12369
## - body_word_count         1     0.002 57.241 -12369
## - seasonality             1     0.003 57.242 -12369
## - headline_power_words    1     0.006 57.245 -12368
## - headline_question       1     0.007 57.246 -12368
## - body_sentiment          1     0.009 57.248 -12368
## - headline_sentiment      1     0.010 57.250 -12368
## - headline_word_count     1     0.017 57.256 -12368
## - headline_numbers        1     0.017 57.256 -12368
## - location                3     0.095 57.334 -12368
## - id                      1     0.035 57.275 -12367
## <none>                                57.239 -12367
## - ad_frequency            1     0.046 57.285 -12366
## - market_saturation       1     0.060 57.300 -12366
## - brand_familiarity       1     0.065 57.304 -12365
## + gender                  2     0.016 57.224 -12364
## - position_on_page        2     0.162 57.401 -12362
## - time_of_day             3     0.199 57.439 -12362
## + age_group               7     0.137 57.102 -12360
## - contextual_relevance    1     0.164 57.403 -12360
## - device_type             2     0.248 57.488 -12357
## - headline_length         1     1.197 58.436 -12304
## - cta_strength            1     2.782 60.021 -12221
## - ad_format               2     2.858 60.098 -12219
## - targeting_score         1    17.711 74.950 -11529
## - visual_appeal           1    32.212 89.452 -10979
## 
## Step:  AIC=-12368.93
## CTR ~ id + targeting_score + visual_appeal + contextual_relevance + 
##     headline_length + cta_strength + position_on_page + ad_format + 
##     location + time_of_day + brand_familiarity + device_type + 
##     ad_frequency + market_saturation + seasonality + headline_sentiment + 
##     headline_word_count + headline_power_words + body_text_length + 
##     body_word_count + body_sentiment + headline_question + headline_numbers + 
##     body_keyword_density + body_readability_score
## 
##                          Df Sum of Sq    RSS    AIC
## - body_text_length        1     0.000 57.422 -12371
## - body_readability_score  1     0.000 57.422 -12371
## - body_keyword_density    1     0.000 57.422 -12371
## - body_word_count         1     0.001 57.423 -12371
## - seasonality             1     0.003 57.425 -12371
## - headline_question       1     0.006 57.428 -12371
## - headline_power_words    1     0.006 57.428 -12371
## - body_sentiment          1     0.010 57.433 -12370
## - headline_sentiment      1     0.012 57.434 -12370
## - location                3     0.086 57.508 -12370
## - headline_word_count     1     0.012 57.434 -12370
## - headline_numbers        1     0.018 57.440 -12370
## - id                      1     0.037 57.459 -12369
## <none>                                57.422 -12369
## - ad_frequency            1     0.045 57.467 -12368
## - market_saturation       1     0.057 57.479 -12368
## - brand_familiarity       1     0.058 57.480 -12368
## + day_of_week             6     0.183 57.239 -12367
## + gender                  2     0.018 57.404 -12366
## - position_on_page        2     0.156 57.578 -12364
## - time_of_day             3     0.210 57.632 -12364
## + age_group               7     0.132 57.290 -12362
## - contextual_relevance    1     0.168 57.590 -12362
## - device_type             2     0.253 57.675 -12359
## - headline_length         1     1.192 58.614 -12307
## - ad_format               2     2.822 60.244 -12224
## - cta_strength            1     2.805 60.227 -12222
## - targeting_score         1    17.826 75.248 -11529
## - visual_appeal           1    32.336 89.758 -10980
## 
## Step:  AIC=-12370.93
## CTR ~ id + targeting_score + visual_appeal + contextual_relevance + 
##     headline_length + cta_strength + position_on_page + ad_format + 
##     location + time_of_day + brand_familiarity + device_type + 
##     ad_frequency + market_saturation + seasonality + headline_sentiment + 
##     headline_word_count + headline_power_words + body_word_count + 
##     body_sentiment + headline_question + headline_numbers + body_keyword_density + 
##     body_readability_score
## 
##                          Df Sum of Sq    RSS    AIC
## - body_readability_score  1     0.000 57.422 -12373
## - body_keyword_density    1     0.000 57.422 -12373
## - body_word_count         1     0.001 57.423 -12373
## - seasonality             1     0.003 57.425 -12373
## - headline_question       1     0.006 57.428 -12373
## - headline_power_words    1     0.006 57.428 -12373
## - body_sentiment          1     0.010 57.433 -12372
## - headline_sentiment      1     0.012 57.434 -12372
## - location                3     0.086 57.508 -12372
## - headline_word_count     1     0.012 57.434 -12372
## - headline_numbers        1     0.018 57.440 -12372
## - id                      1     0.037 57.459 -12371
## <none>                                57.422 -12371
## - ad_frequency            1     0.045 57.467 -12370
## - market_saturation       1     0.057 57.479 -12370
## - brand_familiarity       1     0.058 57.480 -12370
## + body_text_length        1     0.000 57.422 -12369
## + day_of_week             6     0.183 57.239 -12369
## + gender                  2     0.018 57.404 -12368
## - position_on_page        2     0.156 57.579 -12366
## - time_of_day             3     0.210 57.633 -12366
## + age_group               7     0.132 57.290 -12364
## - contextual_relevance    1     0.168 57.590 -12364
## - device_type             2     0.253 57.675 -12361
## - headline_length         1     1.192 58.615 -12309
## - ad_format               2     2.822 60.244 -12226
## - cta_strength            1     2.805 60.227 -12224
## - targeting_score         1    17.841 75.263 -11530
## - visual_appeal           1    32.349 89.771 -10982
## 
## Step:  AIC=-12372.92
## CTR ~ id + targeting_score + visual_appeal + contextual_relevance + 
##     headline_length + cta_strength + position_on_page + ad_format + 
##     location + time_of_day + brand_familiarity + device_type + 
##     ad_frequency + market_saturation + seasonality + headline_sentiment + 
##     headline_word_count + headline_power_words + body_word_count + 
##     body_sentiment + headline_question + headline_numbers + body_keyword_density
## 
##                          Df Sum of Sq    RSS    AIC
## - body_keyword_density    1     0.000 57.423 -12375
## - body_word_count         1     0.001 57.424 -12375
## - seasonality             1     0.003 57.425 -12375
## - headline_question       1     0.006 57.429 -12375
## - headline_power_words    1     0.006 57.429 -12375
## - body_sentiment          1     0.010 57.433 -12374
## - headline_sentiment      1     0.012 57.434 -12374
## - headline_word_count     1     0.012 57.435 -12374
## - location                3     0.086 57.508 -12374
## - headline_numbers        1     0.018 57.440 -12374
## - id                      1     0.037 57.459 -12373
## <none>                                57.422 -12373
## - ad_frequency            1     0.045 57.467 -12372
## - market_saturation       1     0.057 57.480 -12372
## - brand_familiarity       1     0.058 57.480 -12372
## + body_readability_score  1     0.000 57.422 -12371
## + body_text_length        1     0.000 57.422 -12371
## + day_of_week             6     0.183 57.239 -12371
## + gender                  2     0.018 57.405 -12370
## - position_on_page        2     0.156 57.579 -12368
## - time_of_day             3     0.211 57.633 -12368
## + age_group               7     0.132 57.290 -12366
## - contextual_relevance    1     0.168 57.591 -12366
## - device_type             2     0.253 57.675 -12363
## - headline_length         1     1.192 58.615 -12311
## - ad_format               2     2.823 60.245 -12228
## - cta_strength            1     2.805 60.227 -12226
## - targeting_score         1    17.842 75.264 -11532
## - visual_appeal           1    32.349 89.771 -10984
## 
## Step:  AIC=-12374.91
## CTR ~ id + targeting_score + visual_appeal + contextual_relevance + 
##     headline_length + cta_strength + position_on_page + ad_format + 
##     location + time_of_day + brand_familiarity + device_type + 
##     ad_frequency + market_saturation + seasonality + headline_sentiment + 
##     headline_word_count + headline_power_words + body_word_count + 
##     body_sentiment + headline_question + headline_numbers
## 
##                          Df Sum of Sq    RSS    AIC
## - body_word_count         1     0.001 57.424 -12377
## - seasonality             1     0.003 57.425 -12377
## - headline_power_words    1     0.006 57.429 -12377
## - headline_question       1     0.006 57.429 -12377
## - body_sentiment          1     0.011 57.433 -12376
## - headline_sentiment      1     0.012 57.434 -12376
## - location                3     0.086 57.509 -12376
## - headline_word_count     1     0.012 57.435 -12376
## - headline_numbers        1     0.018 57.441 -12376
## - id                      1     0.037 57.459 -12375
## <none>                                57.423 -12375
## - ad_frequency            1     0.045 57.468 -12374
## - market_saturation       1     0.057 57.480 -12374
## - brand_familiarity       1     0.058 57.481 -12374
## + body_keyword_density    1     0.000 57.422 -12373
## + body_readability_score  1     0.000 57.422 -12373
## + body_text_length        1     0.000 57.423 -12373
## + day_of_week             6     0.183 57.240 -12373
## + gender                  2     0.018 57.405 -12372
## - position_on_page        2     0.157 57.579 -12370
## - time_of_day             3     0.210 57.633 -12370
## + age_group               7     0.131 57.291 -12368
## - contextual_relevance    1     0.169 57.591 -12368
## - device_type             2     0.254 57.676 -12365
## - headline_length         1     1.193 58.615 -12313
## - ad_format               2     2.823 60.245 -12230
## - cta_strength            1     2.805 60.228 -12228
## - targeting_score         1    17.842 75.264 -11534
## - visual_appeal           1    32.356 89.778 -10985
## 
## Step:  AIC=-12376.84
## CTR ~ id + targeting_score + visual_appeal + contextual_relevance + 
##     headline_length + cta_strength + position_on_page + ad_format + 
##     location + time_of_day + brand_familiarity + device_type + 
##     ad_frequency + market_saturation + seasonality + headline_sentiment + 
##     headline_word_count + headline_power_words + body_sentiment + 
##     headline_question + headline_numbers
## 
##                          Df Sum of Sq    RSS    AIC
## - seasonality             1     0.003 57.427 -12379
## - headline_power_words    1     0.006 57.430 -12378
## - headline_question       1     0.006 57.430 -12378
## - body_sentiment          1     0.010 57.434 -12378
## - headline_sentiment      1     0.012 57.435 -12378
## - location                3     0.086 57.509 -12378
## - headline_word_count     1     0.012 57.436 -12378
## - headline_numbers        1     0.019 57.442 -12378
## - id                      1     0.037 57.461 -12377
## <none>                                57.424 -12377
## - ad_frequency            1     0.045 57.469 -12376
## - market_saturation       1     0.057 57.481 -12376
## - brand_familiarity       1     0.058 57.482 -12376
## + body_word_count         1     0.001 57.423 -12375
## + body_keyword_density    1     0.000 57.424 -12375
## + body_readability_score  1     0.000 57.424 -12375
## + body_text_length        1     0.000 57.424 -12375
## + day_of_week             6     0.182 57.241 -12375
## + gender                  2     0.018 57.406 -12374
## - position_on_page        2     0.157 57.581 -12372
## - time_of_day             3     0.210 57.634 -12372
## + age_group               7     0.131 57.293 -12370
## - contextual_relevance    1     0.169 57.593 -12370
## - device_type             2     0.253 57.677 -12367
## - headline_length         1     1.196 58.620 -12315
## - ad_format               2     2.826 60.250 -12231
## - cta_strength            1     2.804 60.228 -12230
## - targeting_score         1    17.841 75.265 -11536
## - visual_appeal           1    32.365 89.789 -10987
## 
## Step:  AIC=-12378.69
## CTR ~ id + targeting_score + visual_appeal + contextual_relevance + 
##     headline_length + cta_strength + position_on_page + ad_format + 
##     location + time_of_day + brand_familiarity + device_type + 
##     ad_frequency + market_saturation + headline_sentiment + headline_word_count + 
##     headline_power_words + body_sentiment + headline_question + 
##     headline_numbers
## 
##                          Df Sum of Sq    RSS    AIC
## - headline_question       1     0.006 57.433 -12380
## - headline_power_words    1     0.006 57.433 -12380
## - body_sentiment          1     0.010 57.437 -12380
## - headline_sentiment      1     0.012 57.438 -12380
## - location                3     0.086 57.512 -12380
## - headline_word_count     1     0.012 57.439 -12380
## - headline_numbers        1     0.019 57.445 -12380
## - id                      1     0.037 57.463 -12379
## <none>                                57.427 -12379
## - ad_frequency            1     0.045 57.472 -12378
## - market_saturation       1     0.058 57.484 -12378
## - brand_familiarity       1     0.058 57.484 -12378
## + seasonality             1     0.003 57.424 -12377
## + body_word_count         1     0.001 57.425 -12377
## + body_keyword_density    1     0.000 57.426 -12377
## + body_readability_score  1     0.000 57.426 -12377
## + body_text_length        1     0.000 57.427 -12377
## + day_of_week             6     0.182 57.244 -12377
## + gender                  2     0.018 57.409 -12376
## - position_on_page        2     0.156 57.582 -12374
## - time_of_day             3     0.211 57.637 -12373
## + age_group               7     0.132 57.295 -12372
## - contextual_relevance    1     0.170 57.596 -12372
## - device_type             2     0.252 57.679 -12369
## - headline_length         1     1.194 58.621 -12317
## - ad_format               2     2.824 60.251 -12233
## - cta_strength            1     2.804 60.230 -12232
## - targeting_score         1    17.839 75.265 -11538
## - visual_appeal           1    32.376 89.802 -10988
## 
## Step:  AIC=-12380.35
## CTR ~ id + targeting_score + visual_appeal + contextual_relevance + 
##     headline_length + cta_strength + position_on_page + ad_format + 
##     location + time_of_day + brand_familiarity + device_type + 
##     ad_frequency + market_saturation + headline_sentiment + headline_word_count + 
##     headline_power_words + body_sentiment + headline_numbers
## 
##                          Df Sum of Sq    RSS    AIC
## - headline_power_words    1     0.006 57.439 -12382
## - body_sentiment          1     0.010 57.443 -12382
## - headline_sentiment      1     0.012 57.445 -12382
## - location                3     0.087 57.519 -12382
## - headline_word_count     1     0.013 57.446 -12382
## - headline_numbers        1     0.019 57.452 -12381
## - id                      1     0.036 57.469 -12380
## <none>                                57.433 -12380
## - ad_frequency            1     0.045 57.478 -12380
## - market_saturation       1     0.056 57.489 -12379
## - brand_familiarity       1     0.059 57.492 -12379
## + headline_question       1     0.006 57.427 -12379
## + seasonality             1     0.003 57.430 -12378
## + body_word_count         1     0.001 57.432 -12378
## + body_keyword_density    1     0.000 57.432 -12378
## + body_readability_score  1     0.000 57.433 -12378
## + body_text_length        1     0.000 57.433 -12378
## + day_of_week             6     0.182 57.251 -12378
## + gender                  2     0.018 57.415 -12377
## - position_on_page        2     0.155 57.588 -12376
## - time_of_day             3     0.209 57.642 -12375
## + age_group               7     0.130 57.303 -12373
## - contextual_relevance    1     0.168 57.601 -12373
## - device_type             2     0.253 57.686 -12371
## - headline_length         1     1.192 58.625 -12318
## - ad_format               2     2.822 60.255 -12235
## - cta_strength            1     2.803 60.236 -12234
## - targeting_score         1    17.842 75.275 -11540
## - visual_appeal           1    32.373 89.806 -10990
## 
## Step:  AIC=-12382
## CTR ~ id + targeting_score + visual_appeal + contextual_relevance + 
##     headline_length + cta_strength + position_on_page + ad_format + 
##     location + time_of_day + brand_familiarity + device_type + 
##     ad_frequency + market_saturation + headline_sentiment + headline_word_count + 
##     body_sentiment + headline_numbers
## 
##                          Df Sum of Sq    RSS    AIC
## - body_sentiment          1     0.010 57.449 -12383
## - headline_sentiment      1     0.012 57.452 -12383
## - headline_word_count     1     0.013 57.452 -12383
## - location                3     0.087 57.527 -12383
## - headline_numbers        1     0.019 57.458 -12383
## - id                      1     0.036 57.475 -12382
## <none>                                57.439 -12382
## - ad_frequency            1     0.045 57.484 -12382
## - market_saturation       1     0.057 57.496 -12381
## - brand_familiarity       1     0.058 57.497 -12381
## + headline_power_words    1     0.006 57.433 -12380
## + headline_question       1     0.006 57.433 -12380
## + seasonality             1     0.003 57.436 -12380
## + body_word_count         1     0.001 57.438 -12380
## + body_keyword_density    1     0.000 57.439 -12380
## + body_readability_score  1     0.000 57.439 -12380
## + body_text_length        1     0.000 57.439 -12380
## + day_of_week             6     0.182 57.257 -12380
## + gender                  2     0.018 57.421 -12379
## - position_on_page        2     0.155 57.594 -12378
## - time_of_day             3     0.211 57.650 -12377
## + age_group               7     0.128 57.311 -12375
## - contextual_relevance    1     0.170 57.609 -12375
## - device_type             2     0.252 57.692 -12372
## - headline_length         1     1.193 58.632 -12320
## - ad_format               2     2.830 60.270 -12236
## - cta_strength            1     2.799 60.238 -12236
## - targeting_score         1    17.871 75.310 -11540
## - visual_appeal           1    32.368 89.807 -10992
## 
## Step:  AIC=-12383.45
## CTR ~ id + targeting_score + visual_appeal + contextual_relevance + 
##     headline_length + cta_strength + position_on_page + ad_format + 
##     location + time_of_day + brand_familiarity + device_type + 
##     ad_frequency + market_saturation + headline_sentiment + headline_word_count + 
##     headline_numbers
## 
##                          Df Sum of Sq    RSS    AIC
## - headline_sentiment      1     0.013 57.462 -12385
## - headline_word_count     1     0.013 57.462 -12385
## - location                3     0.087 57.537 -12385
## - headline_numbers        1     0.020 57.469 -12384
## - id                      1     0.037 57.486 -12384
## <none>                                57.449 -12383
## - ad_frequency            1     0.046 57.495 -12383
## - market_saturation       1     0.057 57.506 -12382
## - brand_familiarity       1     0.058 57.507 -12382
## + body_sentiment          1     0.010 57.439 -12382
## + headline_power_words    1     0.006 57.443 -12382
## + headline_question       1     0.006 57.443 -12382
## + seasonality             1     0.003 57.447 -12382
## + body_word_count         1     0.001 57.448 -12382
## + body_keyword_density    1     0.001 57.449 -12382
## + body_readability_score  1     0.000 57.449 -12382
## + body_text_length        1     0.000 57.449 -12381
## + day_of_week             6     0.184 57.266 -12381
## + gender                  2     0.018 57.432 -12380
## - position_on_page        2     0.154 57.604 -12379
## - time_of_day             3     0.210 57.660 -12378
## + age_group               7     0.128 57.322 -12376
## - contextual_relevance    1     0.170 57.619 -12376
## - device_type             2     0.251 57.701 -12374
## - headline_length         1     1.197 58.647 -12321
## - ad_format               2     2.824 60.273 -12238
## - cta_strength            1     2.811 60.260 -12237
## - targeting_score         1    17.863 75.312 -11542
## - visual_appeal           1    32.368 89.817 -10994
## 
## Step:  AIC=-12384.76
## CTR ~ id + targeting_score + visual_appeal + contextual_relevance + 
##     headline_length + cta_strength + position_on_page + ad_format + 
##     location + time_of_day + brand_familiarity + device_type + 
##     ad_frequency + market_saturation + headline_word_count + 
##     headline_numbers
## 
##                          Df Sum of Sq    RSS    AIC
## - location                3     0.086 57.548 -12386
## - headline_word_count     1     0.012 57.474 -12386
## - headline_numbers        1     0.019 57.481 -12386
## - id                      1     0.036 57.498 -12385
## <none>                                57.462 -12385
## - ad_frequency            1     0.045 57.507 -12384
## - market_saturation       1     0.059 57.521 -12384
## - brand_familiarity       1     0.059 57.521 -12384
## + headline_sentiment      1     0.013 57.449 -12383
## + body_sentiment          1     0.011 57.452 -12383
## + headline_question       1     0.007 57.455 -12383
## + headline_power_words    1     0.006 57.456 -12383
## + seasonality             1     0.003 57.459 -12383
## + body_word_count         1     0.001 57.461 -12383
## + day_of_week             6     0.185 57.277 -12383
## + body_keyword_density    1     0.001 57.461 -12383
## + body_readability_score  1     0.000 57.462 -12383
## + body_text_length        1     0.000 57.462 -12383
## + gender                  2     0.017 57.445 -12382
## - position_on_page        2     0.154 57.616 -12380
## - time_of_day             3     0.209 57.671 -12380
## + age_group               7     0.129 57.333 -12378
## - contextual_relevance    1     0.171 57.633 -12378
## - device_type             2     0.251 57.713 -12375
## - headline_length         1     1.196 58.658 -12323
## - ad_format               2     2.828 60.290 -12239
## - cta_strength            1     2.812 60.274 -12238
## - targeting_score         1    17.933 75.395 -11541
## - visual_appeal           1    32.440 89.902 -10993
## 
## Step:  AIC=-12386.13
## CTR ~ id + targeting_score + visual_appeal + contextual_relevance + 
##     headline_length + cta_strength + position_on_page + ad_format + 
##     time_of_day + brand_familiarity + device_type + ad_frequency + 
##     market_saturation + headline_word_count + headline_numbers
## 
##                          Df Sum of Sq    RSS    AIC
## - headline_word_count     1     0.013 57.561 -12387
## - headline_numbers        1     0.020 57.567 -12387
## - id                      1     0.035 57.583 -12386
## <none>                                57.548 -12386
## - ad_frequency            1     0.046 57.593 -12386
## - brand_familiarity       1     0.057 57.605 -12385
## - market_saturation       1     0.058 57.606 -12385
## + location                3     0.086 57.462 -12385
## + headline_sentiment      1     0.011 57.537 -12385
## + body_sentiment          1     0.011 57.537 -12385
## + headline_question       1     0.008 57.540 -12384
## + headline_power_words    1     0.007 57.541 -12384
## + seasonality             1     0.003 57.545 -12384
## + body_word_count         1     0.000 57.547 -12384
## + body_keyword_density    1     0.000 57.547 -12384
## + body_readability_score  1     0.000 57.547 -12384
## + body_text_length        1     0.000 57.548 -12384
## + day_of_week             6     0.176 57.372 -12384
## + gender                  2     0.017 57.531 -12383
## - position_on_page        2     0.150 57.697 -12382
## - time_of_day             3     0.208 57.755 -12381
## + age_group               7     0.126 57.421 -12379
## - contextual_relevance    1     0.172 57.720 -12379
## - device_type             2     0.244 57.792 -12377
## - headline_length         1     1.198 58.746 -12324
## - ad_format               2     2.827 60.375 -12241
## - cta_strength            1     2.831 60.379 -12239
## - targeting_score         1    17.954 75.502 -11542
## - visual_appeal           1    32.625 90.172 -10990
## 
## Step:  AIC=-12387.43
## CTR ~ id + targeting_score + visual_appeal + contextual_relevance + 
##     headline_length + cta_strength + position_on_page + ad_format + 
##     time_of_day + brand_familiarity + device_type + ad_frequency + 
##     market_saturation + headline_numbers
## 
##                          Df Sum of Sq    RSS    AIC
## - headline_numbers        1     0.018 57.579 -12388
## - id                      1     0.034 57.595 -12388
## <none>                                57.561 -12387
## - ad_frequency            1     0.045 57.606 -12387
## - brand_familiarity       1     0.055 57.616 -12386
## - market_saturation       1     0.057 57.618 -12386
## + headline_word_count     1     0.013 57.548 -12386
## + location                3     0.086 57.474 -12386
## + headline_sentiment      1     0.010 57.550 -12386
## + body_sentiment          1     0.010 57.551 -12386
## + headline_question       1     0.009 57.552 -12386
## + headline_power_words    1     0.007 57.553 -12386
## + seasonality             1     0.003 57.558 -12386
## + body_keyword_density    1     0.001 57.560 -12386
## + body_word_count         1     0.000 57.560 -12386
## + body_readability_score  1     0.000 57.560 -12385
## + body_text_length        1     0.000 57.561 -12385
## + day_of_week             6     0.172 57.389 -12385
## + gender                  2     0.017 57.544 -12384
## - position_on_page        2     0.150 57.710 -12383
## - time_of_day             3     0.208 57.769 -12382
## + age_group               7     0.125 57.436 -12380
## - contextual_relevance    1     0.173 57.734 -12380
## - device_type             2     0.242 57.803 -12378
## - headline_length         1     1.194 58.754 -12326
## - ad_format               2     2.832 60.393 -12242
## - cta_strength            1     2.832 60.392 -12240
## - targeting_score         1    17.974 75.534 -11543
## - visual_appeal           1    32.618 90.179 -10991
## 
## Step:  AIC=-12388.45
## CTR ~ id + targeting_score + visual_appeal + contextual_relevance + 
##     headline_length + cta_strength + position_on_page + ad_format + 
##     time_of_day + brand_familiarity + device_type + ad_frequency + 
##     market_saturation
## 
##                          Df Sum of Sq    RSS    AIC
## - id                      1     0.033 57.612 -12389
## <none>                                57.579 -12388
## - ad_frequency            1     0.045 57.624 -12388
## - brand_familiarity       1     0.055 57.634 -12388
## + headline_numbers        1     0.018 57.561 -12387
## - market_saturation       1     0.058 57.636 -12387
## + location                3     0.087 57.492 -12387
## + headline_word_count     1     0.011 57.567 -12387
## + body_sentiment          1     0.011 57.568 -12387
## + headline_sentiment      1     0.010 57.569 -12387
## + headline_question       1     0.009 57.570 -12387
## + headline_power_words    1     0.007 57.572 -12387
## + seasonality             1     0.003 57.576 -12387
## + body_keyword_density    1     0.001 57.578 -12386
## + body_word_count         1     0.001 57.578 -12386
## + body_readability_score  1     0.000 57.578 -12386
## + body_text_length        1     0.000 57.579 -12386
## + day_of_week             6     0.173 57.406 -12386
## + gender                  2     0.016 57.562 -12385
## - position_on_page        2     0.147 57.726 -12384
## - time_of_day             3     0.205 57.784 -12383
## + age_group               7     0.128 57.451 -12381
## - contextual_relevance    1     0.176 57.755 -12381
## - device_type             2     0.241 57.820 -12379
## - headline_length         1     1.199 58.777 -12326
## - ad_format               2     2.836 60.415 -12243
## - cta_strength            1     2.828 60.407 -12241
## - targeting_score         1    17.960 75.539 -11545
## - visual_appeal           1    32.603 90.182 -10993
## 
## Step:  AIC=-12388.66
## CTR ~ targeting_score + visual_appeal + contextual_relevance + 
##     headline_length + cta_strength + position_on_page + ad_format + 
##     time_of_day + brand_familiarity + device_type + ad_frequency + 
##     market_saturation
## 
##                          Df Sum of Sq    RSS    AIC
## <none>                                57.612 -12389
## + id                      1     0.033 57.579 -12388
## - ad_frequency            1     0.047 57.659 -12388
## - brand_familiarity       1     0.055 57.667 -12388
## + headline_numbers        1     0.017 57.595 -12388
## - market_saturation       1     0.058 57.669 -12388
## + location                3     0.086 57.526 -12387
## + body_sentiment          1     0.011 57.600 -12387
## + headline_word_count     1     0.010 57.601 -12387
## + headline_sentiment      1     0.010 57.602 -12387
## + headline_question       1     0.008 57.603 -12387
## + headline_power_words    1     0.007 57.605 -12387
## + seasonality             1     0.003 57.609 -12387
## + body_keyword_density    1     0.001 57.611 -12387
## + body_word_count         1     0.001 57.611 -12387
## + body_readability_score  1     0.001 57.611 -12387
## + body_text_length        1     0.000 57.612 -12387
## + day_of_week             6     0.174 57.437 -12386
## + gender                  2     0.016 57.596 -12386
## - position_on_page        2     0.146 57.758 -12385
## - time_of_day             3     0.201 57.812 -12384
## + age_group               7     0.126 57.486 -12382
## - contextual_relevance    1     0.174 57.785 -12381
## - device_type             2     0.237 57.848 -12380
## - headline_length         1     1.191 58.803 -12327
## - ad_format               2     2.838 60.450 -12243
## - cta_strength            1     2.836 60.447 -12241
## - targeting_score         1    17.967 75.579 -11545
## - visual_appeal           1    32.647 90.259 -10993

#summary(Step5)
Step5_pred <- predict(Step5, newdata = scoring_data)
submission_file_5 = data.frame(id = scoring_data$id, CTR = Step5_pred) # Construct submission from predictions
write.csv(submission_file_5, 'sample_submission_model5.csv',row.names = F)
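
The AIC column in the stepwise trace above is derived from the RSS column. For a linear model, extractAIC (which step and stepAIC rely on) reports n * log(RSS / n) + 2 * edf, where edf is the number of estimated coefficients. The sketch below checks that relationship on the built-in mtcars data, which is used here only as an assumption for illustration and is not the CTR dataset.

```r
# Illustration on the built-in mtcars data (not the CTR dataset).
fit <- lm(mpg ~ wt + hp, data = mtcars)

n   <- nrow(mtcars)               # number of observations
rss <- sum(residuals(fit)^2)      # residual sum of squares (the RSS column)
edf <- length(coef(fit))          # estimated coefficients, including the intercept

aic_manual <- n * log(rss / n) + 2 * edf
aic_step   <- extractAIC(fit)[2]  # the value step()/stepAIC() compare

all.equal(aic_manual, aic_step)
```

Because the penalty is 2 per coefficient, dropping a term only lowers the AIC when the resulting increase in RSS is small, which is why the weak predictors leave the model first.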

Random Forest Model

After exploring simple and multiple regression, I moved on to Random Forest models. During data exploration, I checked for potential missing values in the dataset. After generating predictions, I used mutate and across from dplyr to replace any missing numeric values in the submission with the column median, so that the submission file was complete and met the required format.

I used the ranger package, which is faster than the standard randomForest package, to build a Random Forest model with 1,000 trees. The model used the 12 predictors retained by the stepwise selection above, such as targeting_score, visual_appeal, and cta_strength. By calling predict on the scoring data, I generated CTR predictions, then extracted the predicted values and addressed any missing values. This process taught me the importance of efficient modeling techniques and careful data handling in producing reliable predictions suitable for submission. The model produced a public RMSE of 0.1.

# install.packages("tidyr")
library(randomForest)

## Warning: package 'randomForest' was built under R version 4.3.3

## randomForest 4.7-1.2

## Type rfNews() to see new features/changes/bug fixes.

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following object is masked from 'package:randomForest':
## 
##     combine

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(caret)

## Loading required package: ggplot2

## Warning: package 'ggplot2' was built under R version 4.3.2

## 
## Attaching package: 'ggplot2'

## The following object is masked from 'package:randomForest':
## 
##     margin

## Loading required package: lattice

library(tidyr)

## Warning: package 'tidyr' was built under R version 4.3.2

# Load the dataset
data_CTR <- read.csv("analysis_data.csv")  # Training data; CTR is the target variable
scoring_data <- read.csv("scoring_data.csv")  # Kaggle scoring data without CTR

## Forest with Ranger ######
library(ranger)

## Warning: package 'ranger' was built under R version 4.3.3

## 
## Attaching package: 'ranger'

## The following object is masked from 'package:randomForest':
## 
##     importance

set.seed(1031)
forest_ranger = ranger(CTR~targeting_score + visual_appeal + contextual_relevance + headline_length + cta_strength + position_on_page + ad_format + time_of_day + brand_familiarity + device_type + ad_frequency + market_saturation,data = data_CTR, num.trees = 1000)
pred = predict(forest_ranger, data = scoring_data, num.trees = 1000)

submission_file_n2 = data.frame(id = scoring_data$id, CTR = pred$predictions) # predict() returns a ranger object; the values are in $predictions

submission_file_n21 <- submission_file_n2 %>%
  mutate(across(where(is.numeric), ~ ifelse(is.na(.), median(., na.rm = TRUE), .))) 
write.csv(submission_file_n21, 'sample_submission_model_n21.csv',row.names = F)

Random Forest Model Continued

To prepare for modeling, I defined a cross-validation setup and created a parameter grid for tuning key Random Forest hyperparameters. These included mtry (the number of predictors randomly sampled as split candidates at each node), splitrule (the splitting criterion), and min.node.size (the minimum size of terminal nodes). Setting up this grid taught me the importance of systematically exploring parameter combinations to enhance model performance.
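
To give a sense of the cost of this search, the sketch below rebuilds the same grid with base R and counts the model fits implied by 5-fold cross-validation. The variable names n_combos, n_folds, and n_fits are introduced here for illustration only.

```r
# Same grid as in the tuning code: every combination of the three hyperparameters.
tuneGrid <- expand.grid(mtry = 1:4,
                        splitrule = c("variance", "extratrees", "maxstat"),
                        min.node.size = c(2, 5, 10, 15, 20, 25))

n_combos <- nrow(tuneGrid)      # 4 * 3 * 6 = 72 candidate settings
n_folds  <- 5                   # folds used in the cross-validation
n_fits   <- n_combos * n_folds  # 360 forests fitted during tuning

head(tuneGrid, 3)
c(combinations = n_combos, fits = n_fits)
```

With 1,000 trees per forest, the search grows quickly as values are added to any one hyperparameter, so a coarse grid is a reasonable starting point.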

Using the caret package, I performed 5-fold cross-validation to identify the best hyperparameters for the Random Forest model. The final model was trained with the ranger package, leveraging the optimal parameters (mtry, splitrule, and min.node.size) identified during tuning. This process demonstrated the value of cross-validation in improving predictive accuracy and avoiding overfitting.
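
The idea behind 5-fold cross-validation can be sketched by hand in base R. The example below is only an illustration under stated assumptions: it uses the built-in mtcars data and a linear model rather than the CTR data and ranger, and the rmse helper is defined here rather than taken from caret.

```r
# Manual 5-fold cross-validation of a linear model on mtcars (illustrative data).
set.seed(617)
k <- 5
folds <- sample(rep(1:k, length.out = nrow(mtcars)))  # random fold label per row

rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

fold_rmse <- sapply(1:k, function(i) {
  train <- mtcars[folds != i, ]   # fit on the other four folds
  test  <- mtcars[folds == i, ]   # evaluate on the held-out fold
  fit   <- lm(mpg ~ wt + hp, data = train)
  rmse(test$mpg, predict(fit, newdata = test))
})

cv_rmse <- mean(fold_rmse)  # average held-out RMSE across the folds
cv_rmse
```

caret repeats this kind of loop for each row of the tuning grid and, for regression, keeps the setting with the lowest average RMSE by default.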

I used the optimized model to predict CTR values for the scoring dataset. After prediction, I replaced any missing predictions with the median so that the submission file had no gaps. The final predictions were combined with the scoring dataset IDs and saved in the required format. This model achieved a public RMSE of 0.095, an improvement over the previous models, all of which scored 0.1 or higher.

# install.packages("tidyr")
library(randomForest)
library(dplyr)
library(caret)
library(tidyr)

# Load the dataset
data_CTR <- read.csv("analysis_data.csv")  # Training data; CTR is the target variable
scoring_data <- read.csv("scoring_data.csv")  # Kaggle scoring data without CTR

data_CTR <- na.omit(data_CTR)

## different values of mtry, splitrule and min.node.size
trControl=trainControl(method="cv",number=5)
tuneGrid = expand.grid(mtry=1:4, 
                       splitrule = c('variance','extratrees','maxstat'), 
                       min.node.size = c(2,5,10,15,20,25))
set.seed(617)
cvModel = train(CTR~targeting_score + visual_appeal + contextual_relevance + headline_length + cta_strength + position_on_page + ad_format + time_of_day + brand_familiarity + device_type + ad_frequency + market_saturation,data = data_CTR,
                method="ranger",
                num.trees=1000,
                trControl=trControl,
                tuneGrid=tuneGrid )
cv_forest_ranger = ranger(CTR~targeting_score + visual_appeal + contextual_relevance + headline_length + cta_strength + position_on_page + ad_format + time_of_day + brand_familiarity + device_type + ad_frequency + market_saturation,data = data_CTR,
                          num.trees = 1000, 
                          mtry=cvModel$bestTune$mtry, 
                          min.node.size = cvModel$bestTune$min.node.size, 
                          splitrule = cvModel$bestTune$splitrule)
pred = predict(cv_forest_ranger, data =scoring_data, num.trees = 1000)

submission_file_n3 = data.frame(id = scoring_data$id, CTR = pred$predictions) # predict() returns a ranger object; the values are in $predictions

submission_file_n31 <- submission_file_n3 %>%
  mutate(across(where(is.numeric), ~ ifelse(is.na(.), median(., na.rm = TRUE), .))) # base R's mode() returns the storage type and has no na.rm argument, so use median()
write.csv(submission_file_n31, 'sample_submission_model_n31.csv',row.names = F)

Final Prediction Model: Generalized Additive Models (GAM)

I started by loading and exploring the training dataset, which included the target variable CTR and several predictors. Missing values were identified as a key challenge, prompting the use of median imputation for numeric variables and mode imputation for categorical variables. This step emphasized the importance of data quality and consistency in ensuring reliable model performance.
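
The two imputation rules can be illustrated in base R on a toy data frame. This is a minimal sketch, not the recipes pipeline itself: the columns score and device and the helper stat_mode are hypothetical names chosen for the example.

```r
# Toy data with gaps; column names are hypothetical.
toy <- data.frame(
  score  = c(1, 2, NA, 4, 100),
  device = c("Mobile", NA, "Desktop", "Mobile", "Mobile"),
  stringsAsFactors = FALSE
)

# Most frequent non-missing value. Base R's mode() reports the storage type,
# not the statistical mode, so a small helper is needed.
stat_mode <- function(x) {
  counts <- table(x)              # table() drops NA by default
  names(counts)[which.max(counts)]
}

toy$score[is.na(toy$score)]   <- median(toy$score, na.rm = TRUE)  # numeric: median
toy$device[is.na(toy$device)] <- stat_mode(toy$device)            # categorical: mode

toy
```

The median is used for numeric columns because, unlike the mean, it is not pulled toward extreme values such as the 100 in this toy column.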

Using the recipes package, I defined a data cleaning pipeline that streamlined the handling of missing values and ensured that the id variable was excluded as a predictor. Applying the recipe to both the training and scoring datasets ensured consistent preprocessing across datasets. This experience highlighted the value of modular preprocessing for repeatable and error-free data cleaning.

Stepwise regression was employed to refine the feature set by iteratively adding and removing predictors based on their contribution to the model. The visualizations created using ggplot2 helped identify nonlinear relationships between predictors like targeting_score and CTR. These insights informed the decision to use smoothing splines in the GAM to better capture these nonlinear effects. This iterative approach taught me the importance of exploratory visualization and systematic feature selection.

I used a Generalized Additive Model (GAM) to incorporate nonlinear relationships for the predictors identified during visualization. The mgcv package allowed for flexible modeling with smoothing splines, while REML (restricted maximum likelihood) was used to estimate the smoothing parameters. Predictions for the scoring dataset were generated using the GAM, and the results were saved in the required submission format. This process reinforced the importance of aligning modeling choices with data patterns and competition requirements.
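
The benefit of a smooth term over a straight line can be shown on simulated data. The sketch below is an illustration under stated assumptions, not the model fitted to the CTR data: the variables x and y are simulated, and the curved signal is invented for the example.

```r
library(mgcv)

# Simulated data with a curved signal (an assumption for illustration).
set.seed(1031)
n <- 500
x <- runif(n, 0, 10)
y <- sin(x) + rnorm(n, sd = 0.3)
sim <- data.frame(x = x, y = y)

linear_fit <- lm(y ~ x, data = sim)                       # straight-line fit
gam_fit    <- gam(y ~ s(x), data = sim, method = "REML")  # s() adds a penalised smooth

rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))
rmse_linear <- rmse(sim$y, predict(linear_fit, newdata = sim))
rmse_gam    <- rmse(sim$y, predict(gam_fit, newdata = sim))

c(linear = rmse_linear, gam = rmse_gam)
```

These are in-sample errors, so the comparison only shows that the smooth term captures curvature a linear term misses. A held-out set or cross-validation would be needed to judge predictive accuracy.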

In the end, the GAM produced a public RMSE of 0.07864, showing that modeling nonlinear relationships improved prediction accuracy over the earlier models.

# Load necessary libraries
library(recipes)  # For data preprocessing

## Warning: package 'recipes' was built under R version 4.3.3

## 
## Attaching package: 'recipes'

## The following object is masked from 'package:stats':
## 
##     step

library(dplyr)    # For data manipulation
library(leaps)    # For stepwise regression

## Warning: package 'leaps' was built under R version 4.3.3

library(MASS)     # For stepwise regression (stepAIC function)

## 
## Attaching package: 'MASS'

## The following object is masked from 'package:dplyr':
## 
##     select

library(ggplot2)  # For data visualization
library(gridExtra)  # For arranging multiple ggplot objects

## 
## Attaching package: 'gridExtra'

## The following object is masked from 'package:dplyr':
## 
##     combine

## The following object is masked from 'package:randomForest':
## 
##     combine

library(mgcv)     # For Generalized Additive Models (GAM)

## Loading required package: nlme

## 
## Attaching package: 'nlme'

## The following object is masked from 'package:dplyr':
## 
##     collapse

## This is mgcv 1.9-1. For overview type 'help("mgcv-package")'.

# Load the dataset
data <- read.csv('analysis_data.csv')  # Training dataset with CTR as the target variable

# Data Cleaning using recipes
data_recipe <- recipe(CTR ~ ., data = data) %>%  # Define the recipe for preprocessing
  update_role(id, new_role = "id") %>%           # Specify that 'id' is not a predictor
  step_impute_median(all_numeric_predictors()) %>%  # Impute missing numeric values with the median
  step_impute_mode(all_nominal_predictors()) %>%   # Impute missing categorical values with the mode
  prep()                                          # Prepare the recipe for baking
data_clean <- bake(data_recipe, new_data = data)  # Apply the recipe to clean the dataset

# Stepwise Regression to Select Predictors
full_model <- lm(CTR ~ . - id, data = data_clean)  # Fit a full linear model excluding 'id'
stepwise_model_both <- stepAIC(full_model, direction = "both")  # Perform stepwise regression

## Start:  AIC=-14644.85
## CTR ~ (id + targeting_score + visual_appeal + contextual_relevance + 
##     headline_length + cta_strength + position_on_page + ad_format + 
##     age_group + gender + location + time_of_day + day_of_week + 
##     brand_familiarity + device_type + ad_frequency + market_saturation + 
##     seasonality + headline_sentiment + headline_word_count + 
##     headline_power_words + body_text_length + body_word_count + 
##     body_sentiment + headline_question + headline_numbers + body_keyword_density + 
##     body_readability_score) - id
## 
##                          Df Sum of Sq    RSS    AIC
## - age_group               7     0.147 100.56 -14653
## - day_of_week             6     0.138 100.56 -14651
## - location                3     0.073 100.49 -14648
## - gender                  2     0.041 100.46 -14647
## - seasonality             1     0.000 100.42 -14647
## - body_keyword_density    1     0.001 100.42 -14647
## - body_word_count         1     0.001 100.42 -14647
## - body_readability_score  1     0.006 100.42 -14647
## - body_text_length        1     0.011 100.43 -14646
## - brand_familiarity       1     0.012 100.43 -14646
## - headline_word_count     1     0.014 100.43 -14646
## - headline_power_words    1     0.015 100.43 -14646
## - headline_question       1     0.018 100.44 -14646
## - headline_numbers        1     0.024 100.44 -14646
## - ad_frequency            1     0.038 100.45 -14645
## - time_of_day             3     0.143 100.56 -14645
## - market_saturation       1     0.045 100.46 -14645
## <none>                                100.42 -14645
## - body_sentiment          1     0.054 100.47 -14645
## - headline_sentiment      1     0.055 100.47 -14645
## - position_on_page        2     0.213 100.63 -14640
## - device_type             2     0.298 100.72 -14637
## - contextual_relevance    1     0.545 100.96 -14625
## - headline_length         1     1.508 101.92 -14587
## - cta_strength            1     2.974 103.39 -14530
## - ad_format               2     3.478 103.89 -14513
## - targeting_score         1    19.940 120.36 -13922
## - visual_appeal           1    43.063 143.48 -13219
## 
## Step:  AIC=-14653.01
## CTR ~ targeting_score + visual_appeal + contextual_relevance + 
##     headline_length + cta_strength + position_on_page + ad_format + 
##     gender + location + time_of_day + day_of_week + brand_familiarity + 
##     device_type + ad_frequency + market_saturation + seasonality + 
##     headline_sentiment + headline_word_count + headline_power_words + 
##     body_text_length + body_word_count + body_sentiment + headline_question + 
##     headline_numbers + body_keyword_density + body_readability_score
## 
##                          Df Sum of Sq    RSS    AIC
## - day_of_week             6     0.134 100.70 -14660
## - location                3     0.073 100.64 -14656
## - gender                  2     0.040 100.60 -14655
## - seasonality             1     0.000 100.56 -14655
## - body_word_count         1     0.001 100.56 -14655
## - body_keyword_density    1     0.001 100.56 -14655
## - body_readability_score  1     0.006 100.57 -14655
## - body_text_length        1     0.011 100.58 -14655
## - headline_word_count     1     0.013 100.58 -14654
## - brand_familiarity       1     0.013 100.58 -14654
## - headline_power_words    1     0.015 100.58 -14654
## - headline_question       1     0.023 100.59 -14654
## - headline_numbers        1     0.027 100.59 -14654
## - ad_frequency            1     0.040 100.60 -14653
## - time_of_day             3     0.149 100.71 -14653
## - market_saturation       1     0.050 100.61 -14653
## <none>                                100.56 -14653
## - body_sentiment          1     0.057 100.62 -14653
## - headline_sentiment      1     0.060 100.62 -14653
## - position_on_page        2     0.213 100.78 -14648
## - device_type             2     0.297 100.86 -14645
## + age_group               7     0.147 100.42 -14645
## - contextual_relevance    1     0.553 101.12 -14633
## - headline_length         1     1.522 102.09 -14595
## - cta_strength            1     3.022 103.59 -14537
## - ad_format               2     3.450 104.02 -14522
## - targeting_score         1    19.877 120.44 -13934
## - visual_appeal           1    43.150 143.71 -13227
## 
## Step:  AIC=-14659.69
## CTR ~ targeting_score + visual_appeal + contextual_relevance + 
##     headline_length + cta_strength + position_on_page + ad_format + 
##     gender + location + time_of_day + brand_familiarity + device_type + 
##     ad_frequency + market_saturation + seasonality + headline_sentiment + 
##     headline_word_count + headline_power_words + body_text_length + 
##     body_word_count + body_sentiment + headline_question + headline_numbers + 
##     body_keyword_density + body_readability_score
## 
##                          Df Sum of Sq    RSS    AIC
## - location                3     0.072 100.77 -14663
## - gender                  2     0.043 100.74 -14662
## - seasonality             1     0.000 100.70 -14662
## - body_keyword_density    1     0.001 100.70 -14662
## - body_word_count         1     0.001 100.70 -14662
## - body_readability_score  1     0.006 100.70 -14661
## - body_text_length        1     0.009 100.71 -14661
## - brand_familiarity       1     0.012 100.71 -14661
## - headline_word_count     1     0.012 100.71 -14661
## - headline_power_words    1     0.013 100.71 -14661
## - headline_question       1     0.023 100.72 -14661
## - headline_numbers        1     0.030 100.73 -14660
## - ad_frequency            1     0.038 100.74 -14660
## - market_saturation       1     0.049 100.75 -14660
## <none>                                100.70 -14660
## - time_of_day             3     0.153 100.85 -14660
## - body_sentiment          1     0.060 100.76 -14659
## - headline_sentiment      1     0.061 100.76 -14659
## - position_on_page        2     0.208 100.91 -14655
## + day_of_week             6     0.134 100.56 -14653
## - device_type             2     0.306 101.00 -14652
## + age_group               7     0.142 100.56 -14651
## - contextual_relevance    1     0.554 101.25 -14640
## - headline_length         1     1.531 102.23 -14601
## - cta_strength            1     3.041 103.74 -14543
## - ad_format               2     3.403 104.10 -14531
## - targeting_score         1    20.008 120.71 -13937
## - visual_appeal           1    43.232 143.93 -13233
## 
## Step:  AIC=-14662.82
## CTR ~ targeting_score + visual_appeal + contextual_relevance + 
##     headline_length + cta_strength + position_on_page + ad_format + 
##     gender + time_of_day + brand_familiarity + device_type + 
##     ad_frequency + market_saturation + seasonality + headline_sentiment + 
##     headline_word_count + headline_power_words + body_text_length + 
##     body_word_count + body_sentiment + headline_question + headline_numbers + 
##     body_keyword_density + body_readability_score
## 
##                          Df Sum of Sq    RSS    AIC
## - gender                  2     0.039 100.81 -14665
## - seasonality             1     0.000 100.77 -14665
## - body_keyword_density    1     0.001 100.77 -14665
## - body_word_count         1     0.002 100.77 -14665
## - body_readability_score  1     0.006 100.78 -14665
## - body_text_length        1     0.009 100.78 -14664
## - brand_familiarity       1     0.012 100.78 -14664
## - headline_word_count     1     0.013 100.78 -14664
## - headline_power_words    1     0.015 100.78 -14664
## - headline_question       1     0.022 100.79 -14664
## - headline_numbers        1     0.030 100.80 -14664
## - ad_frequency            1     0.038 100.81 -14663
## - market_saturation       1     0.050 100.82 -14663
## <none>                                100.77 -14663
## - time_of_day             3     0.152 100.92 -14663
## - headline_sentiment      1     0.057 100.83 -14663
## - body_sentiment          1     0.062 100.83 -14662
## + location                3     0.072 100.70 -14660
## - position_on_page        2     0.202 100.97 -14659
## + day_of_week             6     0.133 100.64 -14656
## - device_type             2     0.302 101.07 -14655
## + age_group               7     0.142 100.63 -14654
## - contextual_relevance    1     0.559 101.33 -14643
## - headline_length         1     1.537 102.31 -14604
## - cta_strength            1     3.075 103.85 -14545
## - ad_format               2     3.408 104.18 -14534
## - targeting_score         1    19.998 120.77 -13941
## - visual_appeal           1    43.354 144.12 -13234
## 
## Step:  AIC=-14665.27
## CTR ~ targeting_score + visual_appeal + contextual_relevance + 
##     headline_length + cta_strength + position_on_page + ad_format + 
##     time_of_day + brand_familiarity + device_type + ad_frequency + 
##     market_saturation + seasonality + headline_sentiment + headline_word_count + 
##     headline_power_words + body_text_length + body_word_count + 
##     body_sentiment + headline_question + headline_numbers + body_keyword_density + 
##     body_readability_score
## 
##                          Df Sum of Sq    RSS    AIC
## - seasonality             1     0.000 100.81 -14667
## - body_keyword_density    1     0.001 100.81 -14667
## - body_word_count         1     0.002 100.81 -14667
## - body_readability_score  1     0.006 100.81 -14667
## - body_text_length        1     0.009 100.82 -14667
## - brand_familiarity       1     0.012 100.82 -14667
## - headline_word_count     1     0.013 100.82 -14667
## - headline_power_words    1     0.014 100.82 -14667
## - headline_question       1     0.021 100.83 -14666
## - headline_numbers        1     0.030 100.84 -14666
## - ad_frequency            1     0.039 100.85 -14666
## <none>                                100.81 -14665
## - time_of_day             3     0.152 100.96 -14665
## - market_saturation       1     0.052 100.86 -14665
## - headline_sentiment      1     0.057 100.87 -14665
## - body_sentiment          1     0.061 100.87 -14665
## + gender                  2     0.039 100.77 -14663
## + location                3     0.068 100.74 -14662
## - position_on_page        2     0.203 101.01 -14661
## + day_of_week             6     0.136 100.67 -14659
## - device_type             2     0.304 101.11 -14657
## + age_group               7     0.141 100.67 -14657
## - contextual_relevance    1     0.555 101.36 -14645
## - headline_length         1     1.533 102.34 -14607
## - cta_strength            1     3.088 103.90 -14547
## - ad_format               2     3.403 104.21 -14536
## - targeting_score         1    19.984 120.79 -13944
## - visual_appeal           1    43.328 144.14 -13237
## 
## Step:  AIC=-14667.26
## CTR ~ targeting_score + visual_appeal + contextual_relevance + 
##     headline_length + cta_strength + position_on_page + ad_format + 
##     time_of_day + brand_familiarity + device_type + ad_frequency + 
##     market_saturation + headline_sentiment + headline_word_count + 
##     headline_power_words + body_text_length + body_word_count + 
##     body_sentiment + headline_question + headline_numbers + body_keyword_density + 
##     body_readability_score
## 
##                          Df Sum of Sq    RSS    AIC
## - body_keyword_density    1     0.001 100.81 -14669
## - body_word_count         1     0.002 100.81 -14669
## - body_readability_score  1     0.006 100.82 -14669
## - body_text_length        1     0.009 100.82 -14669
## - brand_familiarity       1     0.012 100.82 -14669
## - headline_word_count     1     0.013 100.82 -14669
## - headline_power_words    1     0.014 100.82 -14669
## - headline_question       1     0.021 100.83 -14668
## - headline_numbers        1     0.030 100.84 -14668
## - ad_frequency            1     0.039 100.85 -14668
## <none>                                100.81 -14667
## - time_of_day             3     0.152 100.96 -14667
## - market_saturation       1     0.052 100.86 -14667
## - headline_sentiment      1     0.057 100.87 -14667
## - body_sentiment          1     0.061 100.87 -14667
## + seasonality             1     0.000 100.81 -14665
## + gender                  2     0.039 100.77 -14665
## + location                3     0.068 100.74 -14664
## - position_on_page        2     0.204 101.01 -14663
## + day_of_week             6     0.136 100.67 -14661
## - device_type             2     0.304 101.11 -14659
## + age_group               7     0.140 100.67 -14659
## - contextual_relevance    1     0.555 101.36 -14647
## - headline_length         1     1.533 102.34 -14609
## - cta_strength            1     3.088 103.90 -14549
## - ad_format               2     3.403 104.21 -14538
## - targeting_score         1    19.984 120.79 -13946
## - visual_appeal           1    43.332 144.14 -13239
## 
## Step:  AIC=-14669.22
## CTR ~ targeting_score + visual_appeal + contextual_relevance + 
##     headline_length + cta_strength + position_on_page + ad_format + 
##     time_of_day + brand_familiarity + device_type + ad_frequency + 
##     market_saturation + headline_sentiment + headline_word_count + 
##     headline_power_words + body_text_length + body_word_count + 
##     body_sentiment + headline_question + headline_numbers + body_readability_score
## 
##                          Df Sum of Sq    RSS    AIC
## - body_word_count         1     0.002 100.81 -14671
## - body_readability_score  1     0.006 100.82 -14671
## - body_text_length        1     0.009 100.82 -14671
## - brand_familiarity       1     0.012 100.82 -14671
## - headline_word_count     1     0.013 100.82 -14671
## - headline_power_words    1     0.014 100.83 -14671
## - headline_question       1     0.021 100.83 -14670
## - headline_numbers        1     0.030 100.84 -14670
## - ad_frequency            1     0.039 100.85 -14670
## <none>                                100.81 -14669
## - time_of_day             3     0.152 100.96 -14669
## - market_saturation       1     0.052 100.86 -14669
## - headline_sentiment      1     0.057 100.87 -14669
## - body_sentiment          1     0.061 100.87 -14669
## + body_keyword_density    1     0.001 100.81 -14667
## + seasonality             1     0.000 100.81 -14667
## + gender                  2     0.039 100.77 -14667
## + location                3     0.068 100.74 -14666
## - position_on_page        2     0.204 101.02 -14665
## + day_of_week             6     0.136 100.67 -14663
## - device_type             2     0.304 101.11 -14661
## + age_group               7     0.141 100.67 -14661
## - contextual_relevance    1     0.555 101.36 -14649
## - headline_length         1     1.533 102.34 -14611
## - cta_strength            1     3.089 103.90 -14550
## - ad_format               2     3.405 104.22 -14540
## - targeting_score         1    19.989 120.80 -13948
## - visual_appeal           1    43.331 144.14 -13241
## 
## Step:  AIC=-14671.15
## CTR ~ targeting_score + visual_appeal + contextual_relevance + 
##     headline_length + cta_strength + position_on_page + ad_format + 
##     time_of_day + brand_familiarity + device_type + ad_frequency + 
##     market_saturation + headline_sentiment + headline_word_count + 
##     headline_power_words + body_text_length + body_sentiment + 
##     headline_question + headline_numbers + body_readability_score
## 
##                          Df Sum of Sq    RSS    AIC
## - body_readability_score  1     0.006 100.82 -14673
## - body_text_length        1     0.010 100.82 -14673
## - headline_word_count     1     0.012 100.83 -14673
## - brand_familiarity       1     0.012 100.83 -14673
## - headline_power_words    1     0.014 100.83 -14673
## - headline_question       1     0.021 100.83 -14672
## - headline_numbers        1     0.030 100.84 -14672
## - ad_frequency            1     0.039 100.85 -14672
## <none>                                100.81 -14671
## - time_of_day             3     0.152 100.97 -14671
## - market_saturation       1     0.052 100.86 -14671
## - headline_sentiment      1     0.058 100.87 -14671
## - body_sentiment          1     0.061 100.87 -14671
## + body_word_count         1     0.002 100.81 -14669
## + body_keyword_density    1     0.001 100.81 -14669
## + seasonality             1     0.000 100.81 -14669
## + gender                  2     0.040 100.77 -14669
## + location                3     0.069 100.74 -14668
## - position_on_page        2     0.204 101.02 -14667
## + day_of_week             6     0.137 100.68 -14665
## - device_type             2     0.305 101.12 -14663
## + age_group               7     0.141 100.67 -14663
## - contextual_relevance    1     0.554 101.37 -14651
## - headline_length         1     1.533 102.35 -14613
## - cta_strength            1     3.095 103.91 -14552
## - ad_format               2     3.408 104.22 -14542
## - targeting_score         1    19.988 120.80 -13950
## - visual_appeal           1    43.331 144.14 -13243
## 
## Step:  AIC=-14672.92
## CTR ~ targeting_score + visual_appeal + contextual_relevance + 
##     headline_length + cta_strength + position_on_page + ad_format + 
##     time_of_day + brand_familiarity + device_type + ad_frequency + 
##     market_saturation + headline_sentiment + headline_word_count + 
##     headline_power_words + body_text_length + body_sentiment + 
##     headline_question + headline_numbers
## 
##                          Df Sum of Sq    RSS    AIC
## - body_text_length        1     0.010 100.83 -14674
## - brand_familiarity       1     0.012 100.83 -14674
## - headline_word_count     1     0.012 100.83 -14674
## - headline_power_words    1     0.015 100.83 -14674
## - headline_question       1     0.021 100.84 -14674
## - headline_numbers        1     0.030 100.85 -14674
## - ad_frequency            1     0.039 100.86 -14673
## <none>                                100.82 -14673
## - time_of_day             3     0.152 100.97 -14673
## - market_saturation       1     0.051 100.87 -14673
## - headline_sentiment      1     0.058 100.88 -14673
## - body_sentiment          1     0.060 100.88 -14672
## + body_readability_score  1     0.006 100.81 -14671
## + body_word_count         1     0.002 100.82 -14671
## + body_keyword_density    1     0.001 100.82 -14671
## + seasonality             1     0.000 100.82 -14671
## + gender                  2     0.039 100.78 -14670
## + location                3     0.069 100.75 -14670
## - position_on_page        2     0.204 101.02 -14669
## + day_of_week             6     0.137 100.68 -14666
## - device_type             2     0.306 101.12 -14665
## + age_group               7     0.141 100.68 -14664
## - contextual_relevance    1     0.554 101.37 -14653
## - headline_length         1     1.532 102.35 -14615
## - cta_strength            1     3.095 103.91 -14554
## - ad_format               2     3.414 104.23 -14544
## - targeting_score         1    19.997 120.82 -13951
## - visual_appeal           1    43.327 144.15 -13245
## 
## Step:  AIC=-14674.53
## CTR ~ targeting_score + visual_appeal + contextual_relevance + 
##     headline_length + cta_strength + position_on_page + ad_format + 
##     time_of_day + brand_familiarity + device_type + ad_frequency + 
##     market_saturation + headline_sentiment + headline_word_count + 
##     headline_power_words + body_sentiment + headline_question + 
##     headline_numbers
## 
##                          Df Sum of Sq    RSS    AIC
## - headline_word_count     1     0.012 100.84 -14676
## - brand_familiarity       1     0.012 100.84 -14676
## - headline_power_words    1     0.015 100.84 -14676
## - headline_question       1     0.022 100.85 -14676
## - headline_numbers        1     0.030 100.86 -14675
## - ad_frequency            1     0.039 100.87 -14675
## - time_of_day             3     0.149 100.98 -14675
## <none>                                100.83 -14674
## - market_saturation       1     0.052 100.88 -14674
## - headline_sentiment      1     0.058 100.89 -14674
## - body_sentiment          1     0.060 100.89 -14674
## + body_text_length        1     0.010 100.82 -14673
## + body_readability_score  1     0.006 100.82 -14673
## + body_word_count         1     0.002 100.83 -14673
## + body_keyword_density    1     0.001 100.83 -14673
## + seasonality             1     0.000 100.83 -14672
## + gender                  2     0.040 100.79 -14672
## + location                3     0.069 100.76 -14671
## - position_on_page        2     0.207 101.03 -14670
## + day_of_week             6     0.135 100.69 -14668
## - device_type             2     0.307 101.14 -14666
## + age_group               7     0.140 100.69 -14666
## - contextual_relevance    1     0.554 101.38 -14655
## - headline_length         1     1.522 102.35 -14617
## - cta_strength            1     3.090 103.92 -14556
## - ad_format               2     3.405 104.23 -14546
## - targeting_score         1    20.050 120.88 -13951
## - visual_appeal           1    43.327 144.16 -13247
## 
## Step:  AIC=-14676.07
## CTR ~ targeting_score + visual_appeal + contextual_relevance + 
##     headline_length + cta_strength + position_on_page + ad_format + 
##     time_of_day + brand_familiarity + device_type + ad_frequency + 
##     market_saturation + headline_sentiment + headline_power_words + 
##     body_sentiment + headline_question + headline_numbers
## 
##                          Df Sum of Sq    RSS    AIC
## - brand_familiarity       1     0.012 100.85 -14678
## - headline_power_words    1     0.015 100.86 -14678
## - headline_question       1     0.021 100.86 -14677
## - headline_numbers        1     0.028 100.87 -14677
## - ad_frequency            1     0.038 100.88 -14677
## - time_of_day             3     0.149 100.99 -14676
## <none>                                100.84 -14676
## - market_saturation       1     0.051 100.89 -14676
## - headline_sentiment      1     0.057 100.90 -14676
## - body_sentiment          1     0.059 100.90 -14676
## + headline_word_count     1     0.012 100.83 -14674
## + body_text_length        1     0.009 100.83 -14674
## + body_readability_score  1     0.006 100.83 -14674
## + body_word_count         1     0.002 100.84 -14674
## + body_keyword_density    1     0.001 100.84 -14674
## + seasonality             1     0.000 100.84 -14674
## + gender                  2     0.040 100.80 -14674
## + location                3     0.069 100.77 -14673
## - position_on_page        2     0.208 101.05 -14672
## + day_of_week             6     0.134 100.70 -14669
## - device_type             2     0.306 101.14 -14668
## + age_group               7     0.139 100.70 -14668
## - contextual_relevance    1     0.556 101.40 -14656
## - headline_length         1     1.539 102.38 -14618
## - cta_strength            1     3.085 103.92 -14558
## - ad_format               2     3.396 104.24 -14548
## - targeting_score         1    20.063 120.90 -13952
## - visual_appeal           1    43.332 144.17 -13248
## 
## Step:  AIC=-14677.61
## CTR ~ targeting_score + visual_appeal + contextual_relevance + 
##     headline_length + cta_strength + position_on_page + ad_format + 
##     time_of_day + device_type + ad_frequency + market_saturation + 
##     headline_sentiment + headline_power_words + body_sentiment + 
##     headline_question + headline_numbers
## 
##                          Df Sum of Sq    RSS    AIC
## - headline_power_words    1     0.015 100.87 -14679
## - headline_question       1     0.020 100.87 -14679
## - headline_numbers        1     0.029 100.88 -14678
## - ad_frequency            1     0.037 100.89 -14678
## - time_of_day             3     0.149 101.00 -14678
## - market_saturation       1     0.050 100.90 -14678
## <none>                                100.85 -14678
## - body_sentiment          1     0.058 100.91 -14677
## - headline_sentiment      1     0.059 100.91 -14677
## + brand_familiarity       1     0.012 100.84 -14676
## + headline_word_count     1     0.011 100.84 -14676
## + body_text_length        1     0.009 100.84 -14676
## + body_readability_score  1     0.006 100.84 -14676
## + body_word_count         1     0.002 100.85 -14676
## + body_keyword_density    1     0.001 100.85 -14676
## + seasonality             1     0.000 100.85 -14676
## + gender                  2     0.040 100.81 -14675
## + location                3     0.070 100.78 -14674
## - position_on_page        2     0.206 101.06 -14673
## + day_of_week             6     0.133 100.72 -14671
## - device_type             2     0.308 101.16 -14669
## + age_group               7     0.140 100.71 -14669
## - contextual_relevance    1     0.559 101.41 -14658
## - headline_length         1     1.536 102.39 -14619
## - cta_strength            1     3.080 103.93 -14559
## - ad_format               2     3.392 104.24 -14549
## - targeting_score         1    20.086 120.94 -13953
## - visual_appeal           1    43.325 144.18 -13250
## 
## Step:  AIC=-14679.01
## CTR ~ targeting_score + visual_appeal + contextual_relevance + 
##     headline_length + cta_strength + position_on_page + ad_format + 
##     time_of_day + device_type + ad_frequency + market_saturation + 
##     headline_sentiment + body_sentiment + headline_question + 
##     headline_numbers
## 
##                          Df Sum of Sq    RSS    AIC
## - headline_question       1     0.021 100.89 -14680
## - headline_numbers        1     0.030 100.90 -14680
## - ad_frequency            1     0.037 100.90 -14680
## - time_of_day             3     0.151 101.02 -14679
## <none>                                100.87 -14679
## - market_saturation       1     0.051 100.92 -14679
## - headline_sentiment      1     0.058 100.92 -14679
## - body_sentiment          1     0.058 100.92 -14679
## + headline_power_words    1     0.015 100.85 -14678
## + brand_familiarity       1     0.012 100.86 -14678
## + headline_word_count     1     0.011 100.86 -14677
## + body_text_length        1     0.009 100.86 -14677
## + body_readability_score  1     0.007 100.86 -14677
## + body_word_count         1     0.003 100.86 -14677
## + body_keyword_density    1     0.001 100.86 -14677
## + seasonality             1     0.000 100.87 -14677
## + gender                  2     0.039 100.83 -14677
## + location                3     0.071 100.80 -14676
## - position_on_page        2     0.205 101.07 -14675
## + day_of_week             6     0.132 100.73 -14672
## - device_type             2     0.306 101.17 -14671
## + age_group               7     0.140 100.73 -14670
## - contextual_relevance    1     0.560 101.43 -14659
## - headline_length         1     1.528 102.39 -14621
## - cta_strength            1     3.075 103.94 -14561
## - ad_format               2     3.397 104.26 -14550
## - targeting_score         1    20.118 120.98 -13954
## - visual_appeal           1    43.314 144.18 -13252
## 
## Step:  AIC=-14680.17
## CTR ~ targeting_score + visual_appeal + contextual_relevance + 
##     headline_length + cta_strength + position_on_page + ad_format + 
##     time_of_day + device_type + ad_frequency + market_saturation + 
##     headline_sentiment + body_sentiment + headline_numbers
## 
##                          Df Sum of Sq    RSS    AIC
## - headline_numbers        1     0.032 100.92 -14681
## - ad_frequency            1     0.036 100.92 -14681
## <none>                                100.89 -14680
## - time_of_day             3     0.154 101.04 -14680
## - market_saturation       1     0.054 100.94 -14680
## - headline_sentiment      1     0.056 100.94 -14680
## - body_sentiment          1     0.058 100.94 -14680
## + headline_question       1     0.021 100.87 -14679
## + headline_power_words    1     0.016 100.87 -14679
## + brand_familiarity       1     0.011 100.88 -14679
## + body_text_length        1     0.010 100.88 -14679
## + headline_word_count     1     0.010 100.88 -14679
## + body_readability_score  1     0.007 100.88 -14678
## + body_word_count         1     0.003 100.89 -14678
## + body_keyword_density    1     0.001 100.89 -14678
## + seasonality             1     0.000 100.89 -14678
## + gender                  2     0.038 100.85 -14678
## + location                3     0.070 100.82 -14677
## - position_on_page        2     0.206 101.09 -14676
## + day_of_week             6     0.132 100.76 -14673
## - device_type             2     0.307 101.19 -14672
## + age_group               7     0.144 100.74 -14672
## - contextual_relevance    1     0.561 101.45 -14660
## - headline_length         1     1.523 102.41 -14622
## - cta_strength            1     3.077 103.96 -14562
## - ad_format               2     3.401 104.29 -14552
## - targeting_score         1    20.119 121.01 -13955
## - visual_appeal           1    43.329 144.22 -13253
## 
## Step:  AIC=-14680.9
## CTR ~ targeting_score + visual_appeal + contextual_relevance + 
##     headline_length + cta_strength + position_on_page + ad_format + 
##     time_of_day + device_type + ad_frequency + market_saturation + 
##     headline_sentiment + body_sentiment
## 
##                          Df Sum of Sq    RSS    AIC
## - ad_frequency            1     0.036 100.95 -14682
## <none>                                100.92 -14681
## - time_of_day             3     0.153 101.07 -14681
## - market_saturation       1     0.055 100.97 -14681
## - headline_sentiment      1     0.055 100.97 -14681
## - body_sentiment          1     0.059 100.98 -14681
## + headline_numbers        1     0.032 100.89 -14680
## + headline_question       1     0.023 100.90 -14680
## + headline_power_words    1     0.018 100.90 -14680
## + brand_familiarity       1     0.011 100.91 -14679
## + body_text_length        1     0.010 100.91 -14679
## + headline_word_count     1     0.008 100.91 -14679
## + body_readability_score  1     0.006 100.91 -14679
## + body_word_count         1     0.003 100.92 -14679
## + body_keyword_density    1     0.001 100.92 -14679
## + seasonality             1     0.000 100.92 -14679
## + gender                  2     0.038 100.88 -14678
## + location                3     0.070 100.85 -14678
## - position_on_page        2     0.204 101.12 -14677
## + day_of_week             6     0.135 100.78 -14674
## - device_type             2     0.305 101.22 -14673
## + age_group               7     0.147 100.77 -14673
## - contextual_relevance    1     0.569 101.49 -14660
## - headline_length         1     1.515 102.43 -14623
## - cta_strength            1     3.073 103.99 -14563
## - ad_format               2     3.406 104.33 -14552
## - targeting_score         1    20.094 121.01 -13957
## - visual_appeal           1    43.349 144.27 -13254
## 
## Step:  AIC=-14681.48
## CTR ~ targeting_score + visual_appeal + contextual_relevance + 
##     headline_length + cta_strength + position_on_page + ad_format + 
##     time_of_day + device_type + market_saturation + headline_sentiment + 
##     body_sentiment
## 
##                          Df Sum of Sq    RSS    AIC
## <none>                                100.95 -14682
## - time_of_day             3     0.154 101.11 -14681
## - headline_sentiment      1     0.054 101.01 -14681
## - market_saturation       1     0.055 101.01 -14681
## - body_sentiment          1     0.061 101.02 -14681
## + ad_frequency            1     0.036 100.92 -14681
## + headline_numbers        1     0.032 100.92 -14681
## + headline_question       1     0.022 100.93 -14680
## + headline_power_words    1     0.018 100.94 -14680
## + brand_familiarity       1     0.010 100.94 -14680
## + body_text_length        1     0.010 100.94 -14680
## + headline_word_count     1     0.007 100.95 -14680
## + body_readability_score  1     0.007 100.95 -14680
## + body_word_count         1     0.003 100.95 -14680
## + body_keyword_density    1     0.001 100.95 -14680
## + seasonality             1     0.000 100.95 -14680
## + gender                  2     0.039 100.92 -14679
## + location                3     0.070 100.89 -14678
## - position_on_page        2     0.203 101.16 -14678
## + day_of_week             6     0.134 100.82 -14675
## + age_group               7     0.149 100.81 -14673
## - device_type             2     0.306 101.26 -14673
## - contextual_relevance    1     0.561 101.52 -14661
## - headline_length         1     1.510 102.47 -14624
## - cta_strength            1     3.054 104.01 -14564
## - ad_format               2     3.421 104.38 -14552
## - targeting_score         1    20.081 121.04 -13958
## - visual_appeal           1    43.319 144.27 -13255
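
For reference, the AIC column in the trace above is the quantity that step() takes from extractAIC(). For a linear model this is n * log(RSS / n) + 2 * edf, with constant terms dropped. The sketch below checks that identity on the built-in mtcars data rather than on data_clean; the names fit, manual_aic, and step_aic are illustrative only.

```r
# Illustration on built-in data (mtcars), not on the report's data_clean
fit <- lm(mpg ~ wt + hp, data = mtcars)

n   <- nrow(mtcars)           # number of observations
rss <- sum(residuals(fit)^2)  # residual sum of squares (the RSS column)
edf <- n - df.residual(fit)   # number of estimated coefficients

# AIC as used by step(): n * log(RSS / n) + 2 * edf
manual_aic <- n * log(rss / n) + 2 * edf

# extractAIC() returns c(edf, AIC); step() ranks candidate models by the second value
step_aic <- extractAIC(fit)
print(c(manual = manual_aic, step = step_aic[2]))
```

Since the penalty is 2 per degree of freedom, dropping a one-df term that leaves RSS almost unchanged lowers AIC by about 2, which is the pattern seen across the steps above.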

summary(stepwise_model_both)  # Display the summary of the stepwise-selected model

## 
## Call:
## lm(formula = CTR ~ targeting_score + visual_appeal + contextual_relevance + 
##     headline_length + cta_strength + position_on_page + ad_format + 
##     time_of_day + device_type + market_saturation + headline_sentiment + 
##     body_sentiment, data = data_clean)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.3614 -0.0826 -0.0142  0.0552  3.4929 
## 
## Coefficients:
##                               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                 -0.0380534  0.0109823  -3.465 0.000536 ***
## targeting_score              0.0262577  0.0009330  28.143  < 2e-16 ***
## visual_appeal                0.0257729  0.0006235  41.335  < 2e-16 ***
## contextual_relevance         0.0238221  0.0050639   4.704 2.63e-06 ***
## headline_length             -0.0011788  0.0001527  -7.718 1.48e-14 ***
## cta_strength                 0.0103041  0.0009388  10.976  < 2e-16 ***
## position_on_pageSide Banner -0.0012955  0.0062392  -0.208 0.835517    
## position_on_pageTop Banner   0.0142881  0.0061993   2.305 0.021230 *  
## ad_formatText                0.0090833  0.0086866   1.046 0.295775    
## ad_formatVideo               0.0663542  0.0065899  10.069  < 2e-16 ***
## time_of_dayEvening          -0.0100089  0.0068552  -1.460 0.144354    
## time_of_dayMorning           0.0045652  0.0061161   0.746 0.455455    
## time_of_dayNight            -0.0124522  0.0088792  -1.402 0.160875    
## device_typeMobile            0.0195195  0.0056201   3.473 0.000520 ***
## device_typeTablet            0.0138661  0.0184547   0.751 0.452481    
## market_saturation           -0.0013665  0.0009293  -1.471 0.141500    
## headline_sentiment          -0.0018636  0.0012810  -1.455 0.145803    
## body_sentiment               0.0020057  0.0012932   1.551 0.120971    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1592 on 3982 degrees of freedom
## Multiple R-squared:  0.4247, Adjusted R-squared:  0.4223 
## F-statistic: 172.9 on 17 and 3982 DF,  p-value: < 2.2e-16
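
The residual standard error above (0.1592) is computed on the training rows, so it can be optimistic relative to the RMSE on unseen data. A minimal sketch of a hold-out estimate follows. It runs on simulated data, and rmse(), sim, and the 80/20 split are assumptions made for illustration, not part of the original analysis.

```r
# Hypothetical helper: root mean squared error between observed and predicted values
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

set.seed(42)
# Simulated stand-in for data_clean
sim <- data.frame(x1 = runif(500), x2 = runif(500))
sim$y <- 0.1 + 0.3 * sim$x1 - 0.2 * sim$x2 + rnorm(500, sd = 0.05)

# Random 80/20 split into training and hold-out rows
train_idx <- sample(nrow(sim), size = 0.8 * nrow(sim))
train <- sim[train_idx, ]
test  <- sim[-train_idx, ]

# Fit on the training rows only, then score both subsets
fit <- lm(y ~ x1 + x2, data = train)
rmse_train <- rmse(train$y, predict(fit, newdata = train))
rmse_test  <- rmse(test$y,  predict(fit, newdata = test))
print(c(train = rmse_train, test = rmse_test))
```

The same pattern would apply to the stepwise model by splitting data_clean before fitting, so that candidate models are compared on rows they have not seen.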

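# Visualize Potential Nonlinear Relationships
# Create scatterplots of CTR against four numeric predictors to identify variables needing smoothing
library(ggplot2)    # ggplot(), geom_point(), labs(), theme_minimal()
library(gridExtra)  # grid.arrange()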
p1 <- ggplot(data_clean, aes(x = targeting_score, y = CTR)) +
  geom_point(alpha = 0.5) +
  labs(title = "CTR vs Targeting Score", x = "Targeting Score", y = "CTR") +
  theme_minimal()
p2 <- ggplot(data_clean, aes(x = visual_appeal, y = CTR)) +
  geom_point(alpha = 0.5) +
  labs(title = "CTR vs Visual Appeal", x = "Visual Appeal", y = "CTR") +
  theme_minimal()
p3 <- ggplot(data_clean, aes(x = headline_length, y = CTR)) +
  geom_point(alpha = 0.5) +
  labs(title = "CTR vs Headline Length", x = "Headline Length", y = "CTR") +
  theme_minimal()
p4 <- ggplot(data_clean, aes(x = cta_strength, y = CTR)) +
  geom_point(alpha = 0.5) +
  labs(title = "CTR vs CTA Strength", x = "CTA Strength", y = "CTR") +
  theme_minimal()
grid.arrange(p1, p2, p3, p4, ncol = 2)  # Arrange plots in a 2x2 grid

# Fit a Generalized Additive Model (GAM)
library(mgcv)  # provides gam() and s(); method = 'REML' is an mgcv option
# Smooth terms s() are used for the four numeric predictors plotted above;
# contextual_relevance and the categorical predictors enter as linear terms
gam_model <- gam(
  CTR ~ s(targeting_score) + s(visual_appeal) + contextual_relevance +
    s(headline_length) + s(cta_strength) + position_on_page + ad_format +
    device_type,
  data = data_clean,
  method = 'REML'  # Use restricted maximum likelihood for smoothing parameter estimation
)
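
One way to check whether a smooth term is doing useful work is to look at its effective degrees of freedom (edf): values near 1 indicate an essentially linear effect, while larger values indicate curvature. The sketch below shows this on simulated data rather than data_clean. It assumes the mgcv package, and sim and fit are illustrative names.

```r
library(mgcv)  # gam() with s() smooth terms

set.seed(1)
# Simulated data: y is nonlinear in x1 and linear in x2
sim <- data.frame(x1 = runif(500), x2 = runif(500))
sim$y <- sin(2 * pi * sim$x1) + 0.5 * sim$x2 + rnorm(500, sd = 0.2)

fit <- gam(y ~ s(x1) + s(x2), data = sim, method = "REML")

# s.table has one row per smooth term; the "edf" column holds its effective degrees of freedom
edf <- summary(fit)$s.table[, "edf"]
print(round(edf, 2))
```

Running summary(gam_model) on the fitted model would give the same table for the four smooths used here; a term with edf close to 1 could be entered as a plain linear term instead.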

# Make Predictions
scoring_data <- read.csv('scoring_data.csv')  # Load scoring dataset
scoring_data_clean <- bake(data_recipe, new_data = scoring_data)  # Clean the scoring data using the recipe
pred2 <- predict(gam_model, newdata = scoring_data_clean)  # Generate predictions

# Create and Save Submission File
submission_file <- data.frame(id = scoring_data$id, CTR = pred2)  # Combine predictions with IDs
write.csv(submission_file, 'sample_submission.csv', row.names = FALSE)  # Save the submission file
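
Because gam() defaults to a Gaussian family with an identity link, nothing constrains the predictions to be non-negative, whereas a click-through rate cannot fall below zero. Assuming all true CTR values are at least zero, truncating negative predictions at zero cannot increase the RMSE. A small sketch with made-up numbers follows; raw_pred and clipped_pred are hypothetical names, with raw_pred standing in for pred2.

```r
# Made-up predictions standing in for pred2
raw_pred <- c(-0.03, 0.00, 0.12, 0.45, -0.001)

# Truncate negative values at zero; non-negative values are left unchanged
clipped_pred <- pmax(raw_pred, 0)
print(clipped_pred)
```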