Including interaction terms in the model


Common sense suggests that, among models with approximately equal performance characteristics, the optimal choice is the one with the fewest “moving parts”. This principle is often (simplistically) referred to as Occam’s razor[36] and quoted as “Numquam ponenda est pluralitas sine necessitate” (Plurality must never be posited without necessity) and “Frustra fit per plura quod potest fieri per pauciora” (It is futile to do with more what can be done with less)[32]. In agreement with this principle, we generally prefer linear models to their nonlinear counterparts as long as their performance metrics do not differ significantly. There are cases, however, when a linear model simply will not do (see, e.g., the example in Section 3.8). We are not aware of any universal recipe for selecting a specific variable transformation in every possible instance. If general subject domain considerations give sufficient reason to suspect that predictive variables may influence each other, introducing interaction terms may improve model performance.
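In R, an interaction between two predictors is written in a model formula with the * (or :) operator. A minimal sketch on a toy data frame (the data frame and variable names here are purely illustrative and not part of the example below):

# "a * b" is shorthand for the main effects plus their interaction, i.e. a + b + a:b
df <- data.frame( y = c( 0, 1, 1, 0 ),
                  a = c( 1.2, 3.4, 5.6, 7.8 ),
                  b = factor( c( "M", "F", "F", "M" ) ) )
colnames( model.matrix( y ~ a * b, data = df ) )
# [1] "(Intercept)" "a"           "bM"          "a:bM"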

Table 3.21 illustrates a contrived example of hospitalization data for a hypothetical population of patients.

Table 3.21 Example: Hypothetical hospitalizations.

For each patient in Table 3.21, both age and sex were generated randomly, and only females no more than five years younger than the median age of the sample were hospitalized. Since the data is random by construction, we are at liberty to use the first 20 rows of the table for training and the remaining 10 rows for testing our models. The results of applying a strictly linear model of the form \(hospitalization \sim age + sex\) to the testing dataset are presented in Fig. 3.13.

Figure 3.13 AUC, \(F_1\) score and Matthews' correlation coefficient of the strictly linear logistic regression model for the hypothetical hospitalization example in Table 3.21.

The results of applying a linear model with an interaction term of the form \(hospitalization \sim age + sex + age \times sex\) to the testing dataset are presented in Fig. 3.14.

Not surprisingly, the performance of the model without the interaction term (ROC curve AUC of 0.78 in Fig. 3.13) is inferior to that of the model with the interaction term included (ROC curve AUC of 1.00 in Fig. 3.14), and the use of the more complicated model is justified.

Figure 3.14 AUC, \(F_1\) score and Matthews' correlation coefficient of the logistic regression model with a cross term for the hypothetical hospitalization example in Table 3.21.
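The figures above are produced by the helper auc.perf.base(), which is not defined in Listing 3.8. As a rough cross-check, the AUC comparison can be reproduced with the standard pROC package (an assumption; pROC is not used elsewhere here), reusing the train and test data frames built in the listing:

library( pROC )

# Fit a logistic model on the training set and return its test-set AUC
aucFor <- function( train, test, form ) {
  gm <- glm( form, data = train, family = 'binomial' )
  p <- predict( gm, test, type = 'response' )
  as.numeric( auc( roc( test$hospitalization, p, quiet = TRUE ) ) )
}

aucFor( train, test, hospitalization ~ age + sex )   # strictly linear model
aucFor( train, test, hospitalization ~ age * sex )   # interaction term included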

The code for generating Figs. 3.13 and 3.14 is presented in Listing 3.8.

# Fit a logistic regression on the training set and evaluate it on the testing set.
# 'sep' controls the formula: '+' gives main effects only, '*' adds the interaction term.
linMod <- function( train, test, inclCol, outCol, sep, main ) {
  form <- formula( paste( outCol, "~", paste( inclCol, collapse=sep ) ) )
  gm <- glm( form, data=train, family='binomial' )
  prediction <- predict( gm, test )

  # auc.perf.base() plots the ROC curve and reports AUC, the F1 score and
  # Matthews' correlation coefficient (it is not defined in this listing)
  perf <- auc.perf.base( prediction, test[, outCol], text=main )
}

# Generate a random sample of patients: age and sex are independent, and
# hospitalization depends on age for females only (i.e., an interaction)
set.seed(1)
n <- 30
id <- 1:n
gender <- rnorm( n )
age <- round( 70 + 10 * rnorm( n ) )
sex <- ifelse( gender <= 0, "M", "F" )
y <- ifelse( ( age - median( age ) ) * ifelse( sex == 'M', 0, 1 ) <= -5, 0, 1 )
data <- data.frame( ID=id, age=age, sex=sex, hospitalization=y )

# Use the first two thirds of the rows for training and the rest for testing
trainRows <- 1:round( n * 2 / 3 )
testRows <- ( max( trainRows ) + 1 ):n
test <- data[testRows, ]
train <- data[trainRows, ]

main <- "Hospitalization model performance, strictly linear structure"
linMod(train, test, c( "age", "sex" ),  "hospitalization", '+',  main )
main <- "Hospitalization model performance, interaction terms included"
linMod(train, test, c( "age", "sex" ),  "hospitalization", '*',  main )

write.csv( data, "./IntTermEx.csv" )

Listing 3.8: Example: Hypothetical hospitalizations.
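The \(F_1\) score and Matthews' correlation coefficient reported in Figs. 3.13 and 3.14 require a classification cutoff chosen inside auc.perf.base(). A minimal sketch of computing both metrics by hand, assuming a 0.5 cutoff on the predicted probabilities (the cutoff actually used by auc.perf.base() may differ):

# F1 score and Matthews' correlation coefficient at an assumed 0.5 probability cutoff
classMetrics <- function( actual, prob, cutoff = 0.5 ) {
  pred <- ifelse( prob >= cutoff, 1, 0 )
  tp <- sum( pred == 1 & actual == 1 )
  fp <- sum( pred == 1 & actual == 0 )
  tn <- sum( pred == 0 & actual == 0 )
  fn <- sum( pred == 0 & actual == 1 )
  f1 <- 2 * tp / ( 2 * tp + fp + fn )
  # MCC is undefined (NaN) if any margin of the confusion matrix is empty
  mcc <- ( tp * tn - fp * fn ) /
         sqrt( ( tp + fp ) * ( tp + fn ) * ( tn + fp ) * ( tn + fn ) )
  c( F1 = f1, MCC = mcc )
}

gm <- glm( hospitalization ~ age * sex, data = train, family = 'binomial' )
classMetrics( test$hospitalization, predict( gm, test, type = 'response' ) )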