Predictive variable selection

From NorthShore Analytics

One of the most important steps in developing a robust and accurate predictive model is variable selection. It is not uncommon to start this process with a list of several hundred candidate predictors and eventually whittle it down to 10-20. While some sources advocate automated variable selection based on, e.g., the predictors' significance levels, others point out that “...a purely statistical solution is unrealistic. The role of scientific judgment cannot be overlooked.” [2]; see also [1]. Because a fully manual approach can be difficult to carry out when the number of variables is particularly large, an automated process such as backward selection may be used to augment, but not supplant, the researcher’s judgment; a standard R package, caret, is widely accepted for this purpose [16]. An algorithm for this process is outlined in Figure 3.7.

Figure 3.7. An algorithm for selecting predictive variables.
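
As a rough illustration of how automated backward selection can augment rather than replace the researcher's judgment, the sketch below repeatedly drops the predictor whose removal most improves a fit criterion, while never dropping predictors the researcher has chosen to keep. It is not the algorithm of Figure 3.7 and does not use caret: it is written in standard-library Python, and the function names, the `keep` parameter, the use of AIC from an ordinary least squares fit as the stopping criterion, and the synthetic data are all assumptions made for this example.

```python
import math
import random


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = s / m[i][i]
    return x


def aic(data, y, cols):
    """Gaussian AIC (up to an additive constant) of an OLS fit of y on an intercept plus cols."""
    n = len(y)
    X = [[1.0] + [data[c][i] for c in cols] for i in range(n)]
    p = len(cols) + 1
    xtx = [[sum(X[i][j] * X[i][k] for i in range(n)) for k in range(p)] for j in range(p)]
    xty = [sum(X[i][j] * y[i] for i in range(n)) for j in range(p)]
    beta = solve(xtx, xty)
    rss = sum((y[i] - sum(b * v for b, v in zip(beta, X[i]))) ** 2 for i in range(n))
    return n * math.log(rss / n) + 2 * p


def backward_select(data, y, keep=()):
    """Backward elimination by AIC; predictors named in `keep` are never dropped."""
    selected = list(data)
    best = aic(data, y, selected)
    while True:
        candidates = [c for c in selected if c not in keep]
        if not candidates:
            break
        # Score each candidate model obtained by removing one predictor.
        scores = {c: aic(data, y, [s for s in selected if s != c]) for c in candidates}
        drop = min(scores, key=scores.get)
        if scores[drop] >= best:  # no removal improves the criterion
            break
        selected.remove(drop)
        best = scores[drop]
    return selected


# Synthetic example (hypothetical variable names): y depends on x1 and x2 only.
rng = random.Random(0)
n = 200
data = {name: [rng.gauss(0, 1) for _ in range(n)]
        for name in ("x1", "x2", "site", "noise1", "noise2")}
y = [3 * data["x1"][i] - 2 * data["x2"][i] + rng.gauss(0, 1) for i in range(n)]

# "site" is retained on judgment even though it has no effect in this simulation.
chosen = backward_select(data, y, keep=("site",))
```

The `keep` argument is where judgment enters in this sketch: predictors that are clinically or scientifically important stay in the model regardless of the criterion, and the automated step only prunes the remainder. The resulting list is best treated as a starting point for review rather than a final model.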