From NorthShore Analytics

Outliers in the input data can be detected by examining the distribution of each independent variable. The following algorithm is suggested for detecting and eliminating outliers:

  1. sort the values of a predictor variable in ascending or descending order, depending on the nature of the variable;
  2. eliminate obvious outliers, e.g., negative costs or a blood pressure of 1000 mmHg, by replacing them with a predetermined fixed value (e.g., \(0\)) or a specified aggregate statistic of the distribution (e.g., the median);
  3. plot the histogram of the distribution and visually inspect it;
  4. if the parametric form of the distribution is known or can be inferred from theoretical or practical considerations, attempt to fit the distribution to its hypothesized shape and purge the “tails” (this can be done for both normal and non-normal cases);
  5. truncate the distribution if necessary (this should be considered a last resort).
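
The steps above can be sketched in Python for a single numeric predictor. This is a minimal illustration rather than a prescribed implementation: the function name `clean_predictor`, the plausibility bounds `lower` and `upper`, and the tail threshold `z_cutoff` are all assumptions introduced here. The sketch also assumes a normal shape for step 4; a different hypothesized distribution would need a different fit. Visual inspection of the histogram (step 3) and truncation (step 5) are left out.

```python
import statistics


def clean_predictor(values, lower, upper, z_cutoff=3.0):
    """Sketch of steps 1, 2 and 4 for one numeric predictor.

    lower, upper: hypothetical bounds outside which a value is an
                  obvious outlier (e.g., a negative cost).
    z_cutoff:     assumed tail threshold, in standard deviations,
                  under a fitted normal distribution.
    """
    # Step 1: sort the values in ascending order.
    ordered = sorted(values)

    # Step 2: replace obviously impossible values with an aggregate
    # statistic, here the median of the plausible values.
    plausible = [v for v in ordered if lower <= v <= upper]
    median = statistics.median(plausible)
    repaired = [v if lower <= v <= upper else median for v in ordered]

    # Step 4: fit a normal distribution (mean and standard deviation)
    # and purge values that fall in the tails.
    mean = statistics.fmean(repaired)
    stdev = statistics.stdev(repaired)
    if stdev == 0:
        return repaired  # all values identical, nothing to purge
    return [v for v in repaired if abs(v - mean) <= z_cutoff * stdev]


# Hypothetical systolic blood pressure readings in mmHg.
readings = [120, 118, 125, 130, 1000, 115, 122, -5, 128, 119]
cleaned = clean_predictor(readings, lower=40, upper=300)
print(cleaned)
```

Here the two impossible readings (1000 and -5) are replaced by the median of the remaining values before the tails are examined, so they do not distort the fitted mean and standard deviation.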