# Difference between revisions of "Assessing model coefficients"


[[File:Table_3-19.JPG|600px|center|'''Table 3.19''']]

[[File:Table_3-19b.JPG|600px|center|'''Table 3.19''']]


## Latest revision as of 16:33, 27 June 2016

The coefficients of a linear or logistic regression are computed using a variant of the normal equation (3.10). In reality, this relationship includes a random error component:

\( y = \sum_{i=0}^N a_i x_i + \epsilon \; ,\) | (3.55) |

\( x_0 = 1 \; ,\) | (3.56) |

where the intercept \(a_0\) has been incorporated into the general equation for convenience by virtue of (3.56). The coefficients \(\hat{a_i}\), obtained with the help of (3.55) - (3.56), are estimates, albeit *unbiased* [21]; the uncertainty in their calculation is implied by the random nature of \(\epsilon\). If we assume the normality of errors, \(\epsilon \sim \mathcal{N} (0, \sigma^2)\), the standard null hypotheses \(H_0(a_i) : a_i=0\) (so that the hypothesized value is \(a_{i0}=0\)) can be tested by computing the t-statistic

\( t_i = \frac{\hat{a_i} - a_{i0}}{s.e.(\hat{a_i})} \; , \; i=\overline{1,N} \; ,\) | (3.57) |

\( s.e.(\hat{a_i}) = \sqrt{\frac{MS_{Res}}{S_{xx}}} \; , \) | (3.58) |

\(MS_{Res} = \frac{1}{N-2}\sum_{i=1}^N \epsilon_i^2 \; , \) | (3.59) |

\( S_{xx} = \sum_{i=1}^N\left ( x_i - \overline{x} \right )^2 \; ,\) | (3.60) |

\( \overline{x} = \frac{1}{N} \sum_{i=1}^N x_i \; ,\) | (3.61) |

\(t_0 = \frac{\hat{a_0} - a_{00}}{s.e.(\hat{a_0})} \; ,\) | (3.62) |

\(s.e.(\hat{a_0}) = \sqrt{MS_{Res} \left ( \frac{1}{N} + \frac{\overline{x}^2}{S_{xx}} \right )} \; ,\) | (3.63) |

\( t_i \sim t_{N-2} \; .\) | (3.64) |

The significance of a coefficient is determined by the test statistic \(t_i\): the \(p\)-value is the probability, under the null hypothesis that the coefficient's distribution is centered at \(0\), of observing a statistic at least as extreme as \(t_i\). In view of (3.64), we can compute the \(p\)-values, compare them with the significance level \(\alpha\), and construct the usual \(100(1-\alpha)\%\) confidence intervals for \(a_i \; , i=\overline{1,N}\) as

\( a_i \in \left [ \hat{a_i} - t_{\frac{\alpha}{2}, N-2} \times s.e.(\hat{a_i}) , \hat{a_i} + t_{\frac{\alpha}{2}, N-2} \times s.e.(\hat{a_i}) \right ] \; .\) | (3.65) |
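The calculations in (3.57) - (3.65) can be illustrated with a minimal sketch for the single-predictor linear case, which is the setting where \(N-2\) degrees of freedom apply. The function name `simple_regression_inference`, the toy data, and the critical value 3.182 (the two-sided 5% quantile of the t-distribution with 3 degrees of freedom, taken from a standard t-table) are assumptions made for illustration only; this is not the procedure used to produce Table 3.19, which concerns a logistic model.

```python
import math


def simple_regression_inference(x, y, t_crit, a_null=0.0):
    """Estimates, standard errors, t-statistics and confidence intervals
    for y = a0 + a1 * x + eps, following (3.57) - (3.65).

    t_crit is the critical value t_{alpha/2, n-2}, supplied by the caller
    (for example from a t-table); it is not computed here.
    Assumes the fit is not perfect, so the standard errors are non-zero.
    """
    n = len(x)
    if n != len(y) or n < 3:
        raise ValueError("need at least 3 paired observations")

    x_bar = sum(x) / n                                   # (3.61)
    y_bar = sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)            # (3.60)
    if s_xx == 0:
        raise ValueError("x must not be constant")
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

    a1 = s_xy / s_xx                 # least-squares slope
    a0 = y_bar - a1 * x_bar          # least-squares intercept

    residuals = [yi - (a0 + a1 * xi) for xi, yi in zip(x, y)]
    ms_res = sum(e ** 2 for e in residuals) / (n - 2)    # (3.59)

    se_a1 = math.sqrt(ms_res / s_xx)                             # (3.58)
    se_a0 = math.sqrt(ms_res * (1.0 / n + x_bar ** 2 / s_xx))    # (3.63)

    def summarize(estimate, se):
        return {
            "estimate": estimate,
            "se": se,
            "t": (estimate - a_null) / se,                        # (3.57), (3.62)
            "ci": (estimate - t_crit * se, estimate + t_crit * se),  # (3.65)
        }

    return {"a0": summarize(a0, se_a0), "a1": summarize(a1, se_a1)}


# Toy data (hypothetical): five observations, so n - 2 = 3 degrees of freedom.
result = simple_regression_inference(
    x=[1, 2, 3, 4, 5],
    y=[2.1, 3.9, 6.2, 7.8, 10.1],
    t_crit=3.182,
)
for name, stats in result.items():
    significant = abs(stats["t"]) > 3.182
    print(name, stats, "significant at 5%:", significant)
```

On this toy data the slope's interval excludes 0 while the intercept's interval contains it, which mirrors how the confidence interval and the t-test lead to the same conclusion at a given \(\alpha\).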

In our ongoing COPD example, we can now finalize the set of predictive variables and create a model for testing and validation. Drawing upon the results presented in Table 3.18 and Section 3.7.1, we select the coefficients for model (3.18) based on the statistical significance of their odds ratios and subject matter knowledge, and calculate their statistics, which are presented in Table 3.19.

Table 3.19: Example: COPD logistic regression model coefficients and their statistics

As follows from Table 3.19, HOSP_365_DAYS_COPD_IND, ER_VISITS_365_DAYS_COPD_IND, SMOKER_IND, O2_IND and EJFR_IND have the most impact on the estimated probability of the outcome of interest and are statistically significantly different from 0. From the clinical perspective, this makes perfect sense. On the other hand, automatically removing from the model those variables that are not statistically significantly different from 0 may result in loss of information and is not generally recommended.