# Survival modeling

A frequently asked question in healthcare analytics is: “What is the probability of survival for (at least) time \(t\) from now (\(t_0=0\)) of an individual with specific conditions?” or, conversely, “What is the expected survival time of a given individual?” Survival analysis is a form of regression that can help answer these questions. A standard procedure for evaluating survival probability and, to some extent, expected survival time is *Cox survival analysis* [12], [18], [26]. At its core is the semiparametric proportional hazards model.

In the following analysis, we assume that an outcome of interest represents an irreversible state transition (e.g., alive to dead). The probability of an event of interest occurring before time \(t\) is

\(P(t) = Pr(T \leq t) = \int_0^t p(x) dx\; ,\) | (3.26) |

where \(T\) is the time of the event and \(p(x)\) is the probability density of the (possibly unknown) distribution of such an event. The probability of an individual surviving until (at least) time \(t\) is termed the *survival function* and is the complement of \(P(t)\):

\(S(t) = Pr(T > t) = \int_t^{\infty} p(x) dx\; .\) | (3.27) |

The rate of arrival of outcomes at time \(t\), termed the *hazard rate*, is equal to the instantaneous probability of an event at time \(t\) conditional upon surviving until that time and can be calculated as^{[1]}

\(h(t) = \lim_{\Delta t \to 0} \frac{Pr( t \leq T < t + \Delta t \mid T \geq t)}{\Delta t} = \frac{dP(t)}{dt} \frac{1}{S(t)} = \frac{p(t)}{S(t)} \; .\) | (3.28) |
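The relationship between the density, the survival function, and the hazard rate can be checked numerically. The following is a minimal Python sketch that assumes a Weibull distribution for the event time; the distribution, its parameters, and all function names are illustrative choices and do not come from the text. It integrates the density as in (3.26), takes the complement as in (3.27), and compares the resulting hazard rate \(p(t)/S(t)\) with the known closed form for the Weibull case.

```python
import math

SHAPE, SCALE = 1.5, 10.0  # assumed Weibull parameters, chosen for illustration only


def pdf(t):
    """Density p(t) of the assumed Weibull distribution."""
    z = t / SCALE
    return (SHAPE / SCALE) * z ** (SHAPE - 1) * math.exp(-(z ** SHAPE))


def cumulative_probability(t, steps=20_000):
    """P(t): integral of p(x) from 0 to t, as in (3.26), by the midpoint rule."""
    dx = t / steps
    return sum(pdf((k + 0.5) * dx) for k in range(steps)) * dx


def survival(t):
    """S(t) = 1 - P(t), the complement stated in (3.27)."""
    return 1.0 - cumulative_probability(t)


def hazard(t):
    """h(t) = p(t) / S(t)."""
    return pdf(t) / survival(t)


t = 5.0
h_numeric = hazard(t)
# Closed-form Weibull hazard, used only as a cross-check.
h_closed = (SHAPE / SCALE) * (t / SCALE) ** (SHAPE - 1)

print(f"S({t}) = {survival(t):.4f}")
print(f"h({t}): numeric = {h_numeric:.5f}, closed form = {h_closed:.5f}")
```

With a shape parameter above 1, as assumed here, the hazard rate grows with time; a shape of exactly 1 reduces the distribution to the exponential case, for which the hazard rate is constant.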

In the Cox model, the hazard rate \(h(t)\) is regressed against a set of predictors \(x_i\) as

\(h(t) = h_0(t) e^{\sum_{i=1}^{N} b_i x_i } \; ,\) | (3.29) |

where \(b_i\) is the weight of \(x_i\), the \(i\)-th of \(N\) explanatory variables. For a population of \(M\) individuals, (3.29) can be rewritten as

\( h_i(t) = h_{i_0}(t) e^{\sum_{j=1}^{N} b_{j} x_{ij} } \; , \quad i = \overline{1:M}, \; j = \overline{1:N} \; .\) | (3.30) |

Dividing both sides of (3.30) by \(h_{i_0}(t)\) and taking the natural logarithm, we arrive at the equivalent of (3.18):

\( \ln \frac{h_i(t)}{h_{i_0}(t)} = \sum_{j=1}^{N} b_{j} x_{ij} \; , \quad i = \overline{1:M}, \; j = \overline{1:N} \; .\) | (3.31) |
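To make the log-linear form concrete, the sketch below evaluates the relative hazard of (3.30) and its logarithm from (3.31) for two hypothetical patients. The coefficient values, the two predictors (age in decades and a diabetes flag), and the function names are assumptions made purely for illustration. The comparison between the two patients additionally assumes that both share the same baseline hazard, so that it cancels in the ratio.

```python
import math

# Hypothetical coefficients b_j for N = 2 predictors:
# age in decades and a binary diabetes flag (illustrative values only).
coefficients = [0.30, 0.55]


def linear_predictor(x, b):
    """Right-hand side of (3.31): sum over j of b_j * x_ij."""
    return sum(b_j * x_j for b_j, x_j in zip(b, x))


def relative_hazard(x, b):
    """h_i(t) / h_{i_0}(t) from (3.30); it does not depend on t."""
    return math.exp(linear_predictor(x, b))


patient_a = [6.5, 1]  # 65 years old, diabetic
patient_b = [6.5, 0]  # 65 years old, not diabetic

# With a shared baseline hazard, the ratio of the two hazards reduces to
# exp(b_j) for the single predictor in which the patients differ.
hazard_ratio = relative_hazard(patient_a, coefficients) / relative_hazard(patient_b, coefficients)

print(f"log relative hazard, patient A: {linear_predictor(patient_a, coefficients):.3f}")
print(f"hazard ratio A vs. B: {hazard_ratio:.3f}")
```

Because the ratio is free of \(t\), the two hazard curves stay proportional over time, which is the property that gives the proportional hazards model its name.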

The form of \(h_{i_0}(t)\) is not formally specified; its shape is determined by the empirical data in the training dataset, giving rise to the nonparametric portion of the model.^{[2]} The coefficients \(b_j\) in (3.31) are estimated by maximizing the *partial likelihood function*, defined as

\(L_p = \prod_{i=1}^{M} \left[ \frac{e^{\sum_{k=1}^{N} b_k x_{ik}}}{\sum_{j=1}^{M} Y_{ij} e^{\sum_{k=1}^{N} b_k x_{jk}}} \right]^{\delta_i} \; , \quad Y_{ij} = \begin{cases} 0, & \text{if } t_j < t_i \; , \\ 1, & \text{otherwise} \; , \end{cases} \quad \delta_{i} = \begin{cases} 0, & \text{if the event did not occur at time } t_i \; , \\ 1, & \text{otherwise} \; . \end{cases}\) | (3.32) |
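The partial likelihood in (3.32) can be sketched in a few lines of Python. The example below is a minimal illustration: the six-individual dataset is synthetic, the function name is hypothetical, and there are no tied event times. For each observed event, the individual's own term is divided by the sum over everyone still at risk at that time (those with \(t_j \geq t_i\)); censored individuals contribute only through the risk sets. A coarse grid search stands in for the Newton-type iterations that statistical packages typically use.

```python
import math


def log_partial_likelihood(times, events, covariates, b):
    """Logarithm of the partial likelihood (3.32) for coefficient vector b.

    times[i]      -- observed time t_i of individual i
    events[i]     -- delta_i: 1 if the event occurred at t_i, 0 if censored
    covariates[i] -- predictor values of individual i
    """
    scores = [sum(b_k * x_k for b_k, x_k in zip(b, x)) for x in covariates]
    total = 0.0
    for i, (t_i, delta_i) in enumerate(zip(times, events)):
        if not delta_i:
            continue  # exponent delta_i = 0: the factor equals 1
        # Y_ij = 1 exactly for individuals j still at risk at time t_i.
        risk_sum = sum(math.exp(scores[j]) for j, t_j in enumerate(times) if t_j >= t_i)
        total += scores[i] - math.log(risk_sum)
    return total


# Synthetic data: one binary predictor, four events, two censored records.
times = [2.0, 3.0, 4.0, 6.0, 7.0, 9.0]
events = [1, 0, 1, 1, 1, 0]
covariates = [[1.0], [0.0], [0.0], [1.0], [0.0], [1.0]]

# Coarse grid search over a single coefficient (illustrative only).
grid = [k / 100 for k in range(-300, 301)]
best_b = max(grid, key=lambda b: log_partial_likelihood(times, events, covariates, [b]))

print(f"estimated coefficient: {best_b:.2f}")
print(f"log partial likelihood at the estimate: "
      f"{log_partial_likelihood(times, events, covariates, [best_b]):.4f}")
```

Note that the baseline hazard never appears in the computation: only the ordering of the event times and the predictor values enter the partial likelihood.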

A widely accepted standard for survival analysis in R is the `survival` package; a noteworthy extension of it that takes into account relative survival probability is `relsurv` [27].

Two important variables to consider in survival analysis are time since first diagnosis and age. The former reflects the individual’s “lifetime” measured with respect to others with the same group of conditions; the latter relates his or her expected risk of experiencing a negative outcome to that of the general population.

It is important to distinguish survival analysis, which is characterized by an impenetrable boundary between the sets with null and eventful outcomes, from renewal analysis, where such a boundary can be crossed. Clearly, the transition from alive to deceased can occur only once, whereas the transition between healthy and ill can occur multiple times. Renewal analysis is governed by a similar set of equations but is conceptually different from survival analysis.

An illustrative performance comparison between a regular logistic regression model and a Cox proportional hazards model used for predicting one-year mortality among heart failure patients is presented in Fig. 3.6.

As can be seen from Fig. 3.6, the AUC for the model in question is approximately 0.81. The attained maxima are approximately 0.45 for the \(F_1\) score and 0.4 for Matthews' correlation coefficient; however, those maxima are attained at approximately 45% of the total population for the logistic regression model and at 5% for the Cox proportional hazards model. This leads us to believe that, in this particular case, the latter achieves optimal accuracy for smaller population samples than the former, while the overall model accuracies are virtually identical.
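For readers who wish to trace how such curves are obtained, the following Python sketch computes the \(F_1\) score and Matthews' correlation coefficient as functions of the fraction of the population flagged as high risk. The ten risk scores and outcome labels are synthetic and the function names are hypothetical; the sketch does not reproduce the data or the results behind Fig. 3.6.

```python
import math


def confusion_counts(scores, labels, fraction_flagged):
    """Flag the top fraction of individuals by predicted risk; count outcomes."""
    n_flagged = round(fraction_flagged * len(scores))
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    flagged = set(order[:n_flagged])
    tp = sum(1 for i in flagged if labels[i] == 1)
    fp = len(flagged) - tp
    fn = sum(labels) - tp
    tn = len(labels) - tp - fp - fn
    return tp, fp, fn, tn


def f1_score(tp, fp, fn, tn):
    """Harmonic mean of precision and recall."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def matthews_cc(tp, fp, fn, tn):
    """Matthews' correlation coefficient; 0 when a marginal count is empty."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Synthetic predicted risks and observed outcomes (1 = event occurred).
scores = [0.92, 0.85, 0.77, 0.60, 0.55, 0.41, 0.33, 0.25, 0.18, 0.07]
labels = [1, 1, 0, 1, 0, 0, 0, 1, 0, 0]

for k in range(1, 10):
    fraction = k / 10
    counts = confusion_counts(scores, labels, fraction)
    print(f"flagged {fraction:.0%}: F1 = {f1_score(*counts):.3f}, "
          f"MCC = {matthews_cc(*counts):.3f}")
```

Scanning the printed values for their maxima gives the population fraction at which each metric peaks, which is the quantity compared between the two models above.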

The R code for performing predictive modeling in the example above can be found in Appendix D.