 # lifelines¶

lifelines is an implementation of survival analysis in Python. What benefits does lifelines offer over other survival analysis implementations?

• built on top of Pandas
• internal plotting methods
• simple and intuitive API
• only focus is survival analysis
• handles right, left and interval censored data

# Contents:¶

## Quickstart¶

### Installation¶

Install via pip:

pip install lifelines


### Kaplan-Meier, Nelson-Aalen, and parametric models¶

Note

For readers looking for an introduction to survival analysis, it’s recommended to start at Introduction to survival analysis

Let’s start by importing some data. We need the durations that individuals are observed for, and whether they “died” or not.

from lifelines.datasets import load_waltons
df = load_waltons() # returns a Pandas DataFrame

"""
    T  E    group
0   6  1  miR-137
1  13  1  miR-137
2  13  1  miR-137
3  13  1  miR-137
4  19  1  miR-137
"""

T = df['T']
E = df['E']


T is an array of durations, E is either a boolean or binary array representing whether the “death” was observed or not (alternatively, an individual can be censored). We will fit a Kaplan-Meier model to this, implemented as KaplanMeierFitter:

from lifelines import KaplanMeierFitter
kmf = KaplanMeierFitter()
kmf.fit(T, event_observed=E)  # or, more succinctly, kmf.fit(T, E)


After calling the fit() method, we have access to new properties like survival_function_ and methods like plot(). The latter is a wrapper around Pandas' internal plotting library.

kmf.survival_function_
kmf.cumulative_density_
kmf.plot_survival_function()  # or just kmf.plot()


Alternatively, you can plot the cumulative density function:

kmf.plot_cumulative_density()


By specifying the timeline keyword argument in fit(), we can change how the above models are indexed:

kmf.fit(T, E, timeline=range(0, 100, 2))

kmf.survival_function_   # index is now the same as range(0, 100, 2)
kmf.confidence_interval_ # index is now the same as range(0, 100, 2)


A useful summary stat is the median survival time, which represents when 50% of the population has died:

from lifelines.utils import median_survival_times

median_ = kmf.median_survival_time_
median_confidence_interval_ = median_survival_times(kmf.confidence_interval_)


Instead of the Kaplan-Meier estimator, you may be interested in a parametric model. lifelines has built-in parametric models: for example, the Weibull, Log-Normal, Log-Logistic, and more.

from lifelines import *
import matplotlib.pyplot as plt

fig, axes = plt.subplots(2, 3, figsize=(9, 5))

kmf = KaplanMeierFitter().fit(T, E, label='KaplanMeierFitter')
wbf = WeibullFitter().fit(T, E, label='WeibullFitter')
exf = ExponentialFitter().fit(T, E, label='ExponentialFitter')
lnf = LogNormalFitter().fit(T, E, label='LogNormalFitter')
llf = LogLogisticFitter().fit(T, E, label='LogLogisticFitter')
pwf = PiecewiseExponentialFitter([40, 60]).fit(T, E, label='PiecewiseExponentialFitter')
ggf = GeneralizedGammaFitter().fit(T, E, label='GeneralizedGammaFitter')

wbf.plot_survival_function(ax=axes[0][0])
exf.plot_survival_function(ax=axes[0][1])
lnf.plot_survival_function(ax=axes[0][2])
kmf.plot_survival_function(ax=axes[1][0])
llf.plot_survival_function(ax=axes[1][1])
pwf.plot_survival_function(ax=axes[1][2])
ggf.plot_survival_function(ax=axes[1][0])  # overlaid with the KM estimate


#### Multiple groups¶

groups = df['group']
ix = (groups == 'miR-137')

kmf.fit(T[~ix], E[~ix], label='control')
ax = kmf.plot()

kmf.fit(T[ix], E[ix], label='miR-137')
ax = kmf.plot(ax=ax)


Alternatively, for many more groups and more “pandas-esque”:

ax = plt.subplot(111)

kmf = KaplanMeierFitter()

for name, grouped_df in df.groupby('group'):
    kmf.fit(grouped_df["T"], grouped_df["E"], label=name)
    kmf.plot(ax=ax)


Similar functionality exists for the NelsonAalenFitter:

from lifelines import NelsonAalenFitter
naf = NelsonAalenFitter()
naf.fit(T, event_observed=E)


but instead of a survival_function_ being exposed, a cumulative_hazard_ is.

Note

Similar to Scikit-Learn, all statistically estimated quantities append an underscore to the property name.

Note

More detailed docs about estimating the survival function and cumulative hazard are available in Survival analysis with lifelines.

### Getting data in the right format¶

Often you’ll have data that looks like:

*start_time1*, *end_time1*
*start_time2*, *end_time2*
*start_time3*, None
*start_time4*, *end_time4*


lifelines has some utility functions to transform this dataset into duration and censoring vectors. The most common one is lifelines.utils.datetimes_to_durations().

from lifelines.utils import datetimes_to_durations

# start_times is a vector or list of datetime objects or datetime strings
# end_times is a vector or list of (possibly missing) datetime objects or datetime strings
T, E = datetimes_to_durations(start_times, end_times, freq='h')


Perhaps you are interested in viewing the survival table given some durations and censoring vectors. The function lifelines.utils.survival_table_from_events() will help with that:

from lifelines.utils import survival_table_from_events

table = survival_table_from_events(T, E)

"""
removed  observed  censored  entrance  at_risk
event_at
0               0         0         0       163      163
6               1         1         0         0      163
7               2         1         1         0      162
9               3         3         0         0      160
13              3         3         0         0      157
"""


### Survival regression¶

While the above KaplanMeierFitter model is useful, it only gives us an “average” view of the population. Often we have specific data at the individual level that we would like to use. For this, we turn to survival regression.

Note

More detailed documentation and tutorials are available in Survival Regression.

from lifelines.datasets import load_regression_dataset

regression_dataset = load_regression_dataset()

The input of the fit method’s API in a regression model is different. All the data, including durations, censored indicators and covariates, must be contained in a Pandas DataFrame. The duration column and event column are specified in the call to fit(). Below we model our regression dataset using the Cox proportional hazard model; full docs here.

from lifelines import CoxPHFitter

# Using Cox Proportional Hazards model
cph = CoxPHFitter()
cph.fit(regression_dataset, 'T', event_col='E')
cph.print_summary()

"""
<lifelines.CoxPHFitter: fitted with 200 observations, 11 censored>
duration col = 'T'
event col = 'E'
number of subjects = 200
number of events = 189
partial log-likelihood = -807.62
time fit was run = 2019-07-31 10:22:07 UTC

---
coef exp(coef)  se(coef)  coef lower 95%  coef upper 95% exp(coef) lower 95% exp(coef) upper 95%
var1  0.22      1.25      0.07            0.08            0.37                1.08                1.44
var2  0.05      1.05      0.08           -0.11            0.21                0.89                1.24
var3  0.22      1.24      0.08            0.07            0.37                1.07                1.44

z      p  -log2(p)
var1 2.99 <0.005      8.49
var2 0.61   0.54      0.89
var3 2.88 <0.005      7.97
---
Concordance = 0.58
Log-likelihood ratio test = 15.54 on 3 df, -log2(p)=9.47
"""

cph.plot()


The same dataset, but with a Weibull accelerated failure time model. This model has two parameters (see docs here), and we can choose to model both using our covariates or just one. Below we model just the scale parameter, lambda_.

from lifelines import WeibullAFTFitter

wft = WeibullAFTFitter()
wft.fit(regression_dataset, 'T', event_col='E')
wft.print_summary()

"""
<lifelines.WeibullAFTFitter: fitted with 200 observations, 11 censored>
event col = 'E'
number of subjects = 200
number of events = 189
log-likelihood = -504.48
time fit was run = 2019-07-31 10:19:07 UTC

---
coef exp(coef)  se(coef)  coef lower 95%  coef upper 95% exp(coef) lower 95% exp(coef) upper 95%
lambda_ var1       -0.08      0.92      0.02           -0.13           -0.04                0.88                0.97
var2       -0.02      0.98      0.03           -0.07            0.04                0.93                1.04
var3       -0.08      0.92      0.02           -0.13           -0.03                0.88                0.97
_intercept  2.53     12.57      0.05            2.43            2.63               11.41               13.85
rho_    _intercept  1.09      2.98      0.05            0.99            1.20                2.68                3.32

z      p  -log2(p)
lambda_ var1       -3.45 <0.005     10.78
var2       -0.56   0.57      0.80
var3       -3.33 <0.005     10.15
_intercept 51.12 <0.005       inf
rho_    _intercept 20.12 <0.005    296.66
---
Concordance = 0.58
Log-likelihood ratio test = 19.73 on 3 df, -log2(p)=12.34
"""

wft.plot()


Other AFT models are available as well, see here. An alternative regression model is Aalen’s Additive model, which has time-varying hazards:

# Using Aalen's Additive model
from lifelines import AalenAdditiveFitter

aaf = AalenAdditiveFitter()
aaf.fit(regression_dataset, 'T', event_col='E')


Along with CoxPHFitter and WeibullAFTFitter, after fitting you’ll have access to properties like cumulative_hazards_ and methods like plot, predict_cumulative_hazards, and predict_survival_function. The latter two methods require an additional argument of individual covariates:

X = regression_dataset.drop(['E', 'T'], axis=1)
aaf.predict_survival_function(X.iloc[10:12]).plot()  # get the unique survival functions of two subjects


Like the above estimators, there is also a built-in plotting method:

aaf.plot()


Note

More detailed documentation and tutorials are available in Survival Regression.

## Introduction to survival analysis¶

### Applications¶

Traditionally, survival analysis was developed to measure lifespans of individuals. An actuary or health professional would ask questions like “how long does this population live for?”, and answer it using survival analysis. For example, the population may be a nation’s population (for actuaries) or a population stricken by a disease (in the medical professional’s case). Traditionally, it was a rather morbid subject.

The analysis can be further applied to not just traditional births and deaths, but any duration. Medical professionals might be interested in the time between childbirths, where a birth in this case is the event of having a child, and a death is becoming pregnant again! (obviously, we are loose with our definitions of birth and death) Another example is users subscribing to a service: a birth is a user who joins the service, and a death is when the user leaves the service.

### Censoring¶

At the time you want to make inferences about durations, it is possible, and likely, that not all of the death events have occurred yet. For example, a medical professional will not wait 50 years for each individual in the study to pass away before investigating – he or she is interested in the effectiveness of improving lifetimes after only a few years, or possibly months.

The individuals in a population who have not been subject to the death event are labeled as right-censored, i.e., we did not (or cannot) view the rest of their life history due to some external circumstances. All the information we have on these individuals is their current lifetime durations (which are naturally less than their actual lifetimes).

Note

There is also left-censoring and interval censoring, which are expanded on later.

A common mistake data analysts make is choosing to ignore the right-censored individuals. We will see why this is a mistake next.

Consider a case where the population is actually made up of two subpopulations, $$A$$ and $$B$$. Population $$A$$ has a very small lifespan, say 2 months on average, and population $$B$$ enjoys a much larger lifespan, say 12 months on average. We may not know this distinction beforehand. At $$t=10$$, we wish to investigate the average lifespan for everyone.

In the figure below, the red lines denote the lifespan of individuals where the death event has been observed, and the blue lines denote the lifespan of the right-censored individuals (deaths have not been observed). If we are asked to estimate the average lifetime of our population, and we naively decided not to include the right-censored individuals, it is clear that we would be severely underestimating the true average lifespan.

from lifelines.plotting import plot_lifetimes
import numpy as np
from numpy.random import uniform, exponential

N = 25
CURRENT_TIME = 10

actual_lifetimes = np.array([
    exponential(12) if (uniform() < 0.5) else exponential(2) for i in range(N)
])
observed_lifetimes = np.minimum(actual_lifetimes, CURRENT_TIME)
death_observed = actual_lifetimes < CURRENT_TIME

ax = plot_lifetimes(observed_lifetimes, event_observed=death_observed)

ax.set_xlim(0, 25)
ax.vlines(10, 0, 30, lw=2, linestyles='--')
ax.set_xlabel("time")
ax.set_title("Births and deaths of our population, at $t=10$")
print("Observed lifetimes at time %d:\n" % CURRENT_TIME, observed_lifetimes)


Observed lifetimes at time 10:
[10.   1.1   8.   10.  3.43   0.63   6.28   1.03   2.37   6.17  10.
 0.21   2.71   1.25  10.   3.4  0.62   1.94   0.22   7.43   6.16  10.
 9.41  10.  10.]


Furthermore, if we instead simply took the mean of all observed lifespans, including the current lifespans of right-censored instances, we would still be underestimating the true average lifespan. Below we plot the actual lifetimes of all instances (recall we do not see this information at $$t=10$$).

ax = plot_lifetimes(actual_lifetimes, event_observed=death_observed)
ax.vlines(10, 0, 30, lw=2, linestyles='--')
ax.set_xlim(0, 25)


Survival analysis was originally developed to solve this type of problem: to deal with estimation when our data is right-censored. Even in the case where all events have been observed, i.e. no censoring, survival analysis is still a very useful tool for understanding durations.
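The bias from ignoring, or naively including, censored durations can be demonstrated numerically (a sketch with simulated exponential lifetimes, independent of the plots above; the true mean lifetime is 12):

```python
import numpy as np

rng = np.random.default_rng(0)
true_lifetimes = rng.exponential(12, size=100_000)   # true mean lifetime: 12

CURRENT_TIME = 10
observed = np.minimum(true_lifetimes, CURRENT_TIME)  # right-censor at t=10
uncensored = true_lifetimes < CURRENT_TIME

# dropping the censored individuals: conditioned on dying early, badly biased
print(observed[uncensored].mean())   # roughly 4.3, far below 12

# keeping everyone but treating censored durations as deaths: still biased
print(observed.mean())               # roughly 6.8, still below 12
```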

The observations need not always start at zero, either. This was done only for understanding in the above example. Consider the example where a customer entering a store is a birth: a customer can enter at any time, and not necessarily at time zero. In survival analysis, durations are relative: individuals may start at different times. (We actually only need the duration of the observation, and not necessarily the start and end time.)

We next introduce the three fundamental objects in survival analysis, the survival function, hazard function and the cumulative hazard function.

### Survival function¶

Let $$T$$ be a (possibly infinite, but always non-negative) random lifetime taken from the population under study. For example, the amount of time a couple is married. Or the time it takes a user to enter a webpage (an infinite time if they never do). The survival function - $$S(t)$$ - of a population is defined as

$S(t) = Pr( T > t)$

In plain English: the survival function defines the probability the death event has not occurred yet at time $$t$$, or equivalently, the probability of surviving past time $$t$$. Note the following properties of the survival function:

1. $$0 \le S(t) \le 1$$
2. $$F_T(t) = 1 - S(t)$$, where $$F_T(t)$$ is the CDF of $$T$$, which implies
3. $$S(t)$$ is a non-increasing function of $$t$$.
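These properties can be checked numerically for a known distribution (a sketch using an exponential lifetime with mean 10; scipy is an extra dependency here, not part of lifelines):

```python
import numpy as np
from scipy import stats

lifetime = stats.expon(scale=10)   # exponential lifetime, mean 10
t = np.linspace(0, 50, 501)
S = lifetime.sf(t)                 # survival function S(t) = Pr(T > t)

assert np.all((0 <= S) & (S <= 1))          # property 1
assert np.allclose(S, 1 - lifetime.cdf(t))  # property 2: S(t) = 1 - F_T(t)
assert np.all(np.diff(S) <= 0)              # property 3: non-increasing
```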

Here’s an example of a survival function: ### Hazard function¶

We are also interested in the probability of the death event occurring at time $$t$$, given that the death event has not occurred until time $$t$$. Mathematically, that is:

$\lim_{\delta t \rightarrow 0 } \; Pr( t \le T \le t + \delta t | T > t)$

This quantity goes to 0 as $$\delta t$$ shrinks, so we divide this by the interval $$\delta t$$ (like we might do in calculus). This defines the hazard function at time $$t$$, $$h(t)$$:

$h(t) = \lim_{\delta t \rightarrow 0 } \; \frac{Pr( t \le T \le t + \delta t | T > t)}{\delta t}$

It can be shown that this is equal to:

$h(t) = \frac{-S'(t)}{S(t)}$

and solving this differential equation (cool, it is a differential equation!), we get:

$S(t) = \exp\left( -\int_0^t h(z) \mathrm{d}z \right)$

The integral has a more common name: the cumulative hazard function, denoted $$H(t)$$. We can rewrite the above as:

$S(t) = \exp\left(-H(t) \right)$

What I love about the above equation is that it defines all survival functions. Notice that we can now speak either about the survival function, $$S(t)$$, or the cumulative hazard function, $$H(t)$$, and we can convert back and forth quite easily. The figure below maps the mathematical entities used in survival analysis and the transforms between them. Don’t panic: lifelines does this all for you.
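As a quick numerical check of $$S(t) = \exp(-H(t))$$: take a Weibull-type hazard with made-up parameters, integrate it numerically, and compare against the closed-form survival function (a sketch, not lifelines code):

```python
import numpy as np

lam, rho = 10.0, 2.0                         # made-up scale and shape
t = np.linspace(0, 30, 3001)
h = (rho / lam) * (t / lam) ** (rho - 1)     # hazard h(t)

# cumulative hazard H(t) by trapezoidal integration of h
H = np.concatenate([[0.0], np.cumsum((h[1:] + h[:-1]) / 2 * np.diff(t))])

S_from_H = np.exp(-H)                        # S(t) = exp(-H(t))
S_exact = np.exp(-(t / lam) ** rho)          # closed-form Weibull survival
print(np.abs(S_from_H - S_exact).max())      # essentially zero
```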

The two figures below represent the hazard and the cumulative hazard of the survival function in the figure above.

### Next steps¶

Of course, we do not observe the true survival function of a population. We must use the observed data to estimate it. There are many ways to estimate the survival function and the hazard functions, which brings us to estimation using lifelines.

## Estimating univariate models¶

In the previous section, we introduced the use of survival analysis, the need, and the mathematical objects on which it relies. In this article, we will work with real data and the lifelines library to estimate these mathematical objects.

### Estimating the survival function using Kaplan-Meier¶

For this example, we will be investigating the lifetimes of political leaders around the world. A political leader, in this case, is defined by a single individual’s time in office who controls the ruling regime. This political leader could be an elected president, unelected dictator, monarch, etc. The birth event is the start of the individual’s tenure, and the death event is the retirement of the individual. Censoring can occur if they are a) still in office at the time of dataset compilation (2008), or b) die while in power (this includes assassinations).

For example, the Bush regime began in 2000 and officially ended in 2008 upon his retirement, thus this regime’s lifespan was eight years, and there was a “death” event observed. On the other hand, the JFK regime lasted 2 years, from 1961 to 1963, and the regime’s official death event was not observed – JFK died before his official retirement.

(This is an example that has gladly redefined the birth and death events, and in fact completely flips the idea upside down by using deaths as the censoring event. This is also an example where the current time is not the only cause of censoring; there are alternative events, e.g. death in office, that can be the cause of censoring.)

To estimate the survival function, we first will use the Kaplan-Meier Estimate, defined:

$\hat{S}(t) = \prod_{t_i \lt t} \frac{n_i - d_i}{n_i}$

where $$d_i$$ is the number of death events at time $$t_i$$ and $$n_i$$ is the number of subjects at risk of death just prior to time $$t_i$$.
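The product can be computed by hand for a toy dataset (a sketch to illustrate the formula only; in practice lifelines' KaplanMeierFitter does this for you, with confidence intervals):

```python
import numpy as np
import pandas as pd

T = np.array([2, 3, 3, 5, 6, 8])   # toy durations
E = np.array([1, 1, 0, 1, 1, 0])   # 1 = death observed, 0 = censored

df = pd.DataFrame({'T': T, 'E': E})
d = df.groupby('T')['E'].sum()                      # d_i: deaths at each time
removed = df.groupby('T')['E'].size()               # deaths + censorings
n = len(df) - removed.cumsum().shift(fill_value=0)  # n_i: at risk just before

S_hat = ((n - d) / n).cumprod()   # running product of (n_i - d_i) / n_i
print(S_hat)
```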

Let’s bring in our dataset.

from lifelines.datasets import load_dd

data = load_dd()
data.head()


| democracy | regime | start_year | duration | observed | ctryname | cowcode2 | politycode | un_region_name | un_continent_name | ehead | leaderspellreg |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Non-democracy | Monarchy | 1946 | 7 | 1 | Afghanistan | 700 | 700 | Southern Asia | Asia | Mohammad Zahir Shah | Mohammad Zahir Shah.Afghanistan.1946.1952.Monarchy |
| Non-democracy | Civilian Dict | 1953 | 10 | 1 | Afghanistan | 700 | 700 | Southern Asia | Asia | Sardar Mohammad Daoud | Sardar Mohammad Daoud.Afghanistan.1953.1962.Civilian Dict |
| Non-democracy | Monarchy | 1963 | 10 | 1 | Afghanistan | 700 | 700 | Southern Asia | Asia | Mohammad Zahir Shah | Mohammad Zahir Shah.Afghanistan.1963.1972.Monarchy |
| Non-democracy | Civilian Dict | 1973 | 5 | 0 | Afghanistan | 700 | 700 | Southern Asia | Asia | Sardar Mohammad Daoud | Sardar Mohammad Daoud.Afghanistan.1973.1977.Civilian Dict |
| Non-democracy | Civilian Dict | 1978 | 1 | 0 | Afghanistan | 700 | 700 | Southern Asia | Asia | Nur Mohammad Taraki | Nur Mohammad Taraki.Afghanistan.1978.1978.Civilian Dict |

From the lifelines library, we’ll need the KaplanMeierFitter for this exercise:

from lifelines import KaplanMeierFitter
kmf = KaplanMeierFitter()


Note

Other ways to estimate the survival function in lifelines are discussed below.

For this estimation, we need the duration each leader was/has been in office, and whether or not they were observed to have left office (leaders who died in office, or were still in office in 2008, the latest date this data was recorded, do not have observed death events).

We next use the KaplanMeierFitter method fit() to fit the model to the data. (This is similar to, and inspired by, scikit-learn’s fit/predict API).

Below we fit our data with the KaplanMeierFitter:

T = data["duration"]
E = data["observed"]

kmf.fit(T, event_observed=E)


After calling the fit() method, the KaplanMeierFitter has a property called survival_function_ (again, we follow the styling of scikit-learn, and append an underscore to all properties that were computationally estimated). The property is a Pandas DataFrame, so we can call plot() on it:

kmf.survival_function_.plot()
plt.title('Survival function of political regimes');


How do we interpret this? The y-axis represents the probability a leader is still around after $$t$$ years, where $$t$$ years is on the x-axis. We see that very few leaders make it past 20 years in office. Of course, like all good stats, we need to report how uncertain we are about these point estimates, i.e., we need confidence intervals. They are computed in the call to fit(), and located under the confidence_interval_ property. (The method uses the exponential Greenwood confidence interval. The mathematics are found in these notes.)


Alternatively, we can call plot() on the KaplanMeierFitter itself to plot both the KM estimate and its confidence intervals:

kmf.plot()


The median time in office, which defines the point in time where on average 50% of the population has expired, is a property:

kmf.median_survival_time_
#   4.0


Interesting that it is only four years. That means, around the world, elected leaders have a 50% chance of cessation in four years or less! To get the confidence interval of the median, you can use:

from lifelines.utils import median_survival_times
median_ci = median_survival_times(kmf.confidence_interval_)


Let’s segment on democratic regimes vs non-democratic regimes. Calling plot on either the estimate itself or the fitter object will return an axis object that can be used for plotting further estimates:

ax = plt.subplot(111)

dem = (data["democracy"] == "Democracy")

kmf.fit(T[dem], event_observed=E[dem], label="Democratic Regimes")
kmf.plot(ax=ax)
kmf.fit(T[~dem], event_observed=E[~dem], label="Non-democratic Regimes")
kmf.plot(ax=ax)

plt.ylim(0, 1);
plt.title("Lifespans of different global regimes"); We might be interested in estimating the probabilities in between some points. We can do that with the timeline argument. We specify the times we are interested in and are returned a DataFrame with the probabilities of survival at those points:

ax = plt.subplot(111)

t = np.linspace(0, 50, 51)
kmf.fit(T[dem], event_observed=E[dem], timeline=t, label="Democratic Regimes")
ax = kmf.plot(ax=ax)
print("Median survival time of democratic:", kmf.median_)

kmf.fit(T[~dem], event_observed=E[~dem], timeline=t, label="Non-democratic Regimes")
ax = kmf.plot(ax=ax)
print("Median survival time of non-democratic:", kmf.median_)

plt.ylim(0, 1)
plt.title("Lifespans of different global regimes");

"""
Median survival time of democratic: 3
Median survival time of non-democratic: 6
""" It is incredible how much longer these non-democratic regimes exist for. A democratic regime does have a natural bias towards death though: both via elections and natural limits (the US imposes a strict eight-year limit). The median of a non-democratic is only about twice as large as a democratic regime, but the difference is apparent in the tails: if you’re a non-democratic leader, and you’ve made it past the 10 year mark, you probably have a long life ahead. Meanwhile, a democratic leader rarely makes it past ten years, and then have a very short lifetime past that.

Here the difference between survival functions is very obvious, and performing a statistical test seems pedantic. If the curves are more similar, or we possess less data, we may be interested in performing a statistical test. In this case, lifelines contains routines in lifelines.statistics to compare two survival functions. Below we demonstrate this routine. The function lifelines.statistics.logrank_test() is a common statistical test in survival analysis that compares two event series’ generators. If the value returned exceeds some pre-specified value, then we rule that the series have different generators.

from lifelines.statistics import logrank_test

results = logrank_test(T[dem], T[~dem], E[dem], E[~dem], alpha=.99)

results.print_summary()

"""

<lifelines.StatisticalResult>
t_0 = -1
null_distribution = chi squared
degrees_of_freedom = 1
alpha = 0.99

---
test_statistic      p  -log2(p)
260.47  <0.005    192.23
""""


There are alternative (and sometimes better) tests of survival functions, and we explain more here: Statistically compare two populations

Lets compare the different types of regimes present in the dataset:

regime_types = data['regime'].unique()

for i, regime_type in enumerate(regime_types):
    ax = plt.subplot(2, 3, i + 1)

    ix = data['regime'] == regime_type
    kmf.fit(T[ix], E[ix], label=regime_type)
    kmf.plot(ax=ax, legend=False)

    plt.title(regime_type)
    plt.xlim(0, 50)

    if i == 0:
        plt.ylabel('Frac. in power after $n$ years')

plt.tight_layout()


#### Getting data into the right format¶

lifelines’ data format is consistent across all estimator classes and functions: an array of individual durations, and the individuals’ event observations (if any). These are often denoted T and E respectively. For example:

T = [0,3,3,2,1,2]
E = [1,1,0,0,1,1]
kmf.fit(T, event_observed=E)


The raw data is not always available in this format – lifelines includes some helper functions to transform data formats to lifelines format. These are located in the lifelines.utils sub-library. For example, the function datetimes_to_durations() accepts an array or Pandas object of start times/dates, and an array or Pandas objects of end times/dates (or None if not observed):

from lifelines.utils import datetimes_to_durations

start_date = ['2013-10-10 0:00:00', '2013-10-09', '2013-10-10']
end_date = ['2013-10-13', '2013-10-10', None]
T, E = datetimes_to_durations(start_date, end_date, fill_date='2013-10-15')
print('T (durations): ', T)
print('E (event_observed): ', E)

T (durations):  [ 3.  1.  5.]
E (event_observed):  [ True  True False]


The function datetimes_to_durations() is very flexible, and has many keywords to tinker with.

### Estimating hazard rates using Nelson-Aalen¶

The survival function is a great way to summarize and visualize the survival dataset; however, it is not the only way. If we are curious about the hazard function $$h(t)$$ of a population, we unfortunately cannot transform the Kaplan-Meier estimate – statistics doesn’t work quite that well. Fortunately, there is a proper non-parametric estimator of the cumulative hazard function:

$H(t) = \int_0^t h(z) \;dz$

The estimator for this quantity is called the Nelson Aalen estimator:

$\hat{H}(t) = \sum_{t_i \le t} \frac{d_i}{n_i}$

where $$d_i$$ is the number of deaths at time $$t_i$$ and $$n_i$$ is the number of susceptible individuals.
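The sum can likewise be computed by hand on toy data (a sketch of the formula only; the NelsonAalenFitter below does the real work):

```python
import numpy as np
import pandas as pd

T = np.array([2, 3, 3, 5, 6, 8])   # toy durations
E = np.array([1, 1, 0, 1, 1, 0])   # 1 = death observed, 0 = censored

df = pd.DataFrame({'T': T, 'E': E})
d = df.groupby('T')['E'].sum()                      # d_i: deaths at each time
removed = df.groupby('T')['E'].size()
n = len(df) - removed.cumsum().shift(fill_value=0)  # n_i: at risk just before

H_hat = (d / n).cumsum()   # running sum of d_i / n_i
print(H_hat)
```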

In lifelines, this estimator is available as the NelsonAalenFitter. Let’s use the regime dataset from above:

T = data["duration"]
E = data["observed"]

from lifelines import NelsonAalenFitter
naf = NelsonAalenFitter()

naf.fit(T, event_observed=E)


After fitting, the class exposes the property cumulative_hazard_ as a DataFrame:

print(naf.cumulative_hazard_.head())
naf.plot()

   NA-estimate
0     0.000000
1     0.325912
2     0.507356
3     0.671251
4     0.869867

[5 rows x 1 columns]


The cumulative hazard is less intuitive to interpret than the survival function, but the hazard function is the basis of more advanced techniques in survival analysis. Recall that we are estimating the cumulative hazard function, $$H(t)$$. (Why? The sum of estimates is much more stable than the point-wise estimates.) Thus we know the rate of change of this curve is an estimate of the hazard function.

Looking at the figure above, it looks like the hazard starts off high and gets smaller (as seen by the decreasing rate of change). Let’s break the regimes down between democratic and non-democratic, during the first 20 years:

Note

We are using the loc argument in the call to plot here: it accepts a slice and plots only points within that slice.

naf.fit(T[dem], event_observed=E[dem], label="Democratic Regimes")
ax = naf.plot(loc=slice(0, 20))

naf.fit(T[~dem], event_observed=E[~dem], label="Non-democratic Regimes")
naf.plot(ax=ax, loc=slice(0, 20))

plt.title("Cumulative hazard function of different global regimes"); Looking at the rates of change, I would say that both political philosophies have a constant hazard, albeit democratic regimes have a much higher constant hazard.

#### Smoothing the hazard function¶

Interpretation of the cumulative hazard function can be difficult – it is not how we usually interpret functions. On the other hand, most survival analysis is done using the cumulative hazard function, so understanding it is recommended.

Alternatively, we can derive the more-interpretable hazard function, but there is a catch. The derivation involves a kernel smoother (to smooth out the differences of the cumulative hazard function), and this requires us to specify a bandwidth parameter that controls the amount of smoothing. This functionality is in the smoothed_hazard_() and smoothed_hazard_confidence_intervals_() methods. Why methods? They require an argument representing the bandwidth.

There is also a plot_hazard() function (that also requires a bandwidth keyword) that will plot the estimate plus the confidence intervals, similar to the traditional plot() functionality.

bandwidth = 3.

naf.fit(T[dem], event_observed=E[dem], label="Democratic Regimes")
ax = naf.plot_hazard(bandwidth=bandwidth)

naf.fit(T[~dem], event_observed=E[~dem], label="Non-democratic Regimes")
naf.plot_hazard(ax=ax, bandwidth=bandwidth)

plt.title("Hazard function of different global regimes | bandwidth=%.1f" % bandwidth);
plt.ylim(0, 0.4)
plt.xlim(0, 25);


It is more clear here which group has the higher hazard, and non-democratic regimes appear to have a constant hazard.

There is no obvious way to choose a bandwidth, and different bandwidths produce different inferences, so it’s best to be very careful here. My advice: stick with the cumulative hazard function.

bandwidth = 8.0

naf.fit(T[dem], event_observed=E[dem], label="Democratic Regimes")
ax = naf.plot_hazard(bandwidth=bandwidth)

naf.fit(T[~dem], event_observed=E[~dem], label="Non-democratic Regimes")
naf.plot_hazard(ax=ax, bandwidth=bandwidth)

plt.title("Hazard function of different global regimes | bandwidth=%.1f" % bandwidth); ### Estimating cumulative hazards using parametric models¶

#### Fitting to a Weibull model¶

Note

The parameterization of the Weibull and Exponential model changed in lifelines 0.19.0, released in Feb. 2019.

Another very popular model for survival data is the Weibull model. In contrast to the Nelson-Aalen estimator, this model is a parametric model, meaning it has a functional form with parameters that we fit the data to. (The Nelson-Aalen estimator has no parameters to fit.) The survival function looks like:

$S(t) = \exp\left(-\left(\frac{t}{\lambda}\right)^\rho\right), \lambda >0, \rho > 0,$

A priori, we do not know what $$\lambda$$ and $$\rho$$ are, but we use the data on hand to estimate these parameters. We model and estimate the cumulative hazard rate instead of the survival function (this is different than the Kaplan-Meier estimator):

$H(t) = \left(\frac{t}{\lambda}\right)^\rho$
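Two quick consequences of this form, checked numerically with made-up parameters (a sketch, not lifelines code): the hazard is the derivative of $$H(t)$$, and the median survival time has the closed form $$\lambda (\ln 2)^{1/\rho}$$:

```python
import numpy as np

lam, rho = 2.0, 1.5               # made-up lambda and rho
t = np.linspace(0, 10, 101)

H = (t / lam) ** rho              # Weibull cumulative hazard
S = np.exp(-H)                    # implied survival function S(t) = exp(-H(t))

# the hazard is the derivative of H; check against the closed form
h_numeric = np.gradient(H, t)
h_closed = (rho / lam) * (t / lam) ** (rho - 1)
print(np.allclose(h_numeric[5:], h_closed[5:], rtol=0.01))  # True

# median survival time: solve S(t) = 0.5
t_median = lam * np.log(2) ** (1 / rho)
print(np.exp(-(t_median / lam) ** rho))  # ≈ 0.5
```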

In lifelines, estimation is available using the WeibullFitter class. The plot() method will plot the cumulative hazard.

from lifelines import WeibullFitter
from lifelines.datasets import load_waltons

data = load_waltons()  # the Waltons dataset from the Quickstart
T = data['T']
E = data['E']

wf = WeibullFitter().fit(T, E)

wf.print_summary()
wf.plot()

"""
<lifelines.WeibullFitter: fitted with 163 observations, 7 censored>
number of subjects = 163
number of events = 156
log-likelihood = -672.062
hypothesis = lambda != 1, rho != 1

---
coef  se(coef)  lower 0.95  upper 0.95      p  -log2(p)
lambda_  0.02      0.00        0.02        0.02 <0.005       inf
rho_     3.45      0.24        2.97        3.93 <0.005     76.83
"""

#### Other parametric models: Exponential, Log-Logistic, Log-Normal and Splines¶

Similarly, there are other parametric models in lifelines. Generally, which parametric model to choose is determined by either knowledge of the distribution of durations, or some sort of model goodness-of-fit. Below are the built-in parametric models, and the Nelson-Aalen non-parametric model, of the same data.

from lifelines import (WeibullFitter, ExponentialFitter,
                       LogNormalFitter, LogLogisticFitter, NelsonAalenFitter,
                       PiecewiseExponentialFitter, GeneralizedGammaFitter, SplineFitter)

fig, axes = plt.subplots(3, 3, figsize=(10, 7.5))

T = data['T']
E = data['E']

wbf = WeibullFitter().fit(T, E, label='WeibullFitter')
exf = ExponentialFitter().fit(T, E, label='ExponentialFitter')
lnf = LogNormalFitter().fit(T, E, label='LogNormalFitter')
naf = NelsonAalenFitter().fit(T, E, label='NelsonAalenFitter')
llf = LogLogisticFitter().fit(T, E, label='LogLogisticFitter')
pwf = PiecewiseExponentialFitter([40, 60]).fit(T, E, label='PiecewiseExponentialFitter')
gg = GeneralizedGammaFitter().fit(T, E, label='GeneralizedGammaFitter')
spf = SplineFitter([6, 20, 40, 75]).fit(T, E, label='SplineFitter')

wbf.plot_cumulative_hazard(ax=axes[0][0])
exf.plot_cumulative_hazard(ax=axes[0][1])
lnf.plot_cumulative_hazard(ax=axes[0][2])
naf.plot_cumulative_hazard(ax=axes[1][0])
llf.plot_cumulative_hazard(ax=axes[1][1])
pwf.plot_cumulative_hazard(ax=axes[1][2])
gg.plot_cumulative_hazard(ax=axes[2][0])
spf.plot_cumulative_hazard(ax=axes[2][1])

lifelines can also be used to define your own parametric model. There is a tutorial on this available, see Piecewise Exponential Models and Creating Custom Models.

Parametric models can also be used to create and plot the survival function, too. Below we compare the parametric models versus the non-parametric Kaplan-Meier estimate:

from lifelines import KaplanMeierFitter

fig, axes = plt.subplots(3, 3, figsize=(10, 7.5))

T = data['T']
E = data['E']

kmf = KaplanMeierFitter().fit(T, E, label='KaplanMeierFitter')
wbf = WeibullFitter().fit(T, E, label='WeibullFitter')
exf = ExponentialFitter().fit(T, E, label='ExponentialFitter')
lnf = LogNormalFitter().fit(T, E, label='LogNormalFitter')
llf = LogLogisticFitter().fit(T, E, label='LogLogisticFitter')
pwf = PiecewiseExponentialFitter([40, 60]).fit(T, E, label='PiecewiseExponentialFitter')
gg = GeneralizedGammaFitter().fit(T, E, label='GeneralizedGammaFitter')
spf = SplineFitter([6, 20, 40, 75]).fit(T, E, label='SplineFitter')

wbf.plot_survival_function(ax=axes[0][0])
exf.plot_survival_function(ax=axes[0][1])
lnf.plot_survival_function(ax=axes[0][2])
kmf.plot_survival_function(ax=axes[1][0])
llf.plot_survival_function(ax=axes[1][1])
pwf.plot_survival_function(ax=axes[1][2])
gg.plot_survival_function(ax=axes[2][0])
spf.plot_survival_function(ax=axes[2][1])

With parametric models, we have a functional form that allows us to extend the survival function (or hazard or cumulative hazard) past our maximum observed duration. This is called extrapolation. We can do this in a few ways.

timeline = np.linspace(0, 100, 200)

# directly compute the survival function, these return a pandas Series
wbf = WeibullFitter().fit(T, E)
wbf.survival_function_at_times(timeline)
wbf.hazard_at_times(timeline)
wbf.cumulative_hazard_at_times(timeline)

# use the timeline kwarg in fit
# by default, all functions and properties will use
# these values provided
wbf = WeibullFitter().fit(T, E, timeline=timeline)

wbf.plot_survival_function()

#### Model Selection¶

When the underlying data generation distribution is unknown, we resort to measures of fit to tell us which model is most appropriate. lifelines provides qq-plots (see Selecting a parametric model using QQ plots) and tools to compare the AIC and other measures (see Selecting a parametric model using AIC).
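As a sketch of what an AIC comparison does, here is an illustrative calculation using the standard definition AIC = -2·log-likelihood + 2·(number of parameters); this is not a lifelines API, and the log-likelihoods are the ones reported for the exponential (1 parameter) and piecewise exponential (3 parameters) fits later in these docs:

```python
# AIC = -2 * log-likelihood + 2 * (number of fitted parameters); lower is better.
def aic(log_likelihood, n_params):
    return -2.0 * log_likelihood + 2.0 * n_params

aic_exponential = aic(-771.913, 1)
aic_piecewise = aic(-647.118, 3)

# the piecewise model has the lower AIC despite its extra parameters
assert aic_piecewise < aic_exponential
```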

### Other types of censoring¶

#### Left censored data and non-detection¶

We’ve mainly been focusing on right-censoring, which describes cases where we do not observe the death event. This situation is the most common one. Alternatively, there are situations where we do not observe the birth event occurring. Consider the case where a doctor sees a delayed onset of symptoms of an underlying disease. The doctor is unsure when the disease was contracted (birth), but knows it was before the discovery.

Another situation where we have left-censored data is when measurements have only an upper bound; that is, the measurement instruments could only detect that the value was below some threshold. This bound is often called the limit of detection (LOD). In practice, there could be more than one LOD. One very important statistical lesson: don’t “fill in” this value naively. It’s tempting to use something like one-half the LOD, but this will introduce lots of bias into downstream analysis. An example dataset is below:

Note

The recommended API for modeling left-censored data using parametric models changed in version 0.21.0. Below is the recommended API.

from lifelines.datasets import load_nh4
df = load_nh4()

"""
NH4.Orig.mg.per.L  NH4.mg.per.L  Censored
1            <0.006         0.006      True
2            <0.006         0.006      True
3             0.006         0.006     False
4             0.016         0.016     False
5            <0.006         0.006      True
"""


lifelines has support for left-censored datasets in most univariate models, including the KaplanMeierFitter class, by using the lifelines.fitters.kaplan_meier_fitter.KaplanMeierFitter.fit_left_censoring() method.

T, E = df['NH4.mg.per.L'], ~df['Censored']

kmf = KaplanMeierFitter()
kmf.fit_left_censoring(T, E)


Instead of producing a survival function, left-censored data analysis is more interested in the cumulative density function. This is available as the lifelines.fitters.kaplan_meier_fitter.KaplanMeierFitter.cumulative_density_ property after fitting the data.

print(kmf.cumulative_density_.head())

kmf.plot() #will plot the CDF
plt.xlabel("Concentration of NH_4")

"""
KM_estimate
timeline
0.000        0.379897
0.006        0.401002
0.007        0.464319
0.008        0.478828
0.009        0.536868
"""

Alternatively, you can use a parametric model to model the data. This allows you to “peer” below the LOD; however, using a parametric model means you need to correctly specify the distribution. You can use plots like qq-plots to help invalidate some distributions, see Selecting a parametric model using QQ plots and Selecting a parametric model using AIC.

from lifelines import *
from lifelines.plotting import qq_plot

fig, axes = plt.subplots(3, 2, figsize=(9, 9))
timeline = np.linspace(0, 0.25, 100)

wf = WeibullFitter().fit_left_censoring(T, E, label="Weibull", timeline=timeline)
lnf = LogNormalFitter().fit_left_censoring(T, E, label="Log Normal", timeline=timeline)
lgf = LogLogisticFitter().fit_left_censoring(T, E, label="Log Logistic", timeline=timeline)

# plot what we just fit, along with the KMF estimate
kmf.plot_cumulative_density(ax=axes[0][0], ci_show=False)
wf.plot_cumulative_density(ax=axes[0][0], ci_show=False)
qq_plot(wf, ax=axes[0][1])

kmf.plot_cumulative_density(ax=axes[1][0], ci_show=False)
lnf.plot_cumulative_density(ax=axes[1][0], ci_show=False)
qq_plot(lnf, ax=axes[1][1])

kmf.plot_cumulative_density(ax=axes[2][0], ci_show=False)
lgf.plot_cumulative_density(ax=axes[2][0], ci_show=False)
qq_plot(lgf, ax=axes[2][1])

Based on the above, the log-normal distribution seems to fit well, and the Weibull not very well at all.

#### Interval censored data¶

Data can also be interval censored. An example of this is periodically recording the population of micro-organisms as they die off. Their deaths are interval censored because you know a subject died between two observation periods. New in lifelines version 0.21.0, all parametric models have support for interval censored data.

Note

The API for fit_interval_censoring is different from that for right- and left-censored data.

from lifelines.datasets import load_diabetes

df = load_diabetes()
wf = WeibullFitter().fit_interval_censoring(lower_bound=df['left'], upper_bound=df['right'])


Another example of using lifelines for interval censored data is located here.

#### Left truncated (late entry) data¶

Another form of bias that is introduced into a dataset is called left-truncation (or late entry). Left-truncation can occur in many situations. One situation is when individuals may have the opportunity to die before entering into the study. For example, if you are measuring time to death of prisoners in prison, the prisoners will enter the study at different ages. So it’s possible there are some counter-factual individuals who would have entered into your study (that is, gone to prison), but instead died early.

All univariate fitters, like KaplanMeierFitter and any parametric models, have an optional argument for entry, which is an array of equal size to the duration array. It describes the time between actual “birth” (or “exposure”) to entering the study.

Note

Nothing changes in the duration array: it still measures time from “birth” to time exited study (either by death or censoring). That is, durations refers to the absolute death time rather than a duration relative to the study entry.

Another situation with left-truncation occurs when subjects are exposed before entry into the study. For example, a study of time to all-cause mortality of AIDS patients that recruited individuals previously diagnosed with AIDS, possibly years before. In our example below we will use a dataset like this, called the Multicenter AIDS Cohort Study. In the figure below, we plot the lifetimes of subjects. A solid line is when the subject was under our observation, and a dashed line represents the unobserved period between diagnosis and study entry. A solid dot at the end of the line represents death.

from lifelines.datasets import load_multicenter_aids_cohort_study
from lifelines.plotting import plot_lifetimes

df = load_multicenter_aids_cohort_study()

plot_lifetimes(
    df["T"] - df["W"],
    event_observed=df["D"],
    entry=df["W"],
    event_observed_color="#383838",
    event_censored_color="#383838",
    left_truncated=True,
)
plt.ylabel("Patient Number")
plt.xlabel("Years from AIDS diagnosis")

So subject #77, the subject at the top, was diagnosed with AIDS 7.5 years ago, but wasn’t in our study for the first 4.5 years. From this point of view, why can’t we “fill in” the dashed lines and say, for example, “subject #77 lived for 7.5 years”? If we did this, we would severely underestimate the chance of dying early on after diagnosis. Why? It’s possible that there were individuals who were diagnosed and then died shortly after, and never had a chance to enter our study. If we did manage to observe them, however, they would have depressed the survival function early on. Thus, “filling in” the dashed lines makes us overconfident about what occurs in the early period after diagnosis. We can see this below when we model the survival function with and without taking into account late entries.

from lifelines import KaplanMeierFitter

kmf = KaplanMeierFitter()
kmf.fit(df["T"], event_observed=df["D"], entry=df["W"], label='modeling late entries')
ax = kmf.plot()

kmf.fit(df["T"], event_observed=df["D"], label='ignoring late entries')
kmf.plot(ax=ax)

## Piecewise exponential models and creating custom models¶

This section will be easier if we recall our three mathematical “creatures” and the relationships between them. First is the survival function, $$S(t)$$, that represents the probability of living past some time, $$t$$. Next is the always non-negative and non-decreasing cumulative hazard function, $$H(t)$$. Its relation to $$S(t)$$ is:

$S(t) = \exp\left(-H(t)\right)$

Finally, the hazard function, $$h(t)$$, is the derivative of the cumulative hazard:

$h(t) = \frac{dH(t)}{dt}$

which has the immediate relation to the survival function:

$S(t) = \exp\left(-\int_{0}^t h(s) ds\right)$

Notice that any one of the three completely determines the other two. Some situations make it easier to define one versus the others. For example, in the Cox model, it’s easiest to work with the hazard, $$h(t)$$. In this section on parametric univariate models, it’ll be easiest to work with the cumulative hazard. This is because of an asymmetry in math: derivatives are much easier to compute than integrals. So, if we define the cumulative hazard, both the hazard and survival function are much easier to reason about, versus defining the hazard and asking questions about the other two.
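These identities are easy to check numerically. Below is a short sketch using an arbitrary (Weibull-like) cumulative hazard, recovering the survival function via $$S(t) = \exp(-H(t))$$ and the hazard via a numerical derivative:

```python
import numpy as np

t = np.linspace(0.1, 10, 1000)
H = (t / 2.0) ** 1.5               # an arbitrary, non-decreasing cumulative hazard
S = np.exp(-H)                     # S(t) = exp(-H(t))
h = np.gradient(H, t)              # h(t) = dH/dt, computed numerically

# h matches the analytic derivative (1.5/2) * (t/2)**0.5 away from the endpoints
h_exact = 0.75 * (t / 2.0) ** 0.5
assert np.allclose(h[1:-1], h_exact[1:-1], rtol=1e-3)
```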

First, let’s revisit some simpler parametric models.

### The Exponential model¶

Recall that the Exponential model has a constant hazard, that is:

$h(t) = \frac{1}{\lambda}$

which implies that the cumulative hazard, $$H(t)$$, has a pretty simple form: $$H(t) = \frac{t}{\lambda}$$. Below we fit this model to some survival data.


%matplotlib inline
%config InlineBackend.figure_format = 'retina'

from matplotlib import pyplot as plt
import numpy as np
import pandas as pd

from lifelines.datasets import load_waltons
waltons = load_waltons()
T, E = waltons['T'], waltons['E']

from lifelines import ExponentialFitter

fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(10, 4))

epf = ExponentialFitter().fit(T, E)
epf.plot_hazard(ax=ax[0])
epf.plot_cumulative_hazard(ax=ax[1])

ax[0].set_title("hazard"); ax[1].set_title("cumulative_hazard")

epf.print_summary(3)

model                     lifelines.ExponentialFitter
number of observations    163
number of events          156
log-likelihood            -771.913
hypothesis                lambda_ != 1

           coef  se(coef)  coef lower 95%  coef upper 95%       z        p  -log2(p)
lambda_  51.840     4.151          43.705          59.975  12.249  <0.0005   112.180

This model does a poor job of fitting to our data. If I fit a non-parametric model, like the Nelson-Aalen model, to this data, the Exponential’s lack of fit is very obvious.


from lifelines import NelsonAalenFitter

ax = epf.plot(figsize=(8,5))

naf = NelsonAalenFitter().fit(T, E)
ax = naf.plot(ax=ax)
plt.legend()



It should be clear that the single parameter model is just averaging the hazards over the entire time period. In reality though, the true hazard rate exhibits some complex non-linear behaviour.

### Piecewise Exponential models¶

What if we could break our model into different time periods, and fit an exponential model to each of those? For example, we define the hazard as:

$\begin{split}h(t) = \begin{cases} \lambda_0, & \text{if } t \le \tau_0 \\ \lambda_1, & \text{if } \tau_0 < t \le \tau_1 \\ \lambda_2, & \text{if } \tau_1 < t \le \tau_2 \\ \vdots \end{cases}\end{split}$

This model should be flexible enough to fit better to our dataset.

The cumulative hazard is only slightly more complicated, but not too much, and it can still be defined in Python. In lifelines, univariate models are constructed such that one only needs to define the cumulative hazard model with the parameters of interest, and all the hard work of fitting, creating confidence intervals, plotting, etc. is taken care of.
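For intuition, the piecewise cumulative hazard implied by the hazard above can be written in a few lines of plain numpy. This is a sketch of the math only, not lifelines’ internal implementation:

```python
import numpy as np

def piecewise_cumulative_hazard(t, breakpoints, lambdas):
    # hazard is lambda_i on the i-th interval, so the cumulative hazard
    # is the sum of (rate * time spent in that interval) up to time t
    edges = np.concatenate([[0.0], breakpoints, [np.inf]])
    return sum(
        lam * np.clip(np.minimum(t, hi) - lo, 0.0, None)
        for lam, lo, hi in zip(lambdas, edges[:-1], edges[1:])
    )

# two breakpoints at t=40 and t=60, three rates
H = piecewise_cumulative_hazard(np.array([20.0, 50.0, 70.0]), [40.0, 60.0], [0.01, 0.05, 0.2])
```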

For example, lifelines has implemented the PiecewiseExponentialFitter model. Internally, the code is a single function that defines the cumulative hazard. The user specifies where they believe the “breaks” are, and lifelines estimates the best $$\lambda_i$$.

from lifelines import PiecewiseExponentialFitter

# looking at the above plot, I think there may be breaks at t=40 and t=60.
pf = PiecewiseExponentialFitter(breakpoints=[40, 60]).fit(T, E)

fig, axs = plt.subplots(nrows=1, ncols=2, figsize=(10, 4))

ax = pf.plot(ax=axs[1])
pf.plot_hazard(ax=axs[0])

ax = naf.plot(ax=ax, ci_show=False)
axs[0].set_title("hazard"); axs[1].set_title("cumulative_hazard")

pf.print_summary(3)


model                     lifelines.PiecewiseExponentialFitter
number of observations    163
number of events          156
log-likelihood            -647.118
hypothesis                lambda_0_ != 1, lambda_1_ != 1, lambda_2_ != 1

              coef  se(coef)  coef lower 95%  coef upper 95%      z        p  -log2(p)
lambda_0_  162.989    27.144         109.787         216.191  5.968  <0.0005    28.630
lambda_1_   31.366     4.043          23.442          39.290  7.511  <0.0005    43.957
lambda_2_    4.686     0.624           3.462           5.910  5.902  <0.0005    28.055

We can see a much better fit with this model. A quantitative measure of fit is to compare the log-likelihood between the exponential model and the piecewise exponential model (higher is better): it improved from -772 to -647. We could keep going and add more and more breakpoints, but that would end up overfitting to the data.
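Since the exponential model is nested inside the piecewise model (set all the $$\lambda_i$$ equal), we can also sketch a likelihood-ratio comparison from the log-likelihoods above. This is an illustrative calculation, not a lifelines API; for 2 degrees of freedom the chi-squared survival function is exactly exp(-x/2), so the standard library suffices:

```python
import math

# log-likelihoods reported above
ll_exponential = -771.913   # 1 parameter
ll_piecewise = -647.118     # 3 parameters

lr_stat = 2 * (ll_piecewise - ll_exponential)   # likelihood-ratio statistic
# chi-squared survival function with df=2 is exp(-x/2)
p_value = math.exp(-lr_stat / 2)

# the improvement is overwhelmingly significant
assert lr_stat > 0 and p_value < 1e-50
```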

### Univariate models in lifelines¶

I mentioned that the PiecewiseExponentialFitter was implemented using only its cumulative hazard function. This is not a lie. lifelines has very general semantics for univariate fitters. For example, this is how the entire ExponentialFitter is implemented:

class ExponentialFitter(ParametricUnivariateFitter):

    _fitted_parameter_names = ["lambda_"]

    def _cumulative_hazard(self, params, times):
        lambda_ = params
        return times / lambda_


We only need to specify the cumulative hazard function because of the one-to-one relationships between the cumulative hazard function, the survival function, and the hazard rate. From there, lifelines handles the rest.

### Defining our own survival models¶

To show off the flexibility of lifelines univariate models, we’ll create a brand new, never-before-seen survival model. Looking at the Nelson-Aalen fit, the cumulative hazard looks like there might be an asymptote at $$t=80$$. This may correspond to an absolute upper limit on subjects’ lives. Let’s start with that functional form.

$H_1(t; \alpha) = \frac{\alpha}{(80 - t)}$

We subscript $$1$$ because we’ll investigate other models. In a lifelines univariate model, this is defined in the following code.

Important: in order to compute derivatives, you must use the numpy imported from the autograd library. This is a thin wrapper around the original numpy. Note the import autograd.numpy as np below.

from lifelines.fitters import ParametricUnivariateFitter

class InverseTimeHazardFitter(ParametricUnivariateFitter):

    # we tell the model what we want the names of the unknown parameters to be
    _fitted_parameter_names = ['alpha_']

    # this is the only function we need to define. It always takes two arguments:
    #   params: an iterable that unpacks the parameters you'll need in the order of _fitted_parameter_names
    #   times: a vector of times that will be passed in.
    def _cumulative_hazard(self, params, times):
        alpha = params
        return alpha / (80 - times)


itf = InverseTimeHazardFitter()
itf.fit(T, E)
itf.print_summary()

ax = itf.plot(figsize=(8,5))
ax = naf.plot(ax=ax, ci_show=False)
plt.legend()

model                     lifelines.InverseTimeHazardFitter
number of observations    163
number of events          156
log-likelihood            -697.84
hypothesis                alpha_ != 1

         coef  se(coef)  coef lower 95%  coef upper 95%      z       p  -log2(p)
alpha_  21.51      1.72           18.13           24.88  11.91  <0.005    106.22


The best fit of the model to the data is:

$H_1(t) = \frac{21.51}{80-t}$

Our choice of 80 as an asymptote was maybe mistaken, so let’s allow the asymptote to be another parameter:

$H_2(t; \alpha, \beta) = \frac{\alpha}{\beta-t}$

If we define the model this way, we need to add a bound to the values that $$\beta$$ can take. Obviously it can’t be smaller than or equal to the maximum observed duration. Generally, the cumulative hazard must be positive and non-decreasing. Otherwise the model fit will hit convergence problems.

class TwoParamInverseTimeHazardFitter(ParametricUnivariateFitter):

    _fitted_parameter_names = ['alpha_', 'beta_']

    # Sequence of (min, max) pairs for each element in x. None is used to specify no bound
    _bounds = [(0, None), (75.0001, None)]

    def _cumulative_hazard(self, params, times):
        alpha, beta = params
        return alpha / (beta - times)


two_f = TwoParamInverseTimeHazardFitter()
two_f.fit(T, E)
two_f.print_summary()

ax = itf.plot(ci_show=False, figsize=(8,5))
ax = naf.plot(ax=ax, ci_show=False)
two_f.plot(ax=ax)
plt.legend()

model                     lifelines.TwoParamInverseTimeHazardFitter
number of observations    163
number of events          156
log-likelihood            -685.57
hypothesis                alpha_ != 1, beta_ != 76.0001

         coef  se(coef)  coef lower 95%  coef upper 95%      z       p  -log2(p)
alpha_  16.50      1.51           13.55           19.46  10.28  <0.005     79.98
beta_   76.55      0.38           75.80           77.30   1.44    0.15      2.73


From the output, we see that the value of 76.55 is the suggested asymptote, that is:

$H_2(t) = \frac{16.50} {76.55 - t}$

The curve also appears to track the Nelson-Aalen model better. Let’s try one additional parameter, $$\gamma$$, some sort of measure of decay.

$H_3(t; \alpha, \beta, \gamma) = \frac{\alpha}{(\beta-t)^\gamma}$
from lifelines.fitters import ParametricUnivariateFitter

class ThreeParamInverseTimeHazardFitter(ParametricUnivariateFitter):

    _fitted_parameter_names = ['alpha_', 'beta_', 'gamma_']
    _bounds = [(0, None), (75.0001, None), (0, None)]

    # this is the only function we need to define. It always takes two arguments:
    #   params: an iterable that unpacks the parameters you'll need in the order of _fitted_parameter_names
    #   times: a numpy vector of times that will be passed in by the optimizer
    def _cumulative_hazard(self, params, times):
        a, b, c = params
        return a / (b - times) ** c


three_f = ThreeParamInverseTimeHazardFitter()
three_f.fit(T, E)
three_f.print_summary()

ax = itf.plot(ci_show=False, figsize=(8,5))
ax = naf.plot(ax=ax, ci_show=False)
ax = two_f.plot(ax=ax, ci_show=False)
ax = three_f.plot(ax=ax)
plt.legend()

model                     lifelines.ThreeParamInverseTimeHazardFitter
number of observations    163
number of events          156
log-likelihood            -649.38
hypothesis                alpha_ != 1, beta_ != 76.0001, gamma_ != 1

               coef    se(coef)  coef lower 95%  coef upper 95%     z       p  -log2(p)
alpha_  1588776.28  3775137.44     -5810357.13      8987909.70  0.42    0.67      0.57
beta_       100.88        5.88           89.35          112.41  4.23  <0.005     15.38
gamma_        3.83        0.50            2.85            4.81  5.64  <0.005     25.82


Our new asymptote is at $$t\approx 100, \text{c.i.}=(87, 112)$$. The model appears to fit the early times better than the previous models as well, however our $$\alpha$$ parameter has more uncertainty now. Continuing to add parameters isn’t advisable, as we will overfit to the data.

Why fit parametric models anyways? Taking a step back, we are fitting parametric models and comparing them to the non-parametric Nelson-Aalen. Why not just always use the Nelson-Aalen model?

1. Sometimes we have scientific motivations to use a parametric model. That is, using domain knowledge, we may know the system has a parametric model and we wish to fit to that model.
2. In a parametric model, we are borrowing information from all observations to determine the best parameters. To make this more clear, imagine taking a single observation and changing its value wildly. The fitted parameters would change as well. On the other hand, imagine doing the same for a non-parametric model. In this case, only the local survival function or hazard function would change. Because parametric models can borrow information from all observations, and there are far fewer unknowns than in a non-parametric model, parametric models are said to be more statistically efficient.
3. Extrapolation: non-parametric models are not easily extended to values outside the observed data. On the other hand, parametric models have no problem with this. However, extrapolation outside observed values is a very dangerous activity.
fig, axs = plt.subplots(3, figsize=(7, 8), sharex=True)

new_timeline = np.arange(0, 85)

three_f = ThreeParamInverseTimeHazardFitter().fit(T, E, timeline=new_timeline)

three_f.plot_hazard(label='hazard', ax=axs[0]).legend()
three_f.plot_cumulative_hazard(label='cumulative hazard', ax=axs[1]).legend()
three_f.plot_survival_function(label='survival function', ax=axs[2]).legend()

# Hide x labels and tick labels for all but bottom plot.
for ax in axs:
    ax.label_outer()



#### 3-parameter Weibull distribution¶

We can easily extend the built-in Weibull model (lifelines.WeibullFitter) to include a new location parameter:

$H(t) = \left(\frac{t - \theta}{\lambda}\right)^\rho$

(When $$\theta = 0$$, this is just the 2-parameter case again). In lifelines custom models, this looks like:

import autograd.numpy as np

# I'm shifting this to exaggerate the effect
T_ = T + 10

class ThreeParameterWeibullFitter(ParametricUnivariateFitter):

    _fitted_parameter_names = ["lambda_", "rho_", "theta_"]
    _bounds = [(0, None), (0, None), (0, T.min() - 0.001)]

    def _cumulative_hazard(self, params, times):
        lambda_, rho_, theta_ = params
        return ((times - theta_) / lambda_) ** rho_



tpw = ThreeParameterWeibullFitter()
tpw.fit(T_, E)
tpw.print_summary()
ax = tpw.plot_cumulative_hazard(figsize=(8,5))
ax = NelsonAalenFitter().fit(T_, E).plot(ax=ax, ci_show=False)


model                     lifelines.ThreeParameterWeibullFitter
number of observations    163
number of events          156
log-likelihood            -665.49
hypothesis                lambda_ != 1, rho_ != 1, theta_ != 2.9995

          coef  se(coef)  coef lower 95%  coef upper 95%      z       p  -log2(p)
lambda_  66.21      6.24           53.98           78.44  10.45  <0.005     82.52
rho_      4.42      0.62            3.20            5.64   5.49  <0.005     24.53
theta_    0.00      5.92          -11.59           11.59  -0.51    0.61      0.71

#### Inverse Gaussian distribution¶

The inverse Gaussian distribution is another popular model for survival analysis. Unlike other models, its hazard does not asymptotically converge to 0, allowing for a long tail of survival. Let’s model this, using the same parameterization from Wikipedia.

from autograd.scipy.stats import norm

class InverseGaussianFitter(ParametricUnivariateFitter):

    _fitted_parameter_names = ['lambda_', 'mu_']

    def _cumulative_density(self, params, times):
        lambda_, mu_ = params
        v = norm.cdf(np.sqrt(lambda_ / times) * (times / mu_ - 1), loc=0, scale=1) + \
            np.exp(2 * lambda_ / mu_) * norm.cdf(-np.sqrt(lambda_ / times) * (times / mu_ + 1), loc=0, scale=1)
        return v

    def _cumulative_hazard(self, params, times):
        return -np.log(1 - np.clip(self._cumulative_density(params, times), 1e-15, 1 - 1e-15))


igf = InverseGaussianFitter()
igf.fit(T, E)
igf.print_summary()
ax = igf.plot_cumulative_hazard(figsize=(8,5))
ax = NelsonAalenFitter().fit(T, E).plot(ax=ax, ci_show=False)


model                     lifelines.InverseGaussianFitter
number of observations    163
number of events          156
log-likelihood            -724.74
hypothesis                lambda_ != 1, mu_ != 1

           coef  se(coef)  coef lower 95%  coef upper 95%      z       p  -log2(p)
lambda_   50.88      2.31           46.35           55.41  21.58  <0.005    340.67
mu_      157.46     17.76          122.66          192.26   8.81  <0.005     59.49

#### Gompertz¶

class GompertzFitter(ParametricUnivariateFitter):

    # this parameterization is slightly different than Wikipedia's
    _fitted_parameter_names = ['nu_', 'b_']

    def _cumulative_hazard(self, params, times):
        nu_, b_ = params
        return nu_ * np.expm1(times * b_)


T, E = waltons['T'], waltons['E']

ggf = GompertzFitter()
ggf.fit(T, E)
ggf.print_summary()
ax = ggf.plot_cumulative_hazard(figsize=(8,5))
ax = NelsonAalenFitter().fit(T, E).plot(ax=ax, ci_show=False)

model                     lifelines.GompertzFitter
number of observations    163
number of events          156
log-likelihood            -650.60
hypothesis                nu_ != 1, b_ != 1

      coef  se(coef)  coef lower 95%  coef upper 95%        z       p  -log2(p)
nu_   0.01      0.00            0.00            0.02  -234.94  <0.005       inf
b_    0.08      0.01            0.07            0.09  -159.07  <0.005       inf

#### APGW¶

From the paper, “A Flexible Parametric Modelling Framework for Survival Analysis”, https://arxiv.org/pdf/1901.03212.pdf

class APGWFitter(ParametricUnivariateFitter):

    # this parameterization is slightly different than the paper's
    _fitted_parameter_names = ['kappa_', 'gamma_', 'phi_']

    def _cumulative_hazard(self, params, t):
        kappa_, gamma_, phi_ = params
        return (kappa_ + 1) / kappa_ * ((1 + ((phi_ * t) ** gamma_) / (kappa_ + 1)) ** kappa_ - 1)



apg = APGWFitter()
apg.fit(T, E)
apg.print_summary(2)
ax = apg.plot_cumulative_hazard(figsize=(8,5))
ax = NelsonAalenFitter().fit(T, E).plot(ax=ax, ci_show=False)


model                     lifelines.APGWFitter
number of observations    163
number of events          156
log-likelihood            -655.98
hypothesis                kappa_ != 1, gamma_ != 1, phi_ != 1

            coef   se(coef)  coef lower 95%  coef upper 95%         z       p  -log2(p)
kappa_  19602.21  407112.67      -778323.96       817528.38      0.05    0.96      0.06
gamma_      0.02       0.00            0.02            0.02  -3627.42  <0.005       inf
phi_        2.89       0.21            2.48            3.30      9.03  <0.005     62.34

#### Bounded lifetimes using the beta distribution¶

Maybe your data is bounded between 0 and some (unknown) upper bound M? That is, lifetimes can’t be more than M. Maybe you know M, maybe you don’t.


n = 100
T = 5 * np.random.random(n)**2
T_censor = 10 * np.random.random(n)**2
E = T < T_censor
T_obs = np.minimum(T, T_censor)


from autograd_gamma import betainc

class BetaFitter(ParametricUnivariateFitter):

    _fitted_parameter_names = ['alpha_', 'beta_', 'm_']
    _bounds = [(0, None), (0, None), (T.max(), None)]

    def _cumulative_density(self, params, times):
        alpha_, beta_, m_ = params
        return betainc(alpha_, beta_, times / m_)

    def _cumulative_hazard(self, params, times):
        return -np.log(1 - self._cumulative_density(params, times))


beta_fitter = BetaFitter().fit(T_obs, E)
beta_fitter.plot()
beta_fitter.print_summary()

/Users/camerondavidson-pilon/code/lifelines/lifelines/fitters/__init__.py:936: StatisticalWarning: The diagonal of the variance_matrix_ has negative values. This could be a problem with BetaFitter's fit to the data.

It's advisable to not trust the variances reported, and to be suspicious of the
fitted parameters too. Perform plots of the cumulative hazard to help understand
the latter's bias.

To fix this, try specifying an initial_point kwarg in fit.

warnings.warn(warning_text, utils.StatisticalWarning)
/Users/camerondavidson-pilon/code/lifelines/lifelines/fitters/__init__.py:460: RuntimeWarning: invalid value encountered in sqrt

model                     lifelines.BetaFitter
number of observations    100
number of events          64
log-likelihood            -79.87
hypothesis                alpha_ != 1, beta_ != 1, m_ != 5.92869

        coef  se(coef)  coef lower 95%  coef upper 95%      z       p  -log2(p)
alpha_  0.53      0.06            0.40            0.65  -7.34  <0.005     42.10
beta_   1.15       nan             nan             nan    nan     nan       nan
m_      4.93       nan             nan             nan    nan     nan       nan

## Discrete survival models¶

So far we have only been investigating continuous time survival models, where times can take on any positive value. If we want to consider discrete survival times (for example, over the positive integers), we need to make a small adjustment. With discrete survival models, there is a slightly more complicated relationship between the hazard and cumulative hazard, because there are two ways to define the cumulative hazard.

$H_1(t) = \sum_{t_i \le t} h(t_i)$
$H_2(t) = -\log(S(t))$

We also no longer have the relationship that $$h(t) = \frac{dH(t)}{dt}$$, since $$t$$ is no longer continuous. Instead, depending on which version of the cumulative hazard you choose to use (inference will be the same), we have to redefine the hazard function in lifelines.

$h(t) = H_1(t) - H_1(t-1)$
$h(t) = 1 - \exp(H_2(t) - H_2(t+1))$
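A quick numerical sketch of the second definition, starting from an example survival function over integer times, and checking it against the usual discrete hazard $$P(\text{die at } t+1 \mid \text{alive at } t) = 1 - S(t+1)/S(t)$$:

```python
import numpy as np

S = np.array([1.0, 0.8, 0.5, 0.2])   # survival at t = 0, 1, 2, 3

# version 2: h(t) = 1 - exp(H2(t) - H2(t+1)), with H2(t) = -log S(t)
H2 = -np.log(S)
h = 1 - np.exp(H2[:-1] - H2[1:])

# this is exactly the usual discrete hazard 1 - S(t+1)/S(t)
assert np.allclose(h, 1 - S[1:] / S[:-1])
```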

Here is an example of a discrete survival model (one that may not look like a survival model at first) where we use a redefined _hazard function.

Looking for more examples of what you can build? See other unique survival models in the docs on time-lagged survival.

## Time-lagged conversion rates and cure models¶

Suppose in our population we have a subpopulation that will never experience the event of interest. Or, for some subjects the event will occur so far in the future that it’s essentially at time infinity. The survival function should not asymptotically approach zero, but some positive value. Models that describe this are sometimes called cure models or time-lagged conversion models.

There’s a serious fault in using parametric models for these types of problems that non-parametric models don’t have. The most common parametric models like Weibull, Log-Normal, etc. all have strictly increasing cumulative hazard functions, which means the corresponding survival function will always converge to 0.

Let’s look at an example of this problem. Below I generated some data that has individuals who will never experience the event, no matter how long we wait.


%matplotlib inline
%config InlineBackend.figure_format = 'retina'

from matplotlib import pyplot as plt
import numpy as np
import pandas as pd
from scipy.special import logit
plt.style.use('bmh')



N = 200
U = np.random.rand(N)
T = -(logit(-np.log(U) / 0.5) - np.random.exponential(2, N) - 6.00) / 0.50

E = ~np.isnan(T)
T[np.isnan(T)] = 50


from lifelines import KaplanMeierFitter
kmf = KaplanMeierFitter().fit(T, E)
kmf.plot(figsize=(8,4))
plt.ylim(0, 1);


It should be clear that there is an asymptote at around 0.6. The non-parametric model will always show this. If this is true, then the cumulative hazard function should have a horizontal asymptote as well. Let’s use the Nelson-Aalen model to see this.


from lifelines import NelsonAalenFitter

naf = NelsonAalenFitter().fit(T, E)
naf.plot(figsize=(8,4))



However, when we try a parametric model, we will see that it won’t extrapolate very well. Let’s use the flexible two parameter LogLogisticFitter model.


from lifelines import LogLogisticFitter

fig, ax = plt.subplots(nrows=2, ncols=2, figsize=(10, 6))

t = np.linspace(0, 40)
llf = LogLogisticFitter().fit(T, E, timeline=t)

llf.plot_survival_function(ax=ax[0][0])
kmf.plot(ax=ax[0][0])

llf.plot_cumulative_hazard(ax=ax[0][1])
naf.plot(ax=ax[0][1])

t = np.linspace(0, 100)
llf = LogLogisticFitter().fit(T, E, timeline=t)

llf.plot_survival_function(ax=ax[1][0])
kmf.plot(ax=ax[1][0])

llf.plot_cumulative_hazard(ax=ax[1][1])
naf.plot(ax=ax[1][1])



The LogLogistic model does quite a terrible job of capturing not only the asymptotes, but the intermediate values as well. If we extended the survival function out further, we would see it eventually approach 0.

### Custom parametric models to handle asymptotes¶

Focusing on modeling the cumulative hazard function, what we would like is a function that increases up to a limit and then tapers off to an asymptote. We can think long and hard about these (I did), but there’s a family of functions that have this property that we are very familiar with: cumulative distribution functions! By their nature, they asymptotically approach 1. And, they are readily available in the SciPy and autograd libraries. So our new model of the cumulative hazard function is:

$H(t; c, \theta) = c F(t; \theta)$

where $$c$$ is the (unknown) horizontal asymptote, and $$\theta$$ is a vector of (unknown) parameters for the CDF, $$F$$.
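Since the survival function is related to the cumulative hazard by $$S(t) = \exp(-H(t))$$, this choice gives exactly the behavior we want:

$S(t; c, \theta) = \exp\left(-c F(t; \theta)\right) \longrightarrow \exp(-c) \;\; \text{as } t \to \infty$

so the survival function levels off at the positive value $$\exp(-c)$$ instead of dropping to zero.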

We can create a custom cumulative hazard model using ParametricUnivariateFitter (for a tutorial on how to create custom models, see this here). Let’s choose the Normal CDF for our model.

Remember we must use the imports from autograd for this, i.e. from autograd.scipy.stats import norm.


from autograd.scipy.stats import norm
from lifelines.fitters import ParametricUnivariateFitter

class UpperAsymptoteFitter(ParametricUnivariateFitter):

    _fitted_parameter_names = ["c_", "mu_", "sigma_"]

    _bounds = ((0, None), (None, None), (0, None))

    def _cumulative_hazard(self, params, times):
        c, mu, sigma = params
        return c * norm.cdf((times - mu) / sigma, loc=0, scale=1)


uaf = UpperAsymptoteFitter().fit(T, E)
uaf.print_summary(3)
uaf.plot(figsize=(8,4))

model                      lifelines.UpperAsymptoteFitter
number of observations     200
number of events observed  73
log-likelihood             -350.868
hypothesis                 c_ != 1, mu_ != 0, sigma_ != 1

          coef  se(coef)  coef lower 95%  coef upper 95%        z        p  -log2(p)
c_       0.456     0.054           0.350           0.561  -10.121  <0.0005    77.577
mu_     17.594     0.582          16.454          18.735   30.236  <0.0005   664.709
sigma_   4.923     0.408           4.123           5.724    9.608  <0.0005    70.194


We get a lovely asymptotic cumulative hazard. The summary table suggests that the best value of $$c$$ is 0.456. This can be translated into the survival function asymptote by $$\exp(-0.456) \approx 0.63$$.

Let’s compare this fit to the non-parametric models.


fig, ax = plt.subplots(nrows=2, ncols=2, figsize=(10, 6))

t = np.linspace(0, 40)
uaf = UpperAsymptoteFitter().fit(T, E, timeline=t)

uaf.plot_survival_function(ax=ax[0][0])
kmf.plot(ax=ax[0][0])

uaf.plot_cumulative_hazard(ax=ax[0][1])
naf.plot(ax=ax[0][1])

t = np.linspace(0, 100)
uaf = UpperAsymptoteFitter().fit(T, E, timeline=t)
uaf.plot_survival_function(ax=ax[1][0])
kmf.survival_function_.plot(ax=ax[1][0])

uaf.plot_cumulative_hazard(ax=ax[1][1])
naf.plot(ax=ax[1][1])



I wasn’t expecting this good of a fit, but there it is. This was artificial data, but let’s try this technique on some real-life data.


from lifelines.datasets import load_leukemia, load_kidney_transplant

df = load_leukemia()
T, E = df['t'], df['status']

uaf.fit(T, E)
ax = uaf.plot_survival_function(figsize=(8,4))
uaf.print_summary()

kmf.fit(T, E).plot(ax=ax, ci_show=False)
print("---")
print("Estimated lower bound: {:.2f} ({:.2f}, {:.2f})".format(
    np.exp(-uaf.summary.loc['c_', 'coef']),
    np.exp(-uaf.summary.loc['c_', 'coef upper 95%']),
    np.exp(-uaf.summary.loc['c_', 'coef lower 95%']),
    )
)

model                      lifelines.UpperAsymptoteFitter
number of observations     42
number of events observed  30
log-likelihood             -118.60
hypothesis                 c_ != 1, mu_ != 0, sigma_ != 1

         coef  se(coef)  coef lower 95%  coef upper 95%     z       p  -log2(p)
c_       1.63      0.36            0.94            2.33  1.78    0.07      3.75
mu_     13.44      1.73           10.06           16.82  7.79  <0.005     47.07
sigma_   7.03      1.07            4.94            9.12  5.65  <0.005     25.91
---
Estimated lower bound: 0.20 (0.10, 0.39)


So we might expect that about 20% will not have the event in this population (but note the large CI bounds too!).


# Another, less obvious, dataset.
df = load_kidney_transplant()
T, E = df['time'], df['death']

uaf.fit(T, E)
ax = uaf.plot_survival_function(figsize=(8,4))
uaf.print_summary()

kmf.fit(T, E).plot(ax=ax)
print("---")
print("Estimated lower bound: {:.2f} ({:.2f}, {:.2f})".format(
    np.exp(-uaf.summary.loc['c_', 'coef']),
    np.exp(-uaf.summary.loc['c_', 'coef upper 95%']),
    np.exp(-uaf.summary.loc['c_', 'coef lower 95%']),
    )
)

model                      lifelines.UpperAsymptoteFitter
number of observations     863
number of events observed  140
log-likelihood             -1458.88
hypothesis                 c_ != 1, mu_ != 0, sigma_ != 1

          coef  se(coef)  coef lower 95%  coef upper 95%       z       p  -log2(p)
c_        0.29      0.03            0.24            0.35  -24.38  <0.005    433.78
mu_    1139.66    101.52          940.68         1338.63   11.23  <0.005     94.73
sigma_  872.26     66.24          742.44         1002.08   13.15  <0.005    128.86
---
Estimated lower bound: 0.75 (0.70, 0.79)


#### Using alternative functional forms¶

An even simpler model might look like $$c \left(1 - \frac{1}{\lambda t + 1}\right)$$; however, this model cannot handle any “inflection points” like our artificial dataset above has. Still, it works well for the dataset here.

With all cure models, one important feature is the ability to extrapolate to unforeseen times.
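The asymptote of this simpler functional form is easy to check numerically. A minimal pure-Python sketch (the parameter values below are hypothetical, chosen to be of the same magnitude as the fit further down):

```python
import math

def simple_cumulative_hazard(t, c, lam):
    # H(t) = c * (1 - 1/(lam*t + 1)): starts at 0 and tapers
    # off toward the horizontal asymptote c as t grows.
    return c * (1.0 - 1.0 / (lam * t + 1.0))

def survival(t, c, lam):
    # S(t) = exp(-H(t)) levels off at exp(-c) instead of 0.
    return math.exp(-simple_cumulative_hazard(t, c, lam))

c, lam = 0.43, 0.0006  # hypothetical parameter values
print(survival(0, c, lam))    # 1.0
print(survival(1e9, c, lam))  # close to exp(-c) ≈ 0.65
```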


from lifelines.fitters import ParametricUnivariateFitter

class SimpleUpperAsymptoteFitter(ParametricUnivariateFitter):

    _fitted_parameter_names = ["c_", "lambda_"]

    _bounds = ((0, None), (0, None))

    def _cumulative_hazard(self, params, times):
        c, lambda_ = params
        return c * (1 - 1 / (lambda_ * times + 1))


# The same dataset, with the simpler model.

saf = SimpleUpperAsymptoteFitter().fit(T, E, timeline=np.arange(1, 10000))
ax = saf.plot_survival_function(figsize=(8,4))
saf.print_summary(4)

kmf.fit(T, E).plot(ax=ax)
print("---")
print("Estimated lower bound: {:.2f} ({:.2f}, {:.2f})".format(
    np.exp(-saf.summary.loc['c_', 'coef']),
    np.exp(-saf.summary.loc['c_', 'coef upper 95%']),
    np.exp(-saf.summary.loc['c_', 'coef lower 95%']),
    )
)

model                      lifelines.SimpleUpperAsymptoteFitter
number of observations     863
number of events observed  140
log-likelihood             -1392.1610
hypothesis                 c_ != 1, lambda_ != 1

           coef  se(coef)  coef lower 95%  coef upper 95%           z       p  -log2(p)
c_       0.4252    0.0717          0.2847          0.5658     -8.0154  <5e-05   49.6941
lambda_  0.0006    0.0002          0.0003          0.0009  -5982.8361  <5e-05       inf
---
Estimated lower bound: 0.65 (0.57, 0.75)

## Survival regression¶

Often we have additional data aside from the duration that we want to use. The technique is called survival regression – the name implies we regress covariates (e.g., age, country, etc.) against another variable – in this case durations. Similar to the logic in the first part of this tutorial, we cannot use traditional methods like linear regression because of censoring.

There are a few popular models in survival regression: Cox’s model, accelerated failure time models, and Aalen’s additive model. All models attempt to represent the hazard rate $$h(t | x)$$ as a function of $$t$$ and some covariates $$x$$. We explore these models next.

### The dataset for regression¶

The dataset required for survival regression must be in the format of a Pandas DataFrame. Each row of the DataFrame should be an observation. There should be a column denoting the durations of the observations. There may be a column denoting the event status of each observation (1 if event occurred, 0 if censored). There are also the additional covariates you wish to regress against. Optionally, there could be columns in the DataFrame that are used for stratification, weights, and clusters which will be discussed later in this tutorial.
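For illustration, a minimal hand-built DataFrame in this format might look like the following (the column names here are hypothetical, not required by lifelines):

```python
import pandas as pd

# One row per observation: a duration column, an event column
# (1 = event occurred, 0 = censored), and covariates to regress against.
df = pd.DataFrame({
    'T':   [5, 3, 9, 8],      # durations
    'E':   [1, 0, 1, 1],      # event indicators
    'age': [40, 51, 29, 62],  # a covariate
})
```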

An example dataset we will use is the Rossi recidivism dataset, available in lifelines as load_rossi().

from lifelines.datasets import load_rossi

rossi_dataset = load_rossi()

"""
week  arrest  fin  age  race  wexp  mar  paro  prio
0      20       1    0   27     1     0    0     1     3
1      17       1    0   18     1     0    0     1     8
2      25       1    0   19     0     1    0     1    13
3      52       0    1   23     1     1    1     1     1
"""


The DataFrame rossi_dataset contains 432 observations. The week column is the duration, the arrest column denotes whether the event (an arrest) occurred, and the other columns represent variables we wish to regress against.

If you need to first clean or transform your dataset (encode categorical variables, add interaction terms, etc.), that should happen before using lifelines. Libraries like Pandas and Patsy help with that.
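For example, a categorical column can be one-hot encoded with Pandas before handing the DataFrame to lifelines. A small sketch (the column names are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({
    'T': [5, 3, 9],
    'E': [1, 1, 0],
    'country': ['CA', 'US', 'CA'],
})

# Encode the categorical column; drop_first avoids a redundant
# (collinear) dummy column.
df = pd.get_dummies(df, columns=['country'], drop_first=True)
print(df.columns.tolist())  # ['T', 'E', 'country_US']
```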

### Cox’s proportional hazard model¶

The idea behind Cox’s proportional hazard model model is that the log-hazard of an individual is a linear function of their static covariates and a population-level baseline hazard that changes over time. Mathematically:

$\underbrace{h(t | x)}_{\text{hazard}} = \overbrace{b_0(t)}^{\text{baseline hazard}} \underbrace{\exp \overbrace{\left(\sum_{i=1}^n b_i (x_i - \overline{x_i})\right)}^{\text{log-partial hazard}}}_ {\text{partial hazard}}$

Note a few facts about this model: the only time component is in the baseline hazard, $$b_0(t)$$. In the above product, the partial hazard is a time-invariant scalar factor that only increases or decreases the baseline hazard. Thus changes in covariates will only scale the baseline hazard up or down.

Note

In other regression models, a column of 1s might be added to represent the intercept or baseline. This is not necessary in the Cox model. In fact, there is no intercept in the additive Cox model: the baseline hazard represents this. lifelines will throw warnings and may experience convergence errors if a column of 1s is present in your dataset.

#### Fitting the regression¶

The implementation of the Cox model in lifelines is under CoxPHFitter. Like R, it has a print_summary() function that prints a tabular view of coefficients and related stats.

from lifelines import CoxPHFitter

cph = CoxPHFitter()
cph.fit(rossi_dataset, duration_col='week', event_col='arrest')

cph.print_summary()  # access the results using cph.summary

"""
<lifelines.CoxPHFitter: fitted with 432 total observations, 318 right-censored observations>
duration col = 'week'
event col = 'arrest'
number of observations = 432
number of events observed = 114
partial log-likelihood = -658.75
time fit was run = 2019-10-05 14:24:44 UTC

---
coef  exp(coef)   se(coef)   coef lower 95%   coef upper 95%  exp(coef) lower 95%  exp(coef) upper 95%
fin   -0.38       0.68       0.19            -0.75            -0.00                 0.47                 1.00
age   -0.06       0.94       0.02            -0.10            -0.01                 0.90                 0.99
race   0.31       1.37       0.31            -0.29             0.92                 0.75                 2.50
wexp  -0.15       0.86       0.21            -0.57             0.27                 0.57                 1.30
mar   -0.43       0.65       0.38            -1.18             0.31                 0.31                 1.37
paro  -0.08       0.92       0.20            -0.47             0.30                 0.63                 1.35
prio   0.09       1.10       0.03             0.04             0.15                 1.04                 1.16

z      p   -log2(p)
fin  -1.98   0.05       4.40
age  -2.61   0.01       6.79
race  1.02   0.31       1.70
wexp -0.71   0.48       1.06
mar  -1.14   0.26       1.97
paro -0.43   0.66       0.59
prio  3.19 <0.005       9.48
---
Concordance = 0.64
Log-likelihood ratio test = 33.27 on 7 df, -log2(p)=15.37
"""


To access the coefficients and the baseline hazard directly, you can use params_ and baseline_hazard_ respectively. Taking a look at these coefficients for a moment, prio (the number of prior arrests) has a coefficient of about 0.09. Thus, a one unit increase in prio means the baseline hazard will increase by a factor of $$\exp{(0.09)} = 1.10$$ - about a 10% increase. Recall that in the Cox proportional hazard model, a higher hazard means more risk of the event occurring. The value $$\exp{(0.09)}$$ is called the hazard ratio, a name that will become clear with another example.
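The arithmetic behind hazard ratios is just exponentiation of the coefficients. For instance:

```python
import math

coef = 0.09                    # the fitted coefficient on prio, from above
hazard_ratio = math.exp(coef)  # about 1.094, roughly a 10% increase per unit

# Ratios multiply: two additional prior arrests scale the hazard
# by exp(2 * 0.09), which equals exp(0.09) squared.
two_unit_ratio = math.exp(2 * coef)
```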

Consider the coefficient of mar (whether the subject is married or not). The values in the column are binary: 0 or 1, representing unmarried or married. The value of the coefficient associated with mar, $$\exp{(-0.43)}$$, is the ratio of hazards associated with being married, that is:

$\exp(-0.43) = \frac{\text{hazard of married subjects at time t}}{\text{hazard of unmarried subjects at time t}}$

Note that the left-hand side is a constant (specifically, it’s independent of time, $$t$$), but the right-hand side has two factors that can vary with time. The proportional hazard assumption is that this relationship holds in reality: hazards can change over time, but their ratio between levels remains constant. Later we will deal with checking this assumption.

#### Convergence¶

Fitting the Cox model to the data involves using iterative methods. lifelines takes extra effort to help with convergence, so please be attentive to any warnings that appear. Fixing any warnings will generally help convergence and decrease the number of iterative steps required. If you wish to see the fitting, there is a show_progress parameter in fit() function. For further help, see Problems with convergence in the Cox proportional hazard model.

After fitting, the value of the maximum log-likelihood is available under log_likelihood_. The variance matrix of the coefficients is available under variance_matrix_.

#### Goodness of fit¶

After fitting, you may want to know how “good” of a fit your model was to the data. The concordance index reported by print_summary() (see the output above) is one common measure.

#### Prediction¶

After fitting, you can use the suite of prediction methods: predict_partial_hazard(), predict_survival_function(), and others.

X = rossi_dataset

cph.predict_partial_hazard(X)
cph.predict_survival_function(X)
cph.predict_median(X)
...


A common use case is to predict the event time of censored subjects. This is easy to do, but we first have to calculate an important conditional probability. Let $$T$$ be the (random) event time for some subject, and $$S(t) = P(T > t)$$ be their survival function. We are interested in answering the following: what is a subject’s new survival function, given we know the subject has lived past time $$s$$? Mathematically:

\begin{split}\begin{align*} P(T > t \;|\; T > s) &= \frac{P(T > t \;\text{and}\; T > s)}{P(T > s)} \\ &= \frac{P(T > t)}{P(T > s)} \\ &= \frac{S(t)}{S(s)} \end{align*}\end{split}

Thus we scale the original survival function by the survival function at time $$s$$ (everything prior to $$s$$ should be mapped to 1.0 as well, since we are working with probabilities and we know that the subject was alive before $$s$$).
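This conditional survival function is straightforward to compute by hand from any survival function. A minimal sketch (the exponential survival function below is a made-up example):

```python
import math

def conditional_survival(S, t, s):
    # P(T > t | T > s) = S(t) / S(s) for t > s; for t <= s the
    # subject is known to be alive, so the probability is 1.
    if t <= s:
        return 1.0
    return S(t) / S(s)

# Hypothetical exponential survival function S(t) = exp(-0.1 t):
S = lambda t: math.exp(-0.1 * t)
print(conditional_survival(S, 20, 10))  # exp(-1), about 0.368
```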

Back to our original problem of predicting the event time of censored individuals, lifelines has all this math and logic built in when using the conditional_after kwarg.

# filter down to just censored subjects to predict remaining survival
censored_subjects = X.loc[~X['arrest'].astype(bool)]
censored_subjects_last_obs = censored_subjects['week']

cph.predict_survival_function(censored_subjects, times=[5., 25., 50.], conditional_after=censored_subjects_last_obs)
cph.predict_median(censored_subjects, conditional_after=censored_subjects_last_obs)


#### Plotting the coefficients¶

With a fitted model, an alternative way to view the coefficients and their ranges is to use the plot method.

from lifelines.datasets import load_rossi
from lifelines import CoxPHFitter

cph = CoxPHFitter()
cph.fit(rossi_dataset, duration_col='week', event_col='arrest')

cph.plot()

#### Plotting the effect of varying a covariate¶

After fitting, we can plot what the survival curves look like as we vary a single covariate while holding everything else equal. This is useful to understand the impact of a covariate, given the model. To do this, we use the plot_covariate_groups() method and give it the covariate of interest, and the values to display.

from lifelines.datasets import load_rossi
from lifelines import CoxPHFitter

cph = CoxPHFitter()
cph.fit(rossi_dataset, duration_col='week', event_col='arrest')

cph.plot_covariate_groups('prio', [0, 2, 4, 6, 8, 10], cmap='coolwarm')

The plot_covariate_groups() method can accept multiple covariates as well. This is useful for two purposes:

1. There are derivative features in your dataset. For example, suppose you have included year and year**2 in your dataset. It doesn’t make sense to vary year and leave year**2 fixed. You’ll need to manually specify the values the covariates take on in an N-d array or list (where N is the number of covariates being varied).

cph.plot_covariate_groups(
    ['year', 'year**2'],
    [
        [0, 0],
        [1, 1],
        [2, 4],
        [3, 9],
        [8, 64],
    ],
    cmap='coolwarm')

2. This feature is also useful for analyzing categorical variables. In your regression, you may have dummy variables (also called one-hot-encoded variables) in your DataFrame that represent some categorical variable. To simultaneously plot the survival curves of each category, all else being equal, we can use:

cph.plot_covariate_groups(
    ['d1', 'd2', 'd3', 'd4', 'd5'],
    np.eye(5),
    cmap='coolwarm')


The reason why we use np.eye is because we want each row of the matrix to “turn on” one category and “turn off” the others.

#### Checking the proportional hazards assumption¶

To make proper inferences, we should ask if our Cox model is appropriate for our dataset. Recall from above that when using the Cox model, we are implicitly applying the proportional hazard assumption. We should ask, does our dataset obey this assumption?

CoxPHFitter has a check_assumptions() method that will output violations of the proportional hazard assumption. For a tutorial on how to fix violations, see Testing the Proportional Hazard Assumptions. Suggestions are to look for ways to stratify a column (see docs below), or use a time varying model.

Note

Checking assumptions like this is only necessary if your goal is inference or correlation. That is, you wish to understand the influence of a covariate on the survival duration & outcome. If your goal is prediction, checking model assumptions is less important since your goal is to maximize an accuracy metric, and not learn about how the model is making that prediction.

#### Stratification¶

Sometimes one or more covariates may not obey the proportional hazard assumption. In this case, we can allow the covariate(s) to still be included in the model without estimating their effect. This is called stratification. At a high level, think of it as splitting the dataset into N smaller datasets, defined by the unique values of the stratifying covariate(s). Each dataset has its own baseline hazard (the non-parametric part of the model), but they all share the regression parameters (the parametric part of the model). Since the stratifying covariates are constant within each smaller dataset, there is no regression parameter for them, hence they will not show up in the output. However, there will be N baseline hazards under baseline_cumulative_hazard_.

To specify variables to be used in stratification, we define them in the call to fit():

from lifelines.datasets import load_rossi
from lifelines import CoxPHFitter

cph = CoxPHFitter()
cph.fit(rossi_dataset, 'week', event_col='arrest', strata=['race'])

cph.print_summary()  # access the results using cph.summary

"""
<lifelines.CoxPHFitter: fitted with 432 observations, 318 censored>
duration col = 'week'
event col = 'arrest'
strata = ['race']
number of subjects = 432
number of events = 114
log-likelihood = -620.56
time fit was run = 2019-01-27 23:08:35 UTC

---
coef  exp(coef)  se(coef)     z      p  -log2(p)  lower 0.95  upper 0.95
fin  -0.38       0.68      0.19 -1.98   0.05      4.39       -0.75       -0.00
age  -0.06       0.94      0.02 -2.62   0.01      6.83       -0.10       -0.01
wexp -0.14       0.87      0.21 -0.67   0.50      0.99       -0.56        0.27
mar  -0.44       0.64      0.38 -1.15   0.25      2.00       -1.19        0.31
paro -0.09       0.92      0.20 -0.44   0.66      0.60       -0.47        0.30
prio  0.09       1.10      0.03  3.21 <0.005      9.56        0.04        0.15
---
Concordance = 0.64
Likelihood ratio test = 109.63 on 6 df, -log2(p)=68.48
"""

cph.baseline_cumulative_hazard_.shape
# (49, 2)


#### Weights & robust errors¶

Observations can come with weights as well. These weights may be integer values representing some commonly occurring observation, or they may be float values representing sampling weights (e.g. inverse probability weights). The fit() method has a kwarg for specifying which column in the DataFrame should be used as weights, e.g. CoxPHFitter().fit(df, 'T', 'E', weights_col='weights').

When using sampling weights, it’s correct to also change the standard error calculations. That is done by turning on the robust flag in fit(). Internally, CoxPHFitter will use the sandwich estimator to compute the errors.

import pandas as pd
from lifelines import CoxPHFitter

df = pd.DataFrame({
    'T': [5, 3, 9, 8, 7, 4, 4, 3, 2, 5, 6, 7],
    'E': [1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0],
    'weights': [1.1, 0.5, 2.0, 1.6, 1.2, 4.3, 1.4, 4.5, 3.0, 3.2, 0.4, 6.2],
    'month': [10, 3, 9, 8, 7, 4, 4, 3, 2, 5, 6, 7],
    'age': [4, 3, 9, 8, 7, 4, 4, 3, 2, 5, 6, 7],
})

cph = CoxPHFitter()
cph.fit(df, 'T', 'E', weights_col='weights', robust=True)
cph.print_summary()


See more examples in Adding weights to observations in a Cox model.

#### Clusters & correlations¶

Another property your dataset may have is groups of related subjects. This could be caused by:

• a single individual having multiple occurrences, and hence showing up in the dataset more than once.
• subjects that share some common property, like members of the same family or being matched on propensity scores.

We call these grouped subjects “clusters”, and assume they are designated by some column in the DataFrame (example below). When using cluster, the point estimates of the model don’t change, but the standard errors will increase. An intuitive argument for this is that 100 observations on 100 individuals provide more information than 100 observations on 10 individuals (or clusters).

import pandas as pd
from lifelines import CoxPHFitter

df = pd.DataFrame({
    'T': [5, 3, 9, 8, 7, 4, 4, 3, 2, 5, 6, 7],
    'E': [1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0],
    'month': [10, 3, 9, 8, 7, 4, 4, 3, 2, 5, 6, 7],
    'age': [4, 3, 9, 8, 7, 4, 4, 3, 2, 5, 6, 7],
    'id': [1, 1, 1, 1, 2, 3, 3, 4, 4, 5, 6, 7]
})

cph = CoxPHFitter()
cph.fit(df, 'T', 'E', cluster_col='id')
cph.print_summary()


For more examples, see Correlations between subjects in a Cox model.

#### Residuals¶

After fitting a Cox model, we can look back and compute important model residuals. These residuals can tell us about non-linearities not captured, violations of proportional hazards, and help us answer other useful modeling questions. See Assessing Cox model fit using residuals.

#### Baseline hazard and survival¶

To access the non-parametric baseline hazard and baseline survival, one can use baseline_hazard_ and baseline_survival_ respectively. These are computed using Breslow’s approximation. If you are interested in a _parametric_ baseline hazard, please see this issue.

### Parametric survival models¶

#### Accelerated failure time models¶

Suppose we have two populations, A and B, with different survival functions, $$S_A(t)$$ and $$S_B(t)$$, and they are related by some accelerated failure rate, $$\lambda$$:

$S_A(t) = S_B\left(\frac{t}{\lambda}\right)$

This can be interpreted as slowing down or speeding up the movement along the survival function. A classic example is that dogs age at 7 times the rate of humans, i.e. $$\lambda = \frac{1}{7}$$. This model has some other nice properties: the average survival time of population A is $${\lambda}$$ times the average survival time of population B. The same holds for the median survival time.
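This scaling of average survival times can be checked with a quick simulation (the exponential distribution below is only an illustrative choice):

```python
import random
random.seed(0)

lam = 1 / 7  # the dog-vs-human acceleration factor from the text

# Population B with mean survival time 10; since S_A(t) = S_B(t / lam),
# durations in population A are simply lam times those in B.
T_B = [random.expovariate(1 / 10) for _ in range(100_000)]
T_A = [lam * t for t in T_B]

mean_B = sum(T_B) / len(T_B)
mean_A = sum(T_A) / len(T_A)
# mean_A is lam * mean_B (up to floating point), so mean_B = mean_A / lam
```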

More generally, we can model the $$\lambda$$ as a function of covariates available, that is:

$\begin{split}S_A(t) = S_B\left(\frac{t}{\lambda(x)}\right)\\ \lambda(x) = \exp\left(b_0 + \sum_{i=1}^n b_i x_i \right)\end{split}$

This model can accelerate or decelerate failure times depending on subjects’ covariates. Another nice feature of this is the ease of interpretation of the coefficients: a unit increase in $$x_i$$ means the average/median survival time changes by a factor of $$\exp(b_i)$$.

Note

An important note on interpretation: suppose $$b_i$$ is positive. Then the factor $$\exp(b_i)$$ is greater than 1, which decelerates the event time (since we divide time by the factor) and so increases the mean/median survival. Hence it is a protective effect. Likewise, a negative $$b_i$$ hastens the event time and reduces the mean/median survival time. This interpretation is opposite to how the sign influences event times in the Cox model! This is standard survival analysis convention.

Next, we pick a parametric form for the survival function, $$S(t)$$. The most common is the Weibull form. So if we assume the relationship above and a Weibull form, our hazard function is quite easy to write down:

$H(t; x) = \left( \frac{t}{\lambda(x)} \right)^\rho$

We call these accelerated failure time models, shortened often to just AFT models. Using lifelines, we can fit this model (and the unknown $$\rho$$ parameter too).

#### The Weibull AFT model¶

The Weibull AFT model is implemented under WeibullAFTFitter. The API for the class is similar to the other regression models in lifelines. After fitting, the coefficients can be accessed using params_ or summary, or alternatively printed using print_summary().

from lifelines import WeibullAFTFitter

aft = WeibullAFTFitter()
aft.fit(rossi_dataset, duration_col='week', event_col='arrest')

aft.print_summary(3)  # access the results using aft.summary

"""
<lifelines.WeibullAFTFitter: fitted with 432 observations, 318 censored>
duration col = 'week'
event col = 'arrest'
number of subjects = 432
number of events = 114
log-likelihood = -679.917
time fit was run = 2019-02-20 17:47:19 UTC

---
coef  exp(coef)  se(coef)      z       p  -log2(p)  lower 0.95  upper 0.95
lambda_ fin         0.272      1.313     0.138  1.973   0.049     4.365       0.002       0.543
age         0.041      1.042     0.016  2.544   0.011     6.512       0.009       0.072
race       -0.225      0.799     0.220 -1.021   0.307     1.703      -0.656       0.207
wexp        0.107      1.112     0.152  0.703   0.482     1.053      -0.190       0.404
mar         0.311      1.365     0.273  1.139   0.255     1.973      -0.224       0.847
paro        0.059      1.061     0.140  0.421   0.674     0.570      -0.215       0.333
prio       -0.066      0.936     0.021 -3.143   0.002     9.224      -0.107      -0.025
_intercept  3.990     54.062     0.419  9.521 <0.0005    68.979       3.169       4.812
rho_    _intercept  0.339      1.404     0.089  3.809 <0.0005    12.808       0.165       0.514
---
Concordance = 0.640
Log-likelihood ratio test = 33.416 on 7 df, -log2(p)=15.462
"""


From above, we can see that prio, which is the number of previous incarcerations, has a large negative coefficient. This means that each additional incarceration changes a subject’s mean/median survival time by a factor of $$\exp(-0.066) = 0.936$$, approximately a 7% decrease in mean/median survival time. What is the mean/median survival time?

print(aft.median_survival_time_)
print(aft.mean_survival_time_)

# 100.325
# 118.67


What does the rho_ _intercept row mean in the above table? Internally, we model the log of the rho_ parameter, so the value of $$\rho$$ is the exponential of the value; in the case above, $$\hat{\rho} = \exp(0.339) = 1.404$$. This brings us to the next point - modelling $$\rho$$ with covariates as well:
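Before moving on, the arithmetic can be confirmed quickly:

```python
import math

# rho_ is modeled on the log scale, so exponentiate the fitted
# intercept to recover the estimate of rho.
rho_hat = math.exp(0.339)
print(round(rho_hat, 3))  # 1.404
```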

#### Modeling ancillary parameters¶

In the above model, we left the parameter $$\rho$$ as a single unknown. We can also choose to model this parameter. Why might we want to do this? Allowing heterogeneity in the $$\rho$$ parameter can help in survival prediction. The model is no longer an AFT model, but we can still recover and understand the influence of changing a covariate by looking at its outcome plot (see section below). To model $$\rho$$, we use the ancillary_df keyword argument in the call to fit(). There are a few valid options:

1. False or None: explicitly do not model the rho_ parameter (except for its intercept).
2. a Pandas DataFrame: use the columns in this DataFrame as the covariates in the regression for rho_. This DataFrame could be equal to, or a subset of, the original dataset used for modeling lambda_, or it could be a totally different dataset.
3. True: internally reuse the dataset that is being used to model lambda_.
aft = WeibullAFTFitter()

aft.fit(rossi, duration_col='week', event_col='arrest', ancillary_df=False)
# identical to aft.fit(rossi, duration_col='week', event_col='arrest', ancillary_df=None)

aft.fit(rossi, duration_col='week', event_col='arrest', ancillary_df=some_df)

aft.fit(rossi, duration_col='week', event_col='arrest', ancillary_df=True)
# identical to aft.fit(rossi, duration_col='week', event_col='arrest', ancillary_df=rossi)

aft.print_summary()

"""
<lifelines.WeibullAFTFitter: fitted with 432 observations, 318 censored>
duration col = 'week'
event col = 'arrest'
number of subjects = 432
number of events = 114
log-likelihood = -669.40
time fit was run = 2019-02-20 17:42:55 UTC

---
coef  exp(coef)  se(coef)     z      p  -log2(p)  lower 0.95  upper 0.95
lambda_ fin         0.24       1.28      0.15  1.60   0.11      3.18       -0.06        0.55
age         0.10       1.10      0.03  3.43 <0.005     10.69        0.04        0.16
race        0.07       1.07      0.19  0.36   0.72      0.48       -0.30        0.44
wexp       -0.34       0.71      0.15 -2.22   0.03      5.26       -0.64       -0.04
mar         0.26       1.30      0.30  0.86   0.39      1.35       -0.33        0.85
paro        0.09       1.10      0.15  0.61   0.54      0.88       -0.21        0.39
prio       -0.08       0.92      0.02 -4.24 <0.005     15.46       -0.12       -0.04
_intercept  2.68      14.65      0.60  4.50 <0.005     17.14        1.51        3.85
rho_    fin        -0.01       0.99      0.15 -0.09   0.92      0.11       -0.31        0.29
age        -0.05       0.95      0.02 -3.10 <0.005      9.01       -0.08       -0.02
race       -0.46       0.63      0.25 -1.79   0.07      3.77       -0.95        0.04
wexp        0.56       1.74      0.17  3.32 <0.005     10.13        0.23        0.88
mar         0.10       1.10      0.27  0.36   0.72      0.47       -0.44        0.63
paro        0.02       1.02      0.16  0.12   0.90      0.15       -0.29        0.33
prio        0.03       1.03      0.02  1.44   0.15      2.73       -0.01        0.08
_intercept  1.48       4.41      0.41  3.60 <0.005     11.62        0.68        2.29
---
Concordance = 0.63
Log-likelihood ratio test = 54.45 on 14 df, -log2(p)=19.83
"""


#### Plotting¶

The plotting API is the same as in CoxPHFitter. We can view all covariates in a forest plot:

wft = WeibullAFTFitter().fit(rossi, 'week', 'arrest', ancillary_df=True)
wft.plot()

We can observe the influence of a variable in the model by plotting the outcome (i.e. survival) as we change the variable. This is done using plot_covariate_groups(), and it is also a nice time to observe the effects of modeling rho_ vs. keeping it fixed. Below we fit the Weibull model to the same dataset twice: in the first model we model rho_, and in the second we don’t. We then vary prio (the number of prior arrests) and observe how the survival changes.

fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(10, 4))

times = np.arange(0, 100)
wft_model_rho = WeibullAFTFitter().fit(rossi, 'week', 'arrest', ancillary_df=True, timeline=times)
wft_model_rho.plot_covariate_groups('prio', range(0, 16, 3), cmap='coolwarm', ax=ax[0])
ax[0].set_title("Modelling rho_")

wft_not_model_rho = WeibullAFTFitter().fit(rossi, 'week', 'arrest', ancillary_df=False, timeline=times)
wft_not_model_rho.plot_covariate_groups('prio', range(0, 16, 3), cmap='coolwarm', ax=ax[1])
ax[1].set_title("Not modelling rho_");

Comparing a few of these survival functions side by side:

fig, ax = plt.subplots(nrows=1, ncols=1, figsize=(7, 4))

wft_model_rho.plot_covariate_groups('prio', range(0, 16, 5), cmap='coolwarm', ax=ax, lw=2, plot_baseline=False)
wft_not_model_rho.plot_covariate_groups('prio', range(0, 16, 5), cmap='coolwarm', ax=ax, ls='--', lw=2, plot_baseline=False)
ax.get_legend().remove()

You can read more about, and see other examples of, the extensions to plot_covariate_groups().

#### Prediction¶

Given a new subject, we'd like to ask questions about their future survival. When are they likely to experience the event? What does their survival function look like? The WeibullAFTFitter is able to answer these. If we have modeled the ancillary covariates, we are required to include those as well:

X = rossi.loc[:10]

aft.predict_cumulative_hazard(X, ancillary_df=X)
aft.predict_survival_function(X, ancillary_df=X)
aft.predict_median(X, ancillary_df=X)
aft.predict_percentile(X, p=0.9, ancillary_df=X)
aft.predict_expectation(X, ancillary_df=X)


When predicting time remaining for censored individuals, you can use the conditional_after kwarg:

censored_X = rossi.loc[~rossi['arrest'].astype(bool)]
censored_subjects_last_obs = censored_X['week']

aft.predict_cumulative_hazard(censored_X, ancillary_df=censored_X, conditional_after=censored_subjects_last_obs)
aft.predict_survival_function(censored_X, ancillary_df=censored_X, conditional_after=censored_subjects_last_obs)
aft.predict_median(censored_X, ancillary_df=censored_X, conditional_after=censored_subjects_last_obs)
aft.predict_percentile(censored_X, p=0.9, ancillary_df=censored_X, conditional_after=censored_subjects_last_obs)


There are two hyper-parameters that can be used to achieve a better test score: penalizer and l1_ratio in the call to WeibullAFTFitter. The penalizer is similar to scikit-learn's ElasticNet model; see their docs.

aft_with_elastic_penalty = WeibullAFTFitter(penalizer=4.0, l1_ratio=1.0)
aft_with_elastic_penalty.fit(rossi, 'week', 'arrest')
aft_with_elastic_penalty.predict_median(rossi)

aft_with_elastic_penalty.print_summary()

"""
<lifelines.WeibullAFTFitter: fitted with 432 observations, 318 censored>
duration col = 'week'
event col = 'arrest'
penalizer = 4.0
l1_ratio = 1.0
number of subjects = 432
number of events = 114
log-likelihood = -2710.95
time fit was run = 2019-02-20 19:53:29 UTC

---
coef  exp(coef)  se(coef)     z      p  -log2(p)  lower 0.95  upper 0.95
lambda_ fin         0.00       1.00      0.08  0.00   1.00      0.00       -0.15        0.15
age         0.13       1.14      0.01 12.27 <0.005    112.47        0.11        0.15
race        0.55       1.73      0.09  5.80 <0.005     27.16        0.36        0.73
wexp        0.00       1.00      0.09  0.00   1.00      0.00       -0.17        0.17
mar         0.00       1.00      0.14  0.01   0.99      0.01       -0.27        0.28
paro        0.00       1.00      0.08  0.01   0.99      0.01       -0.16        0.16
prio        0.00       1.00      0.01  0.00   1.00      0.00       -0.03        0.03
_intercept  0.00       1.00      0.19  0.00   1.00      0.00       -0.38        0.38
rho_    _intercept -0.00       1.00       nan   nan    nan       nan         nan         nan
---
Concordance = 0.60
Log-likelihood ratio test = -4028.65 on 7 df, -log2(p)=-0.00
"""


#### The log-normal and log-logistic AFT models¶

There are also the LogNormalAFTFitter and LogLogisticAFTFitter models, which instead of assuming that the survival time distribution is Weibull, assume it is log-normal or log-logistic, respectively. They have identical APIs to the WeibullAFTFitter, but the parameter names are different.

from lifelines import LogLogisticAFTFitter
from lifelines import LogNormalAFTFitter

llf = LogLogisticAFTFitter().fit(rossi, 'week', 'arrest')
lnf = LogNormalAFTFitter().fit(rossi, 'week', 'arrest')


#### The piecewise-exponential regression and generalized gamma models¶

Another class of parametric models involves more flexible modeling of the hazard function. The PiecewiseExponentialRegressionFitter can model jumps in the hazard (think: the differences in “survival-of-staying-in-school” between 1st year, 2nd year, 3rd year, and 4th year students), and constant values between jumps. The ability to specify when these jumps occur, called breakpoints, offers modelers great flexibility. An example application involving customer churn is available in this notebook.
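The mechanics of a piecewise-constant hazard can be sketched without lifelines: choose breakpoints, assume a constant hazard inside each interval, and integrate to obtain a piecewise-linear cumulative hazard. A minimal numpy illustration, where the breakpoints and hazard values are made up for illustration:

```python
import numpy as np

# hypothetical breakpoints (think: end of each school year) and a constant
# hazard value inside each of the three resulting intervals
breakpoints = np.array([1.0, 2.0])
hazards = np.array([0.05, 0.20, 0.10])  # hazard on [0, 1), [1, 2), [2, inf)

def cumulative_hazard(t):
    # integrate the step-function hazard from 0 to t:
    # sum, over each interval, the hazard times the overlap with [0, t]
    edges = np.concatenate([[0.0], breakpoints, [np.inf]])
    widths = np.clip(t - edges[:-1], 0.0, edges[1:] - edges[:-1])
    return np.sum(hazards * widths)

# the cumulative hazard is piecewise linear, with slope changes at the breakpoints
print(cumulative_hazard(0.5))  # 0.025
print(cumulative_hazard(1.5))  # 0.05 + 0.10 = 0.15
print(cumulative_hazard(3.0))  # 0.05 + 0.20 + 0.10 = 0.35
```

PiecewiseExponentialRegressionFitter fits the per-interval hazards as functions of the covariates; this toy only shows the shape of the hazard the model assumes.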

For a flexible and smooth parametric model, there is the GeneralizedGammaRegressionFitter. This model is actually a generalization of all the AFT models above (that is, specific values of its parameters represent another model) - see docs for specific parameter values. The API is slightly different, however, and looks more like how custom regression models are built (see the next section on Custom Regression Models).

from lifelines import GeneralizedGammaRegressionFitter

rossi['constant'] = 1.

# this will regress all 3 parameters against all columns
ggf = GeneralizedGammaRegressionFitter().fit(rossi, 'week', 'arrest')
ggf.print_summary()

# if we only want to regress against the scale parameter, mu_
regressors = {
    'mu_': rossi.columns,
    'sigma_': ['constant'],
    'lambda_': ['constant']
}

ggf = GeneralizedGammaRegressionFitter().fit(rossi, 'week', 'arrest', regressors=regressors)
ggf.print_summary()


#### Model selection for parametric models¶

Often, you don’t know a priori which parametric model to use. Each model has some assumptions built in (checking these is not yet implemented in lifelines), but a quick and effective method is to compare the log-likelihoods of the fitted models. (Technically, we are comparing the AIC, but the number of parameters is the same for each model, so we can simply look at the log-likelihood.) Generally, given the same dataset and number of parameters, a better-fitting model has a larger log-likelihood. We can look at the log-likelihood for each fitted model and select the largest one.

from lifelines import LogLogisticAFTFitter, WeibullAFTFitter, LogNormalAFTFitter

llf = LogLogisticAFTFitter().fit(rossi, 'week', 'arrest')
lnf = LogNormalAFTFitter().fit(rossi, 'week', 'arrest')
wf = WeibullAFTFitter().fit(rossi, 'week', 'arrest')

print(llf.log_likelihood_)  # -679.938
print(lnf.log_likelihood_)  # -683.234
print(wf.log_likelihood_)   # -679.916, slightly the best model.

# with some heterogeneity in the ancillary parameters
ancillary_df = rossi[['prio']]
llf = LogLogisticAFTFitter().fit(rossi, 'week', 'arrest', ancillary_df=ancillary_df)
lnf = LogNormalAFTFitter().fit(rossi, 'week', 'arrest', ancillary_df=ancillary_df)
wf = WeibullAFTFitter().fit(rossi, 'week', 'arrest', ancillary_df=ancillary_df)

print(llf.log_likelihood_) # -678.94, slightly the best model.
print(lnf.log_likelihood_) # -680.39
print(wf.log_likelihood_)  # -679.60
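Since the three fits in the first comparison estimate the same number of parameters, ranking by log-likelihood is equivalent to ranking by AIC (AIC = 2k − 2·log-likelihood; lower is better). A quick check, using the log-likelihood values printed above:

```python
# AIC = 2k - 2*log-likelihood; lower is better.  The log-likelihood values
# are the ones printed above; k counts fitted parameters (7 covariate
# coefficients + the lambda_ intercept + the ancillary rho_ intercept = 9,
# the same for all three models, so AIC preserves the log-likelihood ranking).
log_likelihoods = {'llf': -679.938, 'lnf': -683.234, 'wf': -679.916}
k = 9

aic = {name: 2 * k - 2 * ll for name, ll in log_likelihoods.items()}
best = min(aic, key=aic.get)
print(best)  # 'wf' -- matches picking the largest log-likelihood
```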


#### Left, right and interval censored data¶

The parametric models have APIs that handle left and interval censored data, too. The API for these is different from the API for fitting to right censored data. Here’s an example with interval censored data.

from lifelines.datasets import load_diabetes

df = load_diabetes()
df['gender'] = df['gender'] == 'male'

"""
left  right  gender
1    24     27    True
2    22     22   False
3    37     39    True
4    20     20    True
5     1     16    True
"""

wf = WeibullAFTFitter().fit_interval_censoring(df, lower_bound_col='left', upper_bound_col='right')
wf.print_summary()

"""
<lifelines.WeibullAFTFitter: fitted with 731 observations, 136 censored>
event col = 'E'
number of subjects = 731
number of events = 595
log-likelihood = -2027.20
time fit was run = 2019-04-11 19:39:42 UTC

---
coef exp(coef)  se(coef)      z      p  -log2(p)  lower 0.95  upper 0.95
lambda_ gender      0.05      1.05      0.03   1.66   0.10      3.38       -0.01        0.10
_intercept  2.91     18.32      0.02 130.15 <0.005       inf        2.86        2.95
rho_    _intercept  1.04      2.83      0.03  36.91 <0.005    988.46        0.98        1.09
---
Log-likelihood ratio test = 2.74 on 1 df, -log2(p)=3.35
"""


Another example of using lifelines for interval censored data is located here.

#### Custom parametric regression models¶

lifelines has a very general syntax for creating your own parametric regression models. If you are looking to create your own custom models, see docs Custom Regression Models.

### Aalen’s additive model¶

Warning

This implementation is still experimental.

Aalen’s Additive model is another regression model we can use. Like the Cox model, it defines the hazard rate, but instead of the linear model being multiplicative like the Cox model, the Aalen model is additive. Specifically:

$h(t|x) = b_0(t) + b_1(t) x_1 + ... + b_N(t) x_N$

Inference typically does not estimate the individual $$b_i(t)$$ but instead estimates $$\int_0^t b_i(s) \; ds$$ (similar to the estimate of the hazard rate using NelsonAalenFitter). This is important when interpreting plots produced.
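Numerically, the cumulative coefficient is just the running integral of the instantaneous coefficient, which is why lifelines plots (and estimates) the former. A toy numpy sketch, with a made-up $$b_i(t)$$:

```python
import numpy as np

# a made-up instantaneous coefficient b(t), observed on a fine time grid
t = np.linspace(0, 10, 1001)
dt = t[1] - t[0]
b = 0.2 * np.sin(t)  # hypothetical b_i(t)

# the quantity Aalen-type estimators report is the integral of b(s) ds,
# approximated here by a cumulative (rectangle-rule) sum
cumulative_b = np.cumsum(b * dt)

# the exact integral of 0.2*sin(s) from 0 to t is 0.2*(1 - cos(t))
expected = 0.2 * (1 - np.cos(t))
print(np.max(np.abs(cumulative_b - expected)))  # small discretization error
```

A flat stretch in the cumulative plot therefore means the instantaneous coefficient is near zero over that period, not that its integral is zero.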

For this exercise, we will use the regime dataset and include the categorical variables un_continent_name (e.g., Asia, North America, …), the regime type (e.g., monarchy, civilian, …) and the year the regime started in, start_year. The estimator that fits the unknown coefficients in Aalen’s additive model is AalenAdditiveFitter.

from lifelines import AalenAdditiveFitter
from lifelines.datasets import load_dd

data = load_dd()
data.head()

ctryname cowcode2 politycode un_region_name un_continent_name ehead leaderspellreg democracy regime start_year duration observed
Afghanistan 700 700 Southern Asia Asia Mohammad Zahir Shah Mohammad Zahir Shah.Afghanistan.1946.1952.Monarchy Non-democracy Monarchy 1946 7 1
Afghanistan 700 700 Southern Asia Asia Sardar Mohammad Daoud Sardar Mohammad Daoud.Afghanistan.1953.1962.Civilian Dict Non-democracy Civilian Dict 1953 10 1
Afghanistan 700 700 Southern Asia Asia Mohammad Zahir Shah Mohammad Zahir Shah.Afghanistan.1963.1972.Monarchy Non-democracy Monarchy 1963 10 1
Afghanistan 700 700 Southern Asia Asia Sardar Mohammad Daoud Sardar Mohammad Daoud.Afghanistan.1973.1977.Civilian Dict Non-democracy Civilian Dict 1973 5 0
Afghanistan 700 700 Southern Asia Asia Nur Mohammad Taraki Nur Mohammad Taraki.Afghanistan.1978.1978.Civilian Dict Non-democracy Civilian Dict 1978 1 0

I’m using the lovely library Patsy here to create a design matrix from my original DataFrame.

import patsy
X = patsy.dmatrix('un_continent_name + regime + start_year', data, return_type='dataframe')
X = X.rename(columns={'Intercept': 'baseline'})

print(X.columns.tolist())

['baseline',
'un_continent_name[T.Americas]',
'un_continent_name[T.Asia]',
'un_continent_name[T.Europe]',
'un_continent_name[T.Oceania]',
'regime[T.Military Dict]',
'regime[T.Mixed Dem]',
'regime[T.Monarchy]',
'regime[T.Parliamentary Dem]',
'regime[T.Presidential Dem]',
'start_year']


We also include the coef_penalizer option. During estimation, a linear regression is computed at each step. Often the regression can be unstable (due to high collinearity or small sample sizes), so adding a penalizer term controls the stability. I recommend always starting with a small penalizer term; if the estimates still appear to be too unstable, try increasing it.

aaf = AalenAdditiveFitter(coef_penalizer=1.0, fit_intercept=False)


An instance of AalenAdditiveFitter includes a fit() method that performs the inference on the coefficients. This method accepts a pandas DataFrame: each row is an individual, and the columns are the covariates plus two special columns: a duration column and a boolean event-occurred column (where the event of interest is expulsion from government, in this case).

X['T'] = data['duration']
X['E'] = data['observed']

aaf.fit(X, 'T', event_col='E')


After fitting, the instance exposes a cumulative_hazards_ DataFrame containing the estimates of $$\int_0^t b_i(s) \; ds$$:

aaf.cumulative_hazards_.head()

baseline un_continent_name[T.Americas] un_continent_name[T.Asia] un_continent_name[T.Europe] un_continent_name[T.Oceania] regime[T.Military Dict] regime[T.Mixed Dem] regime[T.Monarchy] regime[T.Parliamentary Dem] regime[T.Presidential Dem] start_year
-0.03447 -0.03173 0.06216 0.2058 -0.009559 0.07611 0.08729 -0.1362 0.04885 0.1285 0.000092
0.14278 -0.02496 0.11122 0.2083 -0.079042 0.11704 0.36254 -0.2293 0.17103 0.1238 0.000044
0.30153 -0.07212 0.10929 0.1614 0.063030 0.16553 0.68693 -0.2738 0.33300 0.1499 0.000004
0.37969 0.06853 0.15162 0.2609 0.185569 0.22695 0.95016 -0.2961 0.37351 0.4311 -0.000032
0.36749 0.20201 0.21252 0.2429 0.188740 0.25127 1.15132 -0.3926 0.54952 0.7593 -0.000000

AalenAdditiveFitter also has built in plotting:

aaf.plot(columns=['regime[T.Presidential Dem]', 'baseline', 'un_continent_name[T.Europe]'], iloc=slice(1,15))

Regression is most interesting if we use it on data we have not yet seen, i.e., prediction! We can use what we have learned to predict individual hazard rates, survival functions, and median survival time. The dataset we are using is available up until 2008, so let’s use this data to predict the duration of former Canadian Prime Minister Stephen Harper.

ix = (data['ctryname'] == 'Canada') & (data['start_year'] == 2006)
harper = X.loc[ix]
print("Harper's unique data point:")
print(harper)

Harper's unique data point:
baseline  un_continent_name[T.Americas]  un_continent_name[T.Asia] ...  start_year  T  E
268       1.0                            1.0                        0.0 ...      2006.0  3  0

ax = plt.subplot(2,1,1)
aaf.predict_cumulative_hazard(harper).plot(ax=ax)

ax = plt.subplot(2,1,2)
aaf.predict_survival_function(harper).plot(ax=ax);

Note

Because of the nature of the model, estimated survival functions of individuals can increase. This is an expected artifact of Aalen’s additive model.

### Model selection in survival regression¶

#### Parametric vs Semi-parametric models¶

Above, we’ve displayed two semi-parametric models (Cox model and Aalen’s model), and a family of parametric models. Which should you choose? What are the advantages and disadvantages of either? I suggest reading the two following StackExchange answers to get a better idea of what experts think:

#### Model selection based on residuals¶

The sections Testing the Proportional Hazard Assumptions and Assessing Cox model fit using residuals may be useful for modeling your data better.

Note

Work is being done to extend residual methods to AFT models. Stay tuned.

#### Model selection based on predictive power¶

If censoring is present, it’s not appropriate to use a loss function like mean-squared-error or mean-absolute-loss. Instead, one measure is the concordance index, also known as the c-index. This measure evaluates the accuracy of the ranking of predicted times. It is in fact a generalization of the AUC, another common loss function, and is interpreted similarly:

• 0.5 is the expected result from random predictions,
• 1.0 is perfect concordance and,
• 0.0 is perfect anti-concordance (multiply predictions with -1 to get 1.0)

Fitted survival models typically have a concordance index between 0.55 and 0.75 (this may seem low, but even a perfect model has a lot of noise that can make a high score impossible). In lifelines, a fitted model’s concordance index is present in the output of print_summary() and is also available under the score_ property. More generally, the measure is implemented in lifelines under lifelines.utils.concordance_index() and accepts the actual times (along with any censored subjects) and the predicted times.
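To make the definition concrete, here is a minimal from-scratch O(n²) sketch of the pairwise computation. (lifelines.utils.concordance_index() is the optimized version; note it accepts predicted times, whereas this toy uses risk scores where a higher score means an earlier predicted event.)

```python
import numpy as np

def naive_concordance_index(event_times, predicted_scores, event_observed):
    """O(n^2) sketch of the c-index: among comparable pairs, count how often
    the subject with the earlier event received the higher risk score."""
    concordant, comparable = 0.0, 0.0
    n = len(event_times)
    for i in range(n):
        for j in range(n):
            # a pair is comparable only if subject i's event is observed and
            # occurs strictly before subject j's (event or censoring) time
            if event_observed[i] and event_times[i] < event_times[j]:
                comparable += 1
                if predicted_scores[i] > predicted_scores[j]:
                    concordant += 1
                elif predicted_scores[i] == predicted_scores[j]:
                    concordant += 0.5  # ties get half credit
    return concordant / comparable

T = np.array([5, 10, 12, 15])
E = np.array([1, 1, 0, 1])       # subject with T=12 is censored
risk = np.array([0.9, 0.5, 0.4, 0.1])  # higher score = predicted earlier event
print(naive_concordance_index(T, risk, E))  # perfectly concordant -> 1.0
```

Notice the censored subject only contributes pairs where the other subject's event happened before the censoring time; pairs we cannot order are simply skipped.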

from lifelines import CoxPHFitter

cph = CoxPHFitter()
cph.fit(rossi, duration_col="week", event_col="arrest")

# Three ways to view the c-index:
# method one
cph.print_summary()

# method two
print(cph.score_)

# method three
from lifelines.utils import concordance_index
print(concordance_index(rossi['week'], -cph.predict_partial_hazard(rossi), rossi['arrest']))


Note

Remember, the concordance score evaluates the relative rankings of subjects’ event times. Thus, it is scale and shift invariant (i.e. you can multiply by a positive constant, or add a constant, and the rankings won’t change). A model optimized for the concordance index does not necessarily give good predicted times, but it will give good predicted rankings.

However, there are other, arguably better, methods to measure the fit of a model. Included in print_summary are the log-likelihood, which can be used in an AIC calculation, and the log-likelihood ratio statistic. On this topic, I personally loved this article by Frank Harrell, “Statistically Efficient Ways to Quantify Added Predictive Value of New Measurements”.

lifelines has an implementation of k-fold cross validation under lifelines.utils.k_fold_cross_validation(). This function accepts an instance of a regression fitter (either CoxPHFitter or AalenAdditiveFitter), a dataset, plus k (the number of folds to perform, default 5). On each fold, it splits the data into a training set and a testing set, fits itself on the training set, and evaluates itself on the testing set (using the concordance measure by default).

from lifelines import CoxPHFitter
from lifelines.utils import k_fold_cross_validation

cph = CoxPHFitter()
scores = k_fold_cross_validation(cph, regression_dataset, 'T', event_col='E', k=3)
print(scores)
print(np.mean(scores))
print(np.std(scores))

#[ 0.5896  0.5358  0.5028]
# 0.542
# 0.035


Also, lifelines has wrappers for compatibility with scikit-learn, making cross-validation and grid search even easier.

## Custom regression models¶

Like for univariate models, it is possible to create your own custom parametric survival models. Why might you want to do this?

• Create new / extend AFT models using known probability distributions
• Create a piecewise model using domain knowledge about subjects
• Iterate and fit a more accurate parametric model

lifelines has a very simple API to create custom parametric regression models. You only need to define the cumulative hazard function. For example, the cumulative hazard for the constant-hazard regression model looks like:

$\begin{split}H(t, x) = \frac{t}{\lambda(x)}\\ \lambda(x) = \exp{(\vec{\beta} \cdot \vec{x}^{\,T})}\end{split}$

where $$\beta$$ are the unknowns we will optimize over.

Below are some example custom models.


from lifelines.fitters import ParametricRegressionFitter
from autograd import numpy as np

class ExponentialAFTFitter(ParametricRegressionFitter):

    # this class property is necessary, and should always be a non-empty list of strings.
    _fitted_parameter_names = ['lambda_']

    def _cumulative_hazard(self, params, t, Xs):
        # params is a dictionary that maps unknown parameters to a numpy vector.
        # Xs is a dictionary that maps unknown parameters to a numpy 2d array.
        beta = params['lambda_']
        X = Xs['lambda_']
        lambda_ = np.exp(np.dot(X, beta))
        return t / lambda_

rossi['intercept'] = 1.0

# the below variable maps dataframe columns to parameters
regressors = {
    'lambda_': rossi.columns
}

eaf = ExponentialAFTFitter().fit(rossi, 'week', 'arrest', regressors=regressors)
eaf.print_summary()

"""
<lifelines.ExponentialAFTFitter: fitted with 432 observations, 318 censored>
duration col = 'week'
event col = 'arrest'
number of subjects = 432
number of events = 114
log-likelihood = -686.37
time fit was run = 2020-01-21 22:13:30 UTC

---
                   coef  exp(coef)  se(coef)  coef lower 95%  coef upper 95%  exp(coef) lower 95%  exp(coef) upper 95%      z       p  -log2(p)
lambda_ fin        0.37       1.44      0.19           -0.01            0.74                 0.99                 2.10   1.92    0.06      4.18
        age        0.06       1.06      0.02            0.01            0.10                 1.01                 1.10   2.55    0.01      6.52
        race      -0.30       0.74      0.31           -0.91            0.30                 0.40                 1.35  -0.99    0.32      1.63
        wexp       0.15       1.16      0.21           -0.27            0.56                 0.76                 1.75   0.69    0.49      1.03
        mar        0.43       1.53      0.38           -0.32            1.17                 0.73                 3.24   1.12    0.26      1.93
        paro       0.08       1.09      0.20           -0.30            0.47                 0.74                 1.59   0.42    0.67      0.57
        prio      -0.09       0.92      0.03           -0.14           -0.03                 0.87                 0.97  -3.03  <0.005      8.65
        intercept  4.05      57.44      0.59            2.90            5.20                18.21               181.15   6.91  <0.005     37.61
---
Log-likelihood ratio test = 31.22 on 6 df, -log2(p)=15.41
"""

class DependentCompetingRisksHazard(ParametricRegressionFitter):
    """
    Reference
    ---------
    Frees and Valdez, "Understanding Relationships Using Copulas"
    """

    _fitted_parameter_names = ["lambda1", "rho1", "lambda2", "rho2", "alpha"]

    def _cumulative_hazard(self, params, T, Xs):
        lambda1 = np.exp(np.dot(Xs["lambda1"], params["lambda1"]))
        lambda2 = np.exp(np.dot(Xs["lambda2"], params["lambda2"]))
        rho2 = np.exp(np.dot(Xs["rho2"], params["rho2"]))
        rho1 = np.exp(np.dot(Xs["rho1"], params["rho1"]))
        alpha = np.exp(np.dot(Xs["alpha"], params["alpha"]))

        return ((T / lambda1) ** rho1 + (T / lambda2) ** rho2) ** alpha

fitter = DependentCompetingRisksHazard(penalizer=0.1)

rossi["intercept"] = 1.0
rossi["week"] = rossi["week"] / rossi["week"].max()  # scaling often helps with convergence

covariates = {
    "lambda1": rossi.columns,
    "lambda2": rossi.columns,
    "rho1": ["intercept"],
    "rho2": ["intercept"],
    "alpha": ["intercept"],
}

fitter.fit(rossi, "week", event_col="arrest", regressors=covariates, timeline=np.linspace(0, 2))
fitter.print_summary(2)

ax = fitter.plot()

ax = fitter.predict_survival_function(rossi.loc[::100]).plot(figsize=(8, 4))
ax.set_title("Predicted survival functions for selected subjects")

"""
<lifelines.DependentCompetingRisksHazard: fitted with 432 observations, 318 censored>
duration col = 'week'
event col = 'arrest'
penalizer = 0.1
number of subjects = 432
number of events = 114
log-likelihood = -227.75
time fit was run = 2020-01-21 22:18:10 UTC

---
                   coef  exp(coef)  se(coef)  coef lower 95%  coef upper 95%  exp(coef) lower 95%  exp(coef) upper 95%      z       p  -log2(p)
lambda1 fin        0.02       1.02      0.22           -0.40            0.44                 0.67                 1.55   0.09    0.93      0.11
        age        0.01       1.01      0.00            0.01            0.02                 1.01                 1.02   3.34  <0.005     10.25
        race       0.05       1.05      0.17           -0.29            0.39                 0.75                 1.48   0.31    0.76      0.40
        wexp      -0.21       0.81      0.04           -0.29           -0.13                 0.75                 0.87  -5.44  <0.005     24.18
        mar        0.18       1.20      0.19           -0.19            0.56                 0.83                 1.75   0.96    0.34      1.56
        paro       0.02       1.02      0.12           -0.20            0.25                 0.82                 1.28   0.20    0.84      0.25
        prio      -0.01       0.99      0.01           -0.03            0.00                 0.97                 1.00  -1.35    0.18      2.49
        intercept -0.00       1.00      0.03           -0.06            0.06                 0.94                 1.06  -0.08    0.94      0.10
lambda2 fin        0.21       1.23      0.20           -0.19            0.61                 0.83                 1.84   1.03    0.30      1.72
        age        0.02       1.02      0.01            0.00            0.04                 1.00                 1.04   2.05    0.04      4.64
        race      -0.21       0.81      0.22           -0.64            0.21                 0.53                 1.24  -0.98    0.33      1.62
        wexp       0.24       1.27      0.13           -0.02            0.49                 0.98                 1.64   1.79    0.07      3.77
        mar        0.17       1.18      0.21           -0.25            0.58                 0.78                 1.79   0.78    0.43      1.20
        paro       0.01       1.01      0.14           -0.27            0.29                 0.76                 1.33   0.06    0.95      0.07
        prio      -0.07       0.93      0.02           -0.11           -0.03                 0.90                 0.97  -3.22  <0.005      9.59
        intercept  0.61       1.84      0.07            0.48            0.74                 1.61                 2.09   9.14  <0.005     63.82
rho1    intercept  2.87      17.69      0.39            2.12            3.63                 8.31                37.63   7.46  <0.005     43.38
rho2    intercept  0.30       1.35      0.66           -0.99            1.59                 0.37                 4.89   0.46    0.65      0.63
alpha   intercept -0.00       1.00      0.14           -0.29            0.28                 0.75                 1.32  -0.03    0.98      0.04
---
Log-likelihood ratio test = 36.86 on 17 df, -log2(p)=8.15
"""


### Cure models¶

Suppose that in our population we have a subpopulation that will never experience the event of interest, or that for some subjects the event will occur so far in the future that it’s essentially at time infinity. In this case, the survival function for an individual should not asymptotically approach zero, but some positive value. Models that describe this are sometimes called cure models (i.e. the subject is “cured” of death and hence no longer susceptible) or time-lagged conversion models.

It would be nice to be able to use common survival models and have some “cure” component. Let’s suppose that for individuals who will experience the event of interest, their survival distribution is a Weibull, denoted $$S_W(t)$$. For a randomly selected individual in the population, their survival curve, $$S(t)$$, is:

\begin{split}\begin{align*} S(t) = P(T > t) &= P(\text{cured}) P(T > t\;|\;\text{cured}) + P(\text{not cured}) P(T > t\;|\;\text{not cured}) \\ &= p + (1-p) S_W(t) \end{align*}\end{split}

Even though it’s in an unconventional form, we can still determine the cumulative hazard (which is the negative logarithm of the survival function):

$H(t) = -\log{\left(p + (1-p) S_W(t)\right)}$
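We can sanity-check this algebra numerically (with hypothetical parameter values): as $$t$$ grows, $$S(t)$$ levels off at the cured fraction $$p$$, so $$H(t)$$ plateaus at $$-\log{p}$$ rather than growing without bound.

```python
import numpy as np

p = 0.3                    # hypothetical cured fraction
lambda_, rho_ = 10.0, 1.5  # hypothetical Weibull parameters

t = np.linspace(0, 200, 1000)
sw = np.exp(-(t / lambda_) ** rho_)  # Weibull survival for the susceptible
S = p + (1 - p) * sw                 # mixture survival S(t)
H = -np.log(S)                       # cumulative hazard H(t)

# S(t) levels off at the cured fraction p, so H(t) plateaus at -log(p)
print(S[-1])  # ~0.3
print(H[-1])  # ~1.204 = -log(0.3)
```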

from autograd.scipy.special import expit

class CureModel(ParametricRegressionFitter):
    _scipy_fit_method = "SLSQP"
    _scipy_fit_options = {"ftol": 1e-10, "maxiter": 200}

    _fitted_parameter_names = ["lambda_", "beta_", "rho_"]

    def _cumulative_hazard(self, params, T, Xs):
        c = expit(np.dot(Xs["beta_"], params["beta_"]))

        lambda_ = np.exp(np.dot(Xs["lambda_"], params["lambda_"]))
        rho_ = np.exp(np.dot(Xs["rho_"], params["rho_"]))
        sf = np.exp(-(T / lambda_) ** rho_)

        return -np.log((1 - c) + c * sf)

cm = CureModel(penalizer=0.0)

rossi["intercept"] = 1.0

covariates = {"lambda_": rossi.columns, "rho_": ["intercept"], "beta_": ['intercept', 'fin']}

cm.fit(rossi, "week", event_col="arrest", regressors=covariates, timeline=np.arange(250))
cm.print_summary(2)

"""
<lifelines.CureModel: fitted with 432 observations, 318 censored>
duration col = 'week'
event col = 'arrest'
number of subjects = 432
number of events = 114
log-likelihood = -679.51
time fit was run = 2020-01-21 22:14:20 UTC

---
                   coef  exp(coef)  se(coef)  coef lower 95%  coef upper 95%  exp(coef) lower 95%  exp(coef) upper 95%      z       p  -log2(p)
lambda_ fin        0.15       1.16      1.09           -1.99            2.29                 0.14                 9.89   0.14    0.89      0.17
        age        0.03       1.04      0.16           -0.28            0.35                 0.76                 1.42   0.22    0.83      0.27
        race      -0.28       0.75      0.58           -1.43            0.86                 0.24                 2.37  -0.49    0.63      0.67
        wexp       0.20       1.23      0.20           -0.19            0.60                 0.82                 1.82   1.01    0.31      1.67
        mar        0.34       1.41      0.23           -0.10            0.79                 0.90                 2.20   1.50    0.13      2.92
        paro       0.03       1.04      0.18           -0.32            0.39                 0.73                 1.47   0.19    0.85      0.24
        prio      -0.08       0.93      0.03           -0.13           -0.02                 0.87                 0.98  -2.52    0.01      6.42
        intercept  3.76      43.15      0.07            3.62            3.91                37.27                49.95  50.39  <0.005       inf
rho_    intercept  0.43       1.54      0.07            0.30            0.56                 1.35                 1.75   6.37  <0.005     32.28
beta_   fin       -0.41       0.66      0.98           -2.34            1.52                 0.10                 4.56  -0.42    0.68      0.56
        intercept  0.70       2.02      0.12            0.47            0.93                 1.60                 2.54   5.95  <0.005     28.50
---
Log-likelihood ratio test = 34.22 on 9 df, -log2(p)=13.58
"""
:

cm.predict_survival_function(rossi.loc[::100]).plot(figsize=(12,6))


# what's the effect on the survival curve if I vary "age"
fig, ax = plt.subplots(figsize=(12, 6))

cm.plot_covariate_groups(['age'], values=np.arange(20, 50, 5), cmap='coolwarm', ax=ax)



### Spline models¶

See royston_parmar_splines.py in the examples folder: https://github.com/CamDavidsonPilon/lifelines/tree/master/examples

## Compatibility with scikit-learn¶

New to lifelines in version 0.21.3 is a wrapper that allows you to use lifelines’ regression models with scikit-learn’s APIs.

Note

The API and functionality are still experimental. Please report any bugs or feature requests on our GitHub issue list.

from lifelines.utils.sklearn_adapter import sklearn_adapter

from lifelines import CoxPHFitter
from lifelines.datasets import load_rossi

X = load_rossi().drop('week', axis=1)  # keep as a dataframe
Y = load_rossi().pop('week')

# CoxRegression is a class like the LinearRegression class or SVC class in scikit-learn
CoxRegression = sklearn_adapter(CoxPHFitter, event_col='arrest')

sk_cph = CoxRegression(penalizer=1.0)
sk_cph.fit(X, Y)
print(sk_cph)

"""
SkLearnCoxPHFitter(alpha=0.05, penalizer=1.0, strata=None, tie_method='Efron')
"""

sk_cph.predict(X)
sk_cph.score(X, Y)


Note

The X variable still needs to be a DataFrame, and should contain the event-occurred column (event_col) if it exists.

If needed, the original lifelines instance is available as the lifelines_model attribute.

sk_cph.lifelines_model.print_summary()


The wrapped classes can even be used in more complex scikit-learn functions (ex: cross_val_score) and classes (ex: GridSearchCV):

from lifelines import WeibullAFTFitter
from sklearn.model_selection import cross_val_score

base_class = sklearn_adapter(WeibullAFTFitter, event_col='arrest')
wf = base_class()

scores = cross_val_score(wf, X, Y, cv=5)
print(scores)

"""
[0.59037328 0.503427   0.55454545 0.59689534 0.62311068]
"""

from sklearn.model_selection import GridSearchCV

clf = GridSearchCV(wf, {
"penalizer": 10.0 ** np.arange(-2, 3),
"l1_ratio": [0, 1/3, 2/3],
"model_ancillary": [True, False],
}, cv=4)
clf.fit(X, Y)

print(clf.best_estimator_)

"""
SkLearnWeibullAFTFitter(alpha=0.05, fit_intercept=True,
l1_ratio=0.66666, model_ancillary=True,
penalizer=0.01)

"""


Note

The lifelines.utils.sklearn_adapter() is currently only designed to work with right-censored data.

## Time varying survival regression¶

### Cox’s time varying proportional hazard model¶

Often an individual will have a covariate change over time. An example of this is hospital patients who enter the study and, at some future time, may receive a heart transplant. We would like to know the effect of the transplant, but we must be careful if we condition on whether they received the transplant. Consider that if patients needed to wait at least 1 year before getting a transplant, then everyone who dies before that year is considered a non-transplant patient, and hence this would overestimate the hazard of not receiving a transplant.

We can incorporate changes over time into our survival analysis by using a modification of the Cox model above. The general mathematical description is:

$h(t | x) = \overbrace{b_0(t)}^{\text{baseline}}\underbrace{\exp \overbrace{\left(\sum_{i=1}^n \beta_i (x_i(t) - \overline{x_i}) \right)}^{\text{log-partial hazard}}}_ {\text{partial hazard}}$

Note the time-varying $$x_i(t)$$ to denote that covariates can change over time. This model is implemented in lifelines as CoxTimeVaryingFitter. The dataset schema required is different than previous models, so we will spend some time describing this.

#### Dataset creation for time-varying regression¶

lifelines requires that the dataset be in what is called the long format. This looks like one row per state change, including an ID, the left (exclusive) time point, and right (inclusive) time point. For example, the following dataset tracks three unique subjects.

id start stop group z event
1 0 8 1 0 False
2 0 5 0 0 False
2 5 8 0 1 True
3 0 3 1 0 False
3 3 12 1 1 True

In the above dataset, start and stop denote the boundaries of each period, id is the unique identifier per subject, and event denotes whether the subject died at the end of that period. For example, subject ID 2 had the variable z=0 up to and including the end of time period 5 (we can think of measurements as happening at the end of a period), after which it was set to 1. Since event is True in that row, we conclude that the subject died at time 8.
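The schema above can be constructed directly as a pandas DataFrame. A quick sketch reproducing the example table, with sanity checks that mirror the requirements of the long format:

```python
import pandas as pd

# the example long-format dataset from above: one row per (id, interval)
df = pd.DataFrame({
    "id":    [1, 2, 2, 3, 3],
    "start": [0, 0, 5, 0, 3],
    "stop":  [8, 5, 8, 3, 12],
    "group": [1, 0, 0, 1, 1],
    "z":     [0, 0, 1, 0, 1],
    "event": [False, False, True, False, True],
})

# each subject's intervals start at 0 and every interval is non-degenerate
assert (df.groupby("id")["start"].min() == 0).all()
assert (df["start"] < df["stop"]).all()

print(df.groupby("id")["event"].any())  # which subjects experienced the event
```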

This desired dataset can be built up from smaller datasets. To do this we can use some helper functions provided in lifelines. Typically, data will be in a format that looks like it comes out of a relational database. You may have a “base” table with ids, durations alive, and a censored flag, and possibly static covariates. Ex:

id duration event var1
1 10 True 0.1
2 12 False 0.5

We will perform a light transform to this dataset to modify it into the “long” format.

from lifelines.utils import to_long_format

base_df = to_long_format(base_df, duration_col="duration")


The new dataset looks like:

id start stop var1 event
1 0 10 0.1 True
2 0 12 0.5 False

You’ll also have a secondary dataset that references future measurements. This could come in two “types”. The first is when you have a variable that changes over time (ex: administering varying medication over time, or taking a temperature over time). The second type is an event-based dataset: an event happens at some time in the future (ex: an organ transplant occurs, or an intervention). We will address this second type later. The first type of dataset may look something like:

id time var2
1 0 1.4
1 4 1.2
1 8 1.5
2 0 1.6

where time is the duration from the entry event. Here we see subject 1 had a change in their var2 covariate at the end of time 4 and at the end of time 8. We can use lifelines.utils.add_covariate_to_timeline() to fold the covariate dataset into the original dataset.

from lifelines.utils import add_covariate_to_timeline

df = add_covariate_to_timeline(base_df, cv, duration_col="time", id_col="id", event_col="event")

id start stop var1 var2 event
1 0 4 0.1 1.4 False
1 4 8 0.1 1.2 False
1 8 10 0.1 1.5 True
2 0 12 0.5 1.6 False

From the above output, we can see that subject 1 changed state twice over the observation period, finally expiring at the end of time 10. Subject 2 was a censored case, and we lost track of them after time 12.

You may have multiple covariates you wish to add, so the above could be streamlined like so:

from lifelines.utils import add_covariate_to_timeline

# cv1, cv2, ... are separate covariate DataFrames
df = base_df.pipe(add_covariate_to_timeline, cv1, duration_col="time", id_col="id", event_col="event")\
            .pipe(add_covariate_to_timeline, cv2, duration_col="time", id_col="id", event_col="event")


If your dataset is of the second type, that is, event-based, it may look something like the following, where values in the matrix denote times since the subject’s birth, and None or NaN represents the event not happening (subjects for whom the event never occurred can also be excluded):

print(event_df)

id    E1
0   1     1.0
1   2     NaN
2   3     3.0
...


Initially, this can’t be added to our baseline DataFrame. However, using lifelines.utils.covariates_from_event_matrix() we can convert a DataFrame like this into one that can be easily added.

from lifelines.utils import covariates_from_event_matrix

cv = covariates_from_event_matrix(event_df, id_col="id")
print(cv)

event  id  duration  E1
0       1       1.0   1
1       3       3.0   1
...

base_df = add_covariate_to_timeline(base_df, cv, duration_col="time", id_col="id", event_col="E")


For an example of pulling datasets like this from a SQL-store, and other helper functions, see Example SQL queries and transformations to get time varying data.

#### Cumulative sums¶

One additional flag on add_covariate_to_timeline() that is of interest is the cumulative_sum flag. By default it is False, but setting it to True will perform a cumulative sum on the covariate before joining. This is useful if the covariates describe an incremental change, instead of a state update. For example, we may have measurements of drugs administered to a patient, and we want the covariate to reflect how much has been administered since the start. Cumulatively summing event columns can make sense as well. In contrast, a covariate measuring the temperature of the patient is a state update, and should not be summed. See Example cumulative sums over time-varying covariates to see an example of this.
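To illustrate the idea with plain pandas (the dose data here are made up): cumulative summing is appropriate when each row records an increment, and the per-id running total is conceptually what cumulative_sum=True folds into the timeline.

```python
import pandas as pd

# hypothetical incremental measurements: each row is a dose administered
# at that time, not a running total
cv = pd.DataFrame({"id":   [1, 1, 1, 2],
                   "time": [0, 4, 8, 0],
                   "dose": [10, 5, 5, 20]})

# conceptually what cumulative_sum=True does: a per-subject running total
cv["cumsum_dose"] = cv.groupby("id")["dose"].cumsum()
print(cv["cumsum_dose"].tolist())  # [10, 15, 20, 20]
```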

#### Delaying time-varying covariates¶

add_covariate_to_timeline() also has an option for delaying, or shifting, a covariate so it changes later than originally observed. Why would one want to delay a time-varying covariate? Here’s an example. Consider investigating the impact of smoking on mortality, where the available data are time-varying observations of how many cigarettes are consumed each month. Unbeknownst to us, when a subject reaches critical illness levels, they are admitted to the hospital and their cigarette consumption drops to zero. Some expire while in hospital. If we used this dataset naively, we would see that not smoking leads to sudden death, and conversely, smoking helps your health! This is a case of reverse causation: the upcoming death event actually influences the covariates.

To handle this, you can delay the observations by time periods:

from lifelines.utils import add_covariate_to_timeline

base_df = add_covariate_to_timeline(base_df, cv, duration_col="time", id_col="id", event_col="E", delay=14)


#### Fitting the model¶

Once your dataset is in the correct orientation, we can use CoxTimeVaryingFitter to fit the model to your data. The method is similar to CoxPHFitter, except we need to tell fit() about the additional time columns.

Fitting the Cox model to the data involves using gradient descent. lifelines takes extra effort to help with convergence, so please be attentive to any warnings that appear. Fixing any warnings will generally help convergence. For further help, see Problems with convergence in the Cox proportional hazard model.

from lifelines import CoxTimeVaryingFitter

ctv = CoxTimeVaryingFitter()
ctv.fit(df, id_col="id", event_col="event", start_col="start", stop_col="stop", show_progress=True)
ctv.print_summary()
ctv.plot()


#### Short note on prediction¶

Unlike the other regression models, prediction in a time-varying setting is not trivial. To predict, we would need to know the covariate values beyond the observed times, but if we knew those, we would also know if the subject was still alive or not! However, it is still possible to compute the hazard values of subjects at known observations, the baseline cumulative hazard rate, and the baseline survival function. So while CoxTimeVaryingFitter exposes prediction methods, there are logical limitations to what these predictions mean.

## Testing the proportional hazard assumptions¶

This Jupyter notebook is a small tutorial on how to test and fix proportional hazard problems.

%matplotlib inline
%config InlineBackend.figure_format = 'retina'

from matplotlib import pyplot as plt
from lifelines import CoxPHFitter
import numpy as np
import pandas as pd

The proportional hazard assumption is that all individuals have the same hazard function, up to a unique scaling factor in front. So the shape of the hazard function is the same for all individuals, and only a scalar in front changes.

$h_i(t) = a_i h(t)$

At the core of the assumption is that $$a_i$$ is not time varying, that is, $$a_i(t) = a_i$$. Furthermore, if we take the ratio of this with another subject (called the hazard ratio):

$\frac{h_i(t)}{h_j(t)} = \frac{a_i h(t)}{a_j h(t)} = \frac{a_i}{a_j}$

is constant for all $$t$$. In this tutorial we will test this non-time varying assumption, and look at ways to handle violations.
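A quick numerical sanity check of this constancy (pure NumPy, with an arbitrary baseline hazard shape):

```python
import numpy as np

t = np.linspace(0.1, 10, 50)

def h(t):
    # an arbitrary baseline hazard shape, shared by all subjects
    return 0.5 * np.sqrt(t)

a_i, a_j = 2.0, 0.5  # subject-specific scale factors

# under proportional hazards, the hazard ratio does not depend on t
ratio = (a_i * h(t)) / (a_j * h(t))
print(ratio[:3])  # every entry equals a_i / a_j = 4.0
```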

from lifelines.datasets import load_rossi

rossi = load_rossi()
cph = CoxPHFitter()
cph.fit(rossi, 'week', 'arrest')


<lifelines.CoxPHFitter:"None", fitted with 432 total observations, 318 right-censored observations>


cph.print_summary(model="untransformed variables", decimals=3)

model                      lifelines.CoxPHFitter
duration col               'week'
event col                  'arrest'
number of observations     432
number of events observed  114
partial log-likelihood     -658.748
time fit was run           2019-11-17 14:00:03 UTC
model description          untransformed variables
coef exp(coef) se(coef) coef lower 95% coef upper 95% exp(coef) lower 95% exp(coef) upper 95% z p -log2(p)
fin -0.379 0.684 0.191 -0.755 -0.004 0.470 0.996 -1.983 0.047 4.398
age -0.057 0.944 0.022 -0.101 -0.014 0.904 0.986 -2.611 0.009 6.791
race 0.314 1.369 0.308 -0.290 0.918 0.748 2.503 1.019 0.308 1.698
wexp -0.150 0.861 0.212 -0.566 0.266 0.568 1.305 -0.706 0.480 1.058
mar -0.434 0.648 0.382 -1.182 0.315 0.307 1.370 -1.136 0.256 1.965
paro -0.085 0.919 0.196 -0.469 0.299 0.626 1.348 -0.434 0.665 0.589
prio 0.091 1.096 0.029 0.035 0.148 1.036 1.159 3.194 0.001 9.476

### Checking assumptions with check_assumptions¶

New to lifelines 0.16.0 is the CoxPHFitter.check_assumptions method. This method will compute statistics that check the proportional hazard assumption, produce plots to check assumptions, and more. Also included is an option to display advice to the console. Here’s a breakdown of the information displayed:

• Presented first are the results of a statistical test for any time-varying coefficients. A time-varying coefficient implies that a covariate’s influence relative to the baseline changes over time, which violates the proportional hazard assumption. For each variable, we transform time four times (these are common transformations of time to perform). If lifelines rejects the null (that is, lifelines rejects that the coefficient is not time-varying), we report this to the user.
• Some advice is presented on how to correct the proportional hazard violation based on some summary statistics of the variable.
• As a complement to the above statistical test, for each variable that violates the PH assumption, visual plots of the scaled Schoenfeld residuals are presented against the four time transformations. A fitted lowess is also presented, along with 10 bootstrapped lowess lines (as an approximation to the confidence interval of the original lowess line). Ideally, this lowess line is constant (flat). Deviations away from the constant line are violations of the PH assumption.

#### Why the scaled Schoenfeld residuals?¶

This section can be skipped on first read. Let $$s_{t,j}$$ denote the scaled Schoenfeld residuals of variable $$j$$ at time $$t$$, $$\hat{\beta_j}$$ denote the maximum-likelihood estimate of the $$j$$th variable, and $$\beta_j(t)$$ a time-varying coefficient in a (fictional) alternative model that allows for time-varying coefficients. Therneau and Grambsch showed that

$E[s_{t,j}] + \hat{\beta_j} = \beta_j(t)$

The proportional hazard assumption implies that $$\hat{\beta_j} = \beta_j(t)$$, hence $$E[s_{t,j}] = 0$$. This is what the above proportional hazard test is testing. Visually, plotting $$s_{t,j}$$ over time (or some transform of time) is a good way to see violations of $$E[s_{t,j}] = 0$$, along with the statistical test.


cph.check_assumptions(rossi, p_value_threshold=0.05, show_plots=True)

The p_value_threshold is set at 0.05. Even under the null hypothesis of no violations, some
covariates will be below the threshold by chance. This is compounded when there are many covariates.
Similarly, when there are lots of observations, even minor deviances from the proportional hazard
assumption will be flagged.

With that in mind, it's best to use a combination of statistical tests and visual tests to determine
the most serious violations. Produce visual plots using check_assumptions(..., show_plots=True)
and looking for non-constant lines. See link [A] below for a full example.

<lifelines.StatisticalResult>
test_name = proportional_hazard_test
null_distribution = chi squared
degrees_of_freedom = 1

---
               test_statistic      p  -log2(p)
age  km                 11.03 <0.005     10.12
     rank               11.09 <0.005     10.17
fin  km                  0.02   0.89      0.17
     rank                0.02   0.90      0.16
mar  km                  0.60   0.44      1.19
     rank                0.67   0.41      1.27
paro km                  0.12   0.73      0.45
     rank                0.14   0.71      0.49
prio km                  0.02   0.88      0.18
     rank                0.02   0.88      0.18
race km                  1.44   0.23      2.12
     rank                1.46   0.23      2.14
wexp km                  7.48   0.01      7.32
     rank                7.18   0.01      7.08

1. Variable 'age' failed the non-proportional test: p-value is 0.0009.

Advice 1: the functional form of the variable 'age' might be incorrect. That is, there may be
non-linear terms missing. The proportional hazard test used is very sensitive to incorrect
functional forms. See documentation in link [D] below on how to specify a functional form.

Advice 2: try binning the variable 'age' using pd.cut, and then specify it in strata=['age',
...] in the call in .fit. See documentation in link [B] below.


2. Variable 'wexp' failed the non-proportional test: p-value is 0.0063.

Advice: with so few unique values (only 2), you can include strata=['wexp', ...] in the call in
.fit. See documentation in link [E] below.

---



Alternatively, you can use the proportional hazard test outside of check_assumptions:


from lifelines.statistics import proportional_hazard_test

results = proportional_hazard_test(cph, rossi, time_transform='rank')
results.print_summary(decimals=3, model="untransformed variables")

<lifelines.StatisticalResult>
test_name = proportional_hazard_test
time_transform = rank
null_distribution = chi squared
degrees_of_freedom = 1
model = untransformed variables

---
test_statistic     p  -log2(p)
age           11.094 0.001    10.173
fin            0.017 0.896     0.158
mar            0.666 0.414     1.271
paro           0.138 0.711     0.493
prio           0.023 0.881     0.183
race           1.462 0.227     2.141
wexp           7.180 0.007     7.084


#### Stratification¶

In the advice above, we can see that wexp has small cardinality, so we can easily fix that by specifying it in the strata. What does the strata do? Let’s go back to the proportional hazard assumption.

In the introduction, we said that the proportional hazard assumption was that

$h_i(t) = a_i h(t)$

In a simple case, it may be that there are two subgroups that have very different baseline hazards. That is, we can split the dataset into subsamples based on some variable (we call this the stratifying variable), run the Cox model on all subsamples, and compare their baseline hazards. If these baseline hazards are very different, then clearly the formula above is wrong - the $$h(t)$$ is some weighted average of the subgroups’ baseline hazards. This ill-fitting average baseline can cause $$a_i$$ to have time-dependent influence. A better model might be:

$h_{i |i\in G}(t) = a_i h_G(t)$

where now we have a unique baseline hazard per subgroup $$G$$. Because of the way the Cox model is designed, inference of the coefficients is identical (except now there are more baseline hazards, and no variation of the stratifying variable within a subgroup $$G$$).


cph.fit(rossi, 'week', 'arrest', strata=['wexp'])
cph.print_summary(model="wexp in strata")

model                      lifelines.CoxPHFitter
duration col               'week'
event col                  'arrest'
strata                     [wexp]
number of observations     432
number of events observed  114
partial log-likelihood     -580.89
time fit was run           2019-11-17 14:00:08 UTC
model description          wexp in strata
coef exp(coef) se(coef) coef lower 95% coef upper 95% exp(coef) lower 95% exp(coef) upper 95% z p -log2(p)
fin -0.38 0.68 0.19 -0.76 -0.01 0.47 0.99 -1.99 0.05 4.42
age -0.06 0.94 0.02 -0.10 -0.01 0.90 0.99 -2.64 0.01 6.91
race 0.31 1.36 0.31 -0.30 0.91 0.74 2.49 1.00 0.32 1.65
mar -0.45 0.64 0.38 -1.20 0.29 0.30 1.34 -1.19 0.23 2.09
paro -0.08 0.92 0.20 -0.47 0.30 0.63 1.35 -0.42 0.67 0.57
prio 0.09 1.09 0.03 0.03 0.15 1.04 1.16 3.16 <0.005 9.33

cph.check_assumptions(rossi, show_plots=True)

The p_value_threshold is set at 0.01. Even under the null hypothesis of no violations, some
covariates will be below the threshold by chance. This is compounded when there are many covariates.
Similarly, when there are lots of observations, even minor deviances from the proportional hazard
assumption will be flagged.

With that in mind, it's best to use a combination of statistical tests and visual tests to determine
the most serious violations. Produce visual plots using check_assumptions(..., show_plots=True)
and looking for non-constant lines. See link [A] below for a full example.

<lifelines.StatisticalResult>
test_name = proportional_hazard_test
null_distribution = chi squared
degrees_of_freedom = 1

---
               test_statistic      p  -log2(p)
age  km                 11.29 <0.005     10.32
     rank                4.62   0.03      4.99
fin  km                  0.02   0.90      0.16
     rank                0.05   0.83      0.28
mar  km                  0.53   0.47      1.10
     rank                1.31   0.25      1.99
paro km                  0.09   0.76      0.40
     rank                0.00   0.97      0.05
prio km                  0.02   0.89      0.16
     rank                0.02   0.90      0.16
race km                  1.47   0.23      2.15
     rank                0.64   0.42      1.23

1. Variable 'age' failed the non-proportional test: p-value is 0.0008.

Advice 1: the functional form of the variable 'age' might be incorrect. That is, there may be
non-linear terms missing. The proportional hazard test used is very sensitive to incorrect
functional forms. See documentation in link [D] below on how to specify a functional form.

Advice 2: try binning the variable 'age' using pd.cut, and then specify it in strata=['age',
...] in the call in .fit. See documentation in link [B] below.


---



Since age is still violating the proportional hazard assumption, we need to model it better. From the residual plots above, we can see the effect of age starts to become negative over time. This will be relevant later. Below, we present three options to handle age.

#### Modify the functional form¶

The proportional hazard test is very sensitive (i.e. lots of false positives) when the functional form of a variable is incorrect. For example, if the association between a covariate and the log-hazard is non-linear, but the model has only a linear term included, then the proportional hazard test can raise a false positive.

rossi['age**2'] = (rossi['age'] - rossi['age'].mean())**2
rossi['age**3'] = (rossi['age'] - rossi['age'].mean())**3


However, a more principled way to include non-linear terms is to use splines. Both Patsy and zEpid provide functionality for splines (tutorial incoming), but let’s stick with the form above.

rossi_higher_order_age = rossi.copy()
# center age, then add quadratic and cubic terms of the centered variable
rossi_higher_order_age['age'] = rossi_higher_order_age['age'] - rossi_higher_order_age['age'].mean()
rossi_higher_order_age['age**2'] = rossi_higher_order_age['age'] ** 2
rossi_higher_order_age['age**3'] = rossi_higher_order_age['age'] ** 3

cph.fit(rossi_higher_order_age, 'week', 'arrest', strata=['wexp'])
cph.print_summary(model="quad and cubic age terms"); print()
cph.check_assumptions(rossi_higher_order_age, show_plots=True, p_value_threshold=0.05)

model                      lifelines.CoxPHFitter
duration col               'week'
event col                  'arrest'
strata                     [wexp]
number of observations     432
number of events observed  114
partial log-likelihood     -579.37
time fit was run           2019-11-17 14:00:10 UTC
model description          quad and cubic age terms
coef exp(coef) se(coef) coef lower 95% coef upper 95% exp(coef) lower 95% exp(coef) upper 95% z p -log2(p)
fin -0.37 0.69 0.19 -0.75 0.00 0.47 1.00 -1.93 0.05 4.24
age -0.06 0.94 0.03 -0.13 0.00 0.88 1.00 -1.85 0.06 3.95
race 0.35 1.42 0.31 -0.26 0.95 0.77 2.60 1.13 0.26 1.95
mar -0.39 0.68 0.38 -1.15 0.36 0.32 1.43 -1.02 0.31 1.70
paro -0.10 0.90 0.20 -0.49 0.28 0.61 1.33 -0.52 0.60 0.74
prio 0.09 1.10 0.03 0.04 0.15 1.04 1.16 3.22 <0.005 9.59
age**2 0.01 1.01 0.00 -0.00 0.02 1.00 1.02 1.57 0.12 3.09
age**3 -0.00 1.00 0.00 -0.00 0.00 1.00 1.00 -0.89 0.37 1.42

The p_value_threshold is set at 0.05. Even under the null hypothesis of no violations, some
covariates will be below the threshold by chance. This is compounded when there are many covariates.
Similarly, when there are lots of observations, even minor deviances from the proportional hazard
assumption will be flagged.

With that in mind, it's best to use a combination of statistical tests and visual tests to determine
the most serious violations. Produce visual plots using check_assumptions(..., show_plots=True)
and looking for non-constant lines. See link [A] below for a full example.

<lifelines.StatisticalResult>
test_name = proportional_hazard_test
null_distribution = chi squared
degrees_of_freedom = 1

---
                 test_statistic    p  -log2(p)
age    km                  0.96 0.33      1.62
       rank                4.09 0.04      4.54
age**2 km                  1.81 0.18      2.48
       rank                0.79 0.37      1.42
age**3 km                  2.33 0.13      2.98
       rank                0.03 0.87      0.19
fin    km                  0.03 0.87      0.20
       rank                0.02 0.90      0.15
mar    km                  0.53 0.47      1.10
       rank                0.94 0.33      1.59
paro   km                  0.20 0.66      0.60
       rank                0.01 0.93      0.10
prio   km                  0.02 0.88      0.19
       rank                0.01 0.90      0.15
race   km                  1.28 0.26      1.96
       rank                0.47 0.49      1.02

1. Variable 'age' failed the non-proportional test: p-value is 0.0431.

Advice 1: the functional form of the variable 'age' might be incorrect. That is, there may be
non-linear terms missing. The proportional hazard test used is very sensitive to incorrect
functional forms. See documentation in link [D] below on how to specify a functional form.

Advice 2: try binning the variable 'age' using pd.cut, and then specify it in strata=['age',
...] in the call in .fit. See documentation in link [B] below.


---



We see we still have potentially some violation, but it’s a lot less. Also, interestingly, when we include these non-linear terms for age, the wexp proportionality violation disappears. It is not uncommon to see that changing the functional form of one variable affects other variables’ proportional tests, usually positively. So, we could remove the strata=['wexp'] if we wished.

#### Bin variable and stratify on it¶

The second option proposed is to bin the variable into equal-sized bins, and stratify like we did with wexp. There is a trade-off here between estimation and information loss. If we have large bins, we will lose information (since different values are now binned together), but we need to estimate fewer new baseline hazards. On the other hand, with tiny bins, we allow the age data to have the most “wiggle room”, but must compute many baseline hazards, each of which has a smaller sample size. Like most things, the optimal value is somewhere in between.


rossi_strata_age = rossi.copy()
rossi_strata_age['age_strata'] = pd.cut(rossi_strata_age['age'], np.arange(0, 80, 3))



age age_strata
0 27 (24, 27]
1 18 (15, 18]
2 19 (18, 21]
3 23 (21, 24]
4 19 (18, 21]
# drop the original, redundant, age column
rossi_strata_age = rossi_strata_age.drop('age', axis=1)
cph.fit(rossi_strata_age, 'week', 'arrest', strata=['age_strata', 'wexp'])


<lifelines.CoxPHFitter:"None", fitted with 432 total observations, 318 right-censored observations>


cph.print_summary(3, model="stratified age and wexp")
cph.plot()

model                      lifelines.CoxPHFitter
duration col               'week'
event col                  'arrest'
strata                     [age_strata, wexp]
number of observations     432
number of events observed  114
partial log-likelihood     -392.443
time fit was run           2019-11-17 14:00:13 UTC
model description          stratified age and wexp
coef exp(coef) se(coef) coef lower 95% coef upper 95% exp(coef) lower 95% exp(coef) upper 95% z p -log2(p)
fin -0.395 0.674 0.197 -0.781 -0.009 0.458 0.991 -2.004 0.045 4.472
race 0.280 1.324 0.313 -0.334 0.895 0.716 2.447 0.895 0.371 1.431
mar -0.194 0.824 0.392 -0.961 0.574 0.382 1.776 -0.494 0.621 0.687
paro -0.163 0.849 0.200 -0.555 0.228 0.574 1.256 -0.818 0.413 1.275
prio 0.080 1.084 0.028 0.025 0.135 1.025 1.145 2.854 0.004 7.857

<matplotlib.axes._subplots.AxesSubplot at 0x105873d30>


cph.check_assumptions(rossi_strata_age)

Proportional hazard assumption looks okay.


#### Introduce time-varying covariates¶

Our third option to correct variables that violate the proportional hazard assumption is to model the time-varying component directly. This is done in two steps. The first is to transform your dataset into episodic format. This means that we split a subject from a single row into $$n$$ new rows, where each new row represents some time period for the subject. It’s okay that the variables are static over these new time periods - we’ll introduce some time-varying covariates later.

See below for how to do this in lifelines:


from lifelines.utils import to_episodic_format

# the time_gaps parameter specifies how large or small you want the periods to be.
rossi_long = to_episodic_format(rossi, duration_col='week', event_col='arrest', time_gaps=1.)



stop start arrest age fin id mar paro prio race wexp
0 1.0 0.0 0 27 0 0 0 1 3 1 0
1 2.0 1.0 0 27 0 0 0 1 3 1 0
2 3.0 2.0 0 27 0 0 0 1 3 1 0
3 4.0 3.0 0 27 0 0 0 1 3 1 0
4 5.0 4.0 0 27 0 0 0 1 3 1 0
5 6.0 5.0 0 27 0 0 0 1 3 1 0
6 7.0 6.0 0 27 0 0 0 1 3 1 0
7 8.0 7.0 0 27 0 0 0 1 3 1 0
8 9.0 8.0 0 27 0 0 0 1 3 1 0
9 10.0 9.0 0 27 0 0 0 1 3 1 0
10 11.0 10.0 0 27 0 0 0 1 3 1 0
11 12.0 11.0 0 27 0 0 0 1 3 1 0
12 13.0 12.0 0 27 0 0 0 1 3 1 0
13 14.0 13.0 0 27 0 0 0 1 3 1 0
14 15.0 14.0 0 27 0 0 0 1 3 1 0
15 16.0 15.0 0 27 0 0 0 1 3 1 0
16 17.0 16.0 0 27 0 0 0 1 3 1 0
17 18.0 17.0 0 27 0 0 0 1 3 1 0
18 19.0 18.0 0 27 0 0 0 1 3 1 0
19 20.0 19.0 1 27 0 0 0 1 3 1 0
20 1.0 0.0 0 18 0 1 0 1 8 1 0
21 2.0 1.0 0 18 0 1 0 1 8 1 0
22 3.0 2.0 0 18 0 1 0 1 8 1 0
23 4.0 3.0 0 18 0 1 0 1 8 1 0
24 5.0 4.0 0 18 0 1 0 1 8 1 0

Each subject is given a new id (an existing id column can be used instead if one is already present in the DataFrame). This id is used to track subjects over time. Notice the arrest col is 0 for all periods prior to their (possible) event as well.

Above I mentioned there were two steps to correct age. The first was to convert to an episodic format. The second is to create an interaction term between age and stop. This is a time-varying variable.

Since we are now working with an episodic dataset, we must use CoxTimeVaryingFitter instead of CoxPHFitter.


rossi_long['time*age'] = rossi_long['age'] * rossi_long['stop']

from lifelines import CoxTimeVaryingFitter
ctv = CoxTimeVaryingFitter()

ctv.fit(rossi_long,
        id_col='id',
        event_col='arrest',
        start_col='start',
        stop_col='stop',
        strata=['wexp'])


<lifelines.CoxTimeVaryingFitter: fitted with 19809 periods, 432 subjects, 114 events>


ctv.print_summary(3, model="age * time interaction")

model                      lifelines.CoxTimeVaryingFitter
event col                  'arrest'
strata                     [wexp]
number of subjects         432
number of periods          19809
number of events           114
partial log-likelihood     -575.080
time fit was run           2019-11-17 14:00:20 UTC
model description          age * time interaction
coef exp(coef) se(coef) coef lower 95% coef upper 95% exp(coef) lower 95% exp(coef) upper 95% z p -log2(p)
age 0.073 1.075 0.040 -0.005 0.151 0.995 1.163 1.830 0.067 3.893
fin -0.386 0.680 0.191 -0.760 -0.011 0.468 0.989 -2.018 0.044 4.520
mar -0.397 0.672 0.382 -1.147 0.352 0.318 1.422 -1.039 0.299 1.743
paro -0.098 0.907 0.196 -0.481 0.285 0.618 1.330 -0.501 0.616 0.698
prio 0.090 1.094 0.029 0.034 0.146 1.035 1.158 3.152 0.002 9.267
race 0.295 1.343 0.308 -0.310 0.899 0.733 2.458 0.955 0.340 1.558
time*age -0.005 0.995 0.002 -0.008 -0.002 0.992 0.998 -3.337 0.001 10.203

ctv.plot()


<matplotlib.axes._subplots.AxesSubplot at 0x1064d6e80>


In the above scaled Schoenfeld residual plots for age, we can see there is a slight negative effect for higher time values. This is confirmed in the output of the CoxTimeVaryingFitter: we see that the coefficient for time*age is -0.005.

#### Conclusion¶

The point estimates and the standard errors are very close to each other using either option, so we can feel confident that either approach is okay to proceed with.

## More examples and recipes¶

This section goes through some examples and recipes to help you use lifelines.

### Worked Examples¶

If you are looking for some full examples of lifelines, there are full Jupyter notebooks and scripts here and examples and ideas on the development blog.

### Statistically compare two populations¶

Often researchers want to compare survival-ness between different populations. Here are some techniques to do that:

#### Logrank test¶

Note

The logrank test has maximum power when the assumption of proportional hazards is true. As a consequence, if the survival functions cross, the logrank test will give an inaccurate assessment of differences.

The lifelines.statistics.logrank_test() function compares whether the “death” generation processes of two populations are equal:

from lifelines.statistics import logrank_test

results = logrank_test(T1, T2, event_observed_A=E1, event_observed_B=E2)
results.print_summary()

"""
t_0 = -1
alpha = 0.95
null_distribution = chi squared
df = 1
use_bonferroni = True

---
   test_statistic        p
            3.528  0.00034  **

"""

print(results.p_value)        # 0.00034
print(results.test_statistic) # 3.528


If you have more than two populations, you can use pairwise_logrank_test() (which compares each pair in the same manner as above), or multivariate_logrank_test() (which tests the hypothesis that all the populations have the same “death” generation process).

from lifelines.statistics import multivariate_logrank_test

df = pd.DataFrame({
    'durations': [5, 3, 9, 8, 7, 4, 4, 3, 2, 5, 6, 7],
    'groups': [0, 0, 0, 0, 1, 1, 1, 1, 1, 2, 2, 2],  # could be strings too
    'events': [1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0],
})

results = multivariate_logrank_test(df['durations'], df['groups'], df['events'])
results.print_summary()

"""
t_0 = -1
alpha = 0.95
null_distribution = chi squared
df = 2

---
test_statistic      p
1.0800 0.5827
---
"""


#### Survival differences at a point in time¶

Often analysts want to compare the survival-ness of groups at specific times, rather than comparing the entire survival curves against each other. For example, analysts may be interested in 5-year survival. Statistically comparing the naive Kaplan-Meier points at a specific time actually has reduced power. By transforming the Kaplan-Meier curve, we can recover more power. The function lifelines.statistics.survival_difference_at_fixed_point_in_time_test() uses the log(-log) transformation implicitly and compares the survival-ness of populations at a specific point in time.

from lifelines.statistics import survival_difference_at_fixed_point_in_time_test

results = survival_difference_at_fixed_point_in_time_test(point_in_time, T1, T2, event_observed_A=E1, event_observed_B=E2)
results.print_summary()


#### Subtraction and division between survival functions¶

If you are interested in taking the difference between two survival functions, simply trying to subtract the survival_function_ will likely fail if the DataFrame’s indexes are not equal. Fortunately, the lifelines.fitters.kaplan_meier_fitter.KaplanMeierFitter and lifelines.fitters.nelson_aalen_fitter.NelsonAalenFitter have a built-in subtract method:

kmf1.subtract(kmf2)


will produce the difference at every relevant time point. A similar function exists for division: divide. However, for rigorous testing of differences, lifelines comes with a statistics library. See below.
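A pandas-only sketch (with made-up survival estimates) of why naive subtraction fails and what subtract effectively does: the two survival functions are indexed by their own observed event times, which rarely coincide.

```python
import pandas as pd

# made-up survival estimates for two groups, indexed by their event times
s1 = pd.Series([1.0, 0.75, 0.50], index=[0, 3, 5])
s2 = pd.Series([1.0, 0.80, 0.40], index=[0, 2, 5])

naive = s1 - s2            # misaligned indexes produce NaNs
print(naive.isna().sum())  # 2 -> times 2 and 3 appear in only one index

# align on the union of event times and forward-fill (survival curves are
# step functions, constant between events), then subtract
union = s1.index.union(s2.index)
diff = s1.reindex(union).ffill() - s2.reindex(union).ffill()
print(diff)  # approximately [0.0, 0.2, -0.05, 0.1] at times 0, 2, 3, 5
```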

#### Restricted mean survival times (RMST)¶

lifelines has a function to accurately compute the restricted mean survival time, defined as

$\text{RMST}(t) = \int_0^t S(\tau) d\tau$

This is a good metric for comparing two survival curves, as their difference represents the area between the curves (see figure below). The upper limit $$t$$ is usually chosen to be finite because the tail of the estimated survival curve has high variance and can strongly influence the integral.

from lifelines.utils import restricted_mean_survival_time

ix = df['group'] == 'miR-137'
T, E = df['T'], df['E']

time_limit = 50

kmf_exp = KaplanMeierFitter().fit(T[ix], E[ix], label='exp')
rmst_exp = restricted_mean_survival_time(kmf_exp, t=time_limit)

kmf_con = KaplanMeierFitter().fit(T[~ix], E[~ix], label='control')
rmst_con = restricted_mean_survival_time(kmf_con, t=time_limit)


Furthermore, there exist plotting functions to plot the RMST:

from lifelines.plotting import rmst_plot
ax = plt.subplot(311)
rmst_plot(kmf_exp, t=time_limit, ax=ax)

ax = plt.subplot(312)
rmst_plot(kmf_con, t=time_limit, ax=ax)

ax = plt.subplot(313)
rmst_plot(kmf_exp, model2=kmf_con, t=time_limit, ax=ax)

### Model selection using lifelines¶

If using lifelines for prediction work, it’s ideal that you perform some type of cross-validation scheme. This cross-validation allows you to be confident that your out-of-sample predictions will work well in practice. It also allows you to choose between multiple models.

lifelines has a built-in k-fold cross-validation function. For example:

from lifelines import AalenAdditiveFitter, CoxPHFitter
from lifelines.utils import k_fold_cross_validation

# create the three models we'd like to compare
cph = CoxPHFitter()
aaf_1 = AalenAdditiveFitter(coef_penalizer=0.5)  # penalizer values chosen for illustration
aaf_2 = AalenAdditiveFitter(coef_penalizer=10)

print(np.mean(k_fold_cross_validation(cph, df, duration_col='T', event_col='E')))
print(np.mean(k_fold_cross_validation(aaf_1, df, duration_col='T', event_col='E')))
print(np.mean(k_fold_cross_validation(aaf_2, df, duration_col='T', event_col='E')))


From these results, Aalen’s Additive model with a penalizer of 10 is the best model for predicting future survival times.

lifelines also has wrappers to use scikit-learn’s cross validation and grid search tools. See how to use lifelines with scikit learn.

### Selecting a parametric model using QQ plots¶

QQ plots are normally constructed by sorting the values. However, this isn’t appropriate when there is censored data. lifelines has routines to create QQ plots with censored data. These are available under lifelines.plotting.qq_plot(), which accepts a fitted parametric lifelines model.

from lifelines import *
from lifelines.plotting import qq_plot

# generate some fake log-normal data
N = 1000
T_actual = np.exp(np.random.randn(N))
C = np.exp(np.random.randn(N))
E = T_actual < C
T = np.minimum(T_actual, C)

fig, axes = plt.subplots(2, 2, figsize=(8, 6))
axes = axes.reshape(4,)

for i, model in enumerate([WeibullFitter(), LogNormalFitter(), LogLogisticFitter(), ExponentialFitter()]):
    model.fit(T, E)
    qq_plot(model, ax=axes[i])

This graphical test can be used to invalidate models. For example, in the above figure, we can see that only the log-normal parametric model is appropriate (we expect some deviance in the tails, but not too much). Another use case is choosing the correct parametric AFT model.

qq_plot() also works with left-censored data.

### Selecting a parametric model using AIC¶

For univariate models (later to be extended to regression models), a natural way to compare different models is the AIC:

$\text{AIC}(\text{model}) = -2 \text{ll} + 2k$

where $$k$$ is the number of parameters (degrees of freedom) of the model and $$\text{ll}$$ is the maximum log-likelihood. The model with the lowest AIC is desirable, since the AIC trades off a high log-likelihood against using as few parameters as possible.

lifelines has a built-in function to automate AIC comparisons between univariate parametric models:

from lifelines.utils import find_best_parametric_model

best_model, best_aic_ = find_best_parametric_model(T, E)

print(best_model)
# <lifelines.SplineFitter:"Spline_estimate", fitted with 686 total observations, 387 right-censored observations>

best_model.plot_hazard()


### Plotting multiple figures on a plot¶

When .plot is called, an axis object is returned which can be passed into future calls of .plot:

kmf.fit(data1)
ax = kmf.plot()

kmf.fit(data2)
ax = kmf.plot(ax=ax)


If you have a pandas DataFrame with columns “group”, “T”, and “E”, then something like the following would work:

from lifelines import KaplanMeierFitter
from matplotlib import pyplot as plt

ax = plt.subplot(111)

kmf = KaplanMeierFitter()

for name, grouped_df in df.groupby('group'):
    kmf.fit(grouped_df["T"], grouped_df["E"], label=name)
    kmf.plot(ax=ax)


### Plotting options and styles¶

from lifelines.datasets import load_waltons

waltons = load_waltons()
T = waltons['T']
E = waltons['E']


#### Standard¶

kmf = KaplanMeierFitter()
kmf.fit(T, E, label="kmf.plot()")
kmf.plot()


#### Show censors and edit markers¶

kmf.fit(T, E, label="kmf.plot(show_censors=True, \ncensor_styles={'ms': 6, 'marker': 's'})")
kmf.plot(show_censors=True, censor_styles={'ms': 6, 'marker': 's'})


#### Hide confidence intervals¶

kmf.fit(T, E, label="kmf.plot(ci_show=False)")
kmf.plot(ci_show=False)


#### Displaying at-risk counts below plots¶

kmf.fit(T, E, label="label name")
kmf.plot(at_risk_counts=True)


#### Displaying multiple at-risk counts below plots¶

The function add_at_risk_counts in lifelines.plotting allows you to add At-Risk counts at the bottom of your figures. For example:

from matplotlib import pyplot as plt

from lifelines import KaplanMeierFitter
from lifelines.plotting import add_at_risk_counts

ix = waltons['group'] == 'control'

ax = plt.subplot(111)

kmf_control = KaplanMeierFitter()
ax = kmf_control.fit(waltons.loc[ix]['T'], waltons.loc[ix]['E'], label='control').plot(ax=ax)

kmf_exp = KaplanMeierFitter()
ax = kmf_exp.fit(waltons.loc[~ix]['T'], waltons.loc[~ix]['E'], label='exp').plot(ax=ax)

add_at_risk_counts(kmf_exp, kmf_control, ax=ax)



will display the two survival curves with their at-risk counts below the figure.

### Transforming survival-table data into lifelines format¶

Some lifelines classes are designed for lists or arrays that represent one individual per row. If you instead have data in a survival table format, there exists a utility method to get it into lifelines format.

Example: Suppose you have a CSV file with data that looks like this:

time  observed deaths  censored
   0                7         0
   1                1         1
   2                2         0
   3                1         2
   4                5         2

import pandas as pd
from lifelines.utils import survival_events_from_table

df = pd.read_csv('file.csv', names=['time', 'observed deaths', 'censored'])
df = df.set_index('time')

T, E, W = survival_events_from_table(df, observed_deaths_col='observed deaths', censored_col='censored')
# weights, W, is the number of occurrences of each observation - helps with data compression.

kmf = KaplanMeierFitter().fit(T, E, weights=W)


### Transforming observational data into survival-table format¶

Perhaps you are interested in viewing the survival table given some durations and censoring vectors.

from lifelines.utils import survival_table_from_events

table = survival_table_from_events(T, E)

"""
removed  observed  censored  entrance  at_risk
event_at
0               0         0         0        60       60
2               2         1         1         0       60
3               3         1         2         0       58
4               5         3         2         0       55
5              12         6         6         0       50
"""


### Set the index/timeline of a estimate¶

Suppose your dataset has lifetimes grouped near time 60. After fitting lifelines.fitters.kaplan_meier_fitter.KaplanMeierFitter, your survival function might look something like:

print(kmf.survival_function_)

KM-estimate
0          1.00
47         0.99
49         0.97
50         0.96
51         0.95
52         0.91
53         0.86
54         0.84
55         0.79
56         0.74
57         0.71
58         0.67
59         0.58
60         0.49
61         0.41
62         0.31
63         0.24
64         0.19
65         0.14
66         0.10
68         0.07
69         0.04
70         0.02
71         0.01
74         0.00


What you would like is a predictable and full index from 40 to 75. (Notice that in the above index, the last two time points are not adjacent: no lifetimes were observed at times 72 or 73.) This is especially useful for comparing multiple survival functions at specific time points. To do this, all fitter methods accept a timeline argument:

kmf.fit(T, timeline=range(40,75))
print(kmf.survival_function_)

KM-estimate
40         1.00
41         1.00
42         1.00
43         1.00
44         1.00
45         1.00
46         1.00
47         0.99
48         0.99
49         0.97
50         0.96
51         0.95
52         0.91
53         0.86
54         0.84
55         0.79
56         0.74
57         0.71
58         0.67
59         0.58
60         0.49
61         0.41
62         0.31
63         0.24
64         0.19
65         0.14
66         0.10
67         0.10
68         0.07
69         0.04
70         0.02
71         0.01
72         0.01
73         0.01
74         0.00


lifelines will intelligently forward-fill the estimates to unseen time points.

### Example SQL query to get survival data from a table¶

Below is a way to get an example dataset from a relational database (this may vary depending on your database):

SELECT
    id,
    DATEDIFF('dd', started_at, COALESCE(ended_at, CURRENT_DATE)) AS "T",
    (ended_at IS NOT NULL) AS "E"
FROM table


#### Explanation¶

Each row is an id, a duration, and a boolean indicating whether the event occurred or not. Recall that we denote a “True” if the event did occur, that is, ended_at is filled in (we observed the ended_at). Ex:

id T E
10 40 True
11 42 False
12 42 False
13 36 True
14 33 True
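If the raw timestamps are already in a pandas DataFrame rather than a database, the same transformation can be sketched as follows (the column names and dates here are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({
    'id': [10, 11],
    'started_at': pd.to_datetime(['2020-01-01', '2020-01-20']),
    'ended_at': pd.to_datetime(['2020-02-10', None]),  # NaT = still ongoing
})

today = pd.Timestamp('2020-03-01')
df['E'] = df['ended_at'].notna()                                     # event observed?
df['T'] = (df['ended_at'].fillna(today) - df['started_at']).dt.days  # duration in days
print(df[['id', 'T', 'E']])
```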

### Example SQL queries and transformations to get time varying data¶

For Cox time-varying models, we discussed what the dataset should look like in Dataset creation for time-varying regression. Typically we have a base dataset, and then we fold in the covariate datasets. Below are some SQL queries and Python transformations from end-to-end.

#### Base dataset: base_df¶

SELECT
    id,
    group,
    DATEDIFF('dd', dt.started_at, COALESCE(dt.ended_at, CURRENT_DATE)) AS "T",
    (ended_at IS NOT NULL) AS "E"
FROM dimension_table dt


#### Time-varying variables: cv¶

-- this could produce more than 1 row per subject
SELECT
    id,
    DATEDIFF('dd', dt.started_at, ft.event_at) AS "time",
    ft.var1
FROM fact_table ft
JOIN dimension_table dt
    USING(id)

from lifelines.utils import to_long_format, add_covariate_to_timeline

base_df = to_long_format(base_df, duration_col="T")
df = add_covariate_to_timeline(base_df, cv, duration_col="time", id_col="id", event_col="E")


#### Event variables: event_df¶

Another very common operation is to add event data to our time-varying dataset. For example, a dataset/SQL table that contains information about the dates of an event (and NULLS if the event didn’t occur). An example SQL query may look like:

SELECT
    id,
    DATEDIFF('dd', dt.started_at, ft.event1_at) AS "E1",
    DATEDIFF('dd', dt.started_at, ft.event2_at) AS "E2",
    DATEDIFF('dd', dt.started_at, ft.event3_at) AS "E3"
    ...
FROM dimension_table dt


In Pandas, this may look like:

    id    E1      E2     E3
0   1     1.0     NaN    2.0
1   2     NaN     5.0    NaN
2   3     3.0     5.0    7.0
...


Initially, this can’t be added to our baseline time-varying dataset. Using lifelines.utils.covariates_from_event_matrix() we can convert a DataFrame like this into one that can be easily added.

from lifelines.utils import covariates_from_event_matrix

cv = covariates_from_event_matrix(event_df, id_col='id')
print(cv)

       id  duration  E1  E2  E3
0       1       1.0   1   0   0
1       1       2.0   0   1   0
2       2       5.0   0   1   0
3       3       3.0   1   0   0
4       3       5.0   0   1   0
5       3       7.0   0   0   1

base_df = add_covariate_to_timeline(base_df, cv, duration_col="time", id_col="id", event_col="E")


### Example cumulative sums over time-varying covariates¶

Often we have either transactional covariate datasets or state covariate datasets. In a transactional dataset, it may make sense to sum up the covariates to represent administration of a treatment over time. For example, in the risky world of start-ups, we may want to sum up the funding amount received at a certain time. We also may be interested in the amount of the last round of funding. Below is an example to do just that:

Suppose we have an initial DataFrame of start-ups like:

import pandas as pd

seed_df = pd.DataFrame([
    {'id': 'FB', 'E': True, 'T': 12, 'funding': 0},
    {'id': 'SU', 'E': True, 'T': 10, 'funding': 0},
])


And a covariate DataFrame representing funding rounds like:

cv = pd.DataFrame([
    {'id': 'FB', 'funding': 30, 't': 5},
    {'id': 'FB', 'funding': 15, 't': 10},
    {'id': 'FB', 'funding': 50, 't': 15},
    {'id': 'SU', 'funding': 10, 't': 6},
    {'id': 'SU', 'funding': 9,  't': 10},
])


We can do the following to get both the cumulative funding received and the latest round of funding:

from lifelines.utils import to_long_format, add_covariate_to_timeline

df = seed_df.pipe(to_long_format, 'T')\
            .pipe(add_covariate_to_timeline, cv, 'id', 't', 'E', cumulative_sum=True)\
            .pipe(add_covariate_to_timeline, cv, 'id', 't', 'E', cumulative_sum=False)

"""
start  cumsum_funding  funding  stop  id      E
0      0             0.0      0.0   5.0  FB  False
1      5            30.0     30.0  10.0  FB  False
2     10            45.0     15.0  12.0  FB   True
3      0             0.0      0.0   6.0  SU  False
4      6            10.0     10.0  10.0  SU  False
5     10            19.0      9.0  10.0  SU   True
"""


### Sample size determination under a CoxPH model¶

Suppose you wish to measure the hazard ratio between two populations under the CoxPH model. That is, we want to evaluate the hypothesis H0: relative hazard ratio = 1 vs H1: relative hazard ratio != 1, where the relative hazard ratio is $$\exp{\left(\beta\right)}$$ for the experiment group vs the control group. A priori, we are interested in the sample sizes of the two groups necessary to achieve a certain statistical power. To do this in lifelines, there is the lifelines.statistics.sample_size_necessary_under_cph() function. For example:

from lifelines.statistics import sample_size_necessary_under_cph

desired_power = 0.8
ratio_of_participants = 1.
p_exp = 0.25
p_con = 0.35
postulated_hazard_ratio = 0.7
n_exp, n_con = sample_size_necessary_under_cph(desired_power, ratio_of_participants, p_exp, p_con, postulated_hazard_ratio)
# (421, 421)


This assumes you have estimates of the probability of event occurring for both the experiment and control group. This could be determined from previous experiments.

### Power determination under a CoxPH model¶

Suppose you wish to measure the hazard ratio between two populations under the CoxPH model. To determine the statistical power of a hazard ratio hypothesis test, under the CoxPH model, we can use lifelines.statistics.power_under_cph(). That is, suppose we want to know the probability that we reject the null hypothesis that the relative hazard ratio is 1, assuming the relative hazard ratio is truly different from 1. This function will give you that probability.

from lifelines.statistics import power_under_cph

n_exp = 50
n_con = 100
p_exp = 0.25
p_con = 0.35
postulated_hazard_ratio = 0.5
power = power_under_cph(n_exp, n_con, p_exp, p_con, postulated_hazard_ratio)
# 0.4957


### Problems with convergence in the Cox proportional hazard model¶

Since the estimation of the coefficients in the Cox proportional hazard model is done using the Newton-Raphson algorithm, there are sometimes problems with convergence. Here are some common symptoms and resolutions:

1. First check: look for ConvergenceWarning in the output. Most often problems in convergence are the result of problems in the dataset. lifelines has checks it runs against the dataset before fitting and warnings are outputted to the user.
2. delta contains nan value(s).: First try adding show_progress=True in the fit function. If the values in delta grow unbounded, it’s possible the step_size is too large. Try setting it to a small value (0.1-0.5).
3. Convergence halted due to matrix inversion problems: This means that there is high collinearity in your dataset. That is, a column is equal to the linear combination of 1 or more other columns. A common cause of this error is dummying categorical variables but not dropping a column, or some hierarchical structure in your dataset. Try to find the relationship by:
1. adding a penalizer to the model, ex: CoxPHFitter(penalizer=0.1).fit(…) until the model converges. In the print_summary(), the coefficients that have high collinearity will have large (absolute) magnitude in the coefs column.
2. using the variance inflation factor (VIF) to find redundant variables.
3. looking at the correlation matrix of your dataset.
4. Some coefficients are many orders of magnitude larger than others, and the standard error of the coefficient is also large or there are nan’s in the results. This can be seen using the print_summary method on a fitted CoxPHFitter object.
1. Look for a ConvergenceWarning about variances being too small. The dataset may contain a constant column, which provides no information for the regression (Cox model doesn’t have a traditional “intercept” term like other regression models).
2. The data is completely separable, which means that there exists a covariate the completely determines whether an event occurred or not. For example, for all “death” events in the dataset, there exists a covariate that is constant amongst all of them. Look for a ConvergenceWarning after the fit call. See https://stats.stackexchange.com/questions/11109/how-to-deal-with-perfect-separation-in-logistic-regression
3. Related to above, the relationship between a covariate and the duration may be completely determined. For example, if the rank correlation between a covariate and the duration is very close to 1 or -1, then the log-likelihood can be increased arbitrarily using just that covariate. Look for a ConvergenceWarning after the fit call.
4. Another problem may be a collinear relationship in your dataset. See point 3. above.
5. If adding a very small penalizer significantly changes the results (CoxPHFitter(penalizer=0.0001)), then this probably means that the step size in the iterative algorithm is too large. Try decreasing it (.fit(..., step_size=0.50) or smaller), and returning the penalizer term to 0.
6. If using the strata argument, make sure your stratification group sizes are not too small. Try df.groupby(strata).size().
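For point 3 above, a quick way to surface a collinear (or nearly collinear) column is the correlation matrix; the column names in this sketch are invented:

```python
import pandas as pd

df = pd.DataFrame({
    'age': [25, 30, 45, 50, 60],
    'age_scaled': [2.5, 3.0, 4.5, 5.0, 6.0],  # exactly 0.1 * age: redundant
    'prio': [2, 0, 5, 1, 3],
})

# off-diagonal entries at (or very near) +/-1.0 flag collinear pairs
corr = df.corr()
print(corr.round(2))
```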

### Adding weights to observations in a Cox model¶

There are two common uses for weights in a model. The first is as a data size reduction technique (known as case weights). If the dataset has more than one subject with identical attributes, including duration and event, then their likelihood contributions are identical as well. Thus, instead of computing the log-likelihood for each individual, we can compute it once and multiply it by the count of subjects with identical attributes. In practice, this involves first grouping subjects by covariates and counting. For example, using the Rossi dataset, we will use Pandas to group by the attributes (but other data processing tools, like Spark, could do this as well):

from lifelines.datasets import load_rossi

rossi = load_rossi()

rossi_weights = rossi.copy()
rossi_weights['weights'] = 1.
rossi_weights = rossi_weights.groupby(rossi.columns.tolist())['weights'].sum()\
                             .reset_index()


The original dataset has 432 rows, while the grouped dataset has 387 rows plus an additional weights column. CoxPHFitter has an additional parameter to specify which column is the weight column.

from lifelines import CoxPHFitter

cph = CoxPHFitter()
cph.fit(rossi_weights, 'week', 'arrest', weights_col='weights')


The fitting should be faster, and the results identical to the unweighted dataset. This option is also available in the CoxTimeVaryingFitter.

The second use of weights is sampling weights. These are typically positive, non-integer weights that represent some artificial under/over sampling of observations (ex: inverse probability of treatment weights). It is recommended to set robust=True in the call to the fit as the usual standard error is incorrect for sampling weights. The robust flag will use the sandwich estimator for the standard error.

Warning

The implementation of the sandwich estimator does not handle ties correctly (under the Efron handling of ties), and will give slightly or significantly different results from other software depending on the frequency of ties.

### Correlations between subjects in a Cox model¶

There are cases when your dataset contains correlated subjects, which breaks the independent-and-identically-distributed assumption. What are some cases when this may happen?

1. If a subject appears more than once in the dataset (common when subjects can have the event more than once)
2. If using a matching technique, like propensity-score matching, there is a correlation between pairs.

In both cases, the reported standard errors from an unadjusted Cox model will be wrong. In order to adjust for these correlations, there is a cluster_col keyword in fit() that allows you to specify the column in the DataFrame that contains designations for correlated subjects. For example, if subjects in rows 1 & 2 are correlated, but no other subjects are correlated, then the cluster_col column should have the same value for rows 1 & 2, and all others unique. Another example: for matched pairs, each subject in the pair should have the same value.

from lifelines.datasets import load_rossi
from lifelines import CoxPHFitter

rossi = load_rossi()

# this may come from a database, or other libraries that specialize in matching
matched_pairs = [
    (156, 230),
    (275, 228),
    (61, 252),
    (364, 201),
    (54, 340),
    (130, 33),
    (183, 145),
    (268, 140),
    (332, 259),
    (314, 413),
    (330, 211),
    (372, 255),
    # ...
]

rossi['id'] = None  # we will populate this column

for i, pair in enumerate(matched_pairs):
    subjectA, subjectB = pair
    rossi.loc[subjectA, 'id'] = i
    rossi.loc[subjectB, 'id'] = i

rossi = rossi.dropna(subset=['id'])

cph = CoxPHFitter()
cph.fit(rossi, 'week', 'arrest', cluster_col='id')


Specifying cluster_col will handle correlations, and invoke the robust sandwich estimator for standard errors (the same as setting robust=True).

### Serialize a lifelines model to disk¶

When you want to save (and later load) a lifelines model to disk, you can use the loads and dumps API from any popular serialization library.

from dill import loads, dumps

s_cph = dumps(cph)
cph_copy = loads(s_cph)
cph_copy.summary

s_kmf = dumps(kmf)
kmf_copy = loads(s_kmf)
kmf_copy.summary


### Produce a LaTeX or HTML table¶

New in version 0.23.1, lifelines models can output a LaTeX or HTML table from the print_summary method:

from lifelines.datasets import load_rossi
from lifelines import CoxPHFitter

rossi = load_rossi()
cph = CoxPHFitter().fit(rossi, 'week', 'arrest')

# print a LaTeX table:
cph.print_summary(style="latex")

# print a HTML summary and table:
cph.print_summary(style="html")


## API Reference¶

### Univariate models¶

#### AalenJohansenFitter¶

class lifelines.fitters.aalen_johansen_fitter.AalenJohansenFitter(jitter_level=0.0001, seed=None, alpha=0.05, calculate_variance=True, **kwargs)

Bases: lifelines.fitters.UnivariateFitter

Class for fitting the Aalen-Johansen estimate for the cumulative incidence function in a competing risks framework. Treating competing risks as censoring can result in over-estimated cumulative density functions. Using the Kaplan Meier estimator with competing risks as censored is akin to estimating the cumulative density if all competing risks had been prevented.

Aalen-Johansen cannot deal with tied times. We can get around this by randomly jittering the event times slightly. This will be done automatically and generates a warning.

Parameters:
• alpha (float, optional (default=0.05)) – The alpha value associated with the confidence intervals.
• jitter_level (float, optional (default=0.00001)) – If tied event times are detected, event times are randomly changed by this factor.
• seed (int, optional (default=None)) – To produce replicate results with tied event times, the numpy.random.seed can be specified in the function.
• calculate_variance (bool, optional (default=True)) – By default, AalenJohansenFitter calculates the variance and corresponding confidence intervals. Due to how the variance is calculated, the variance must be calculated for each event time individually. This is computationally intensive. For some procedures, like bootstrapping, the variance is not necessary. To reduce computation time during these procedures, calculate_variance can be set to False to skip the variance calculation.

Example

>>> from lifelines import AalenJohansenFitter
>>> ajf = AalenJohansenFitter(calculate_variance=True)
>>> ajf.fit(T, E, event_of_interest=1)
>>> ajf.cumulative_density_
>>> ajf.plot()


References

If you are interested in learning more, we recommend the following open-access paper; Edwards JK, Hester LL, Gokhale M, Lesko CR. Methodologic Issues When Estimating Risks in Pharmacoepidemiology. Curr Epidemiol Rep. 2016;3(4):285-296.

conditional_time_to_event_

Return a DataFrame, with index equal to survival_function_, that estimates the median duration remaining until the death event, given survival up until time t. For example, if an individual exists until age 1, their expected life remaining given they lived to time 1 might be 9 years.

cumulative_density_at_times(times, label=None)
cumulative_hazard_at_times(times, label=None)
divide(other) → pandas.core.frame.DataFrame

Divide the {0} of two {1} objects.

Parameters: other (same object as self)
fit(durations, event_observed, event_of_interest, timeline=None, entry=None, label=None, alpha=None, ci_labels=None, weights=None)
Parameters:
• durations – an array or pd.Series of length n: duration the subject was observed for.
• event_observed – an array, or pd.Series, of length n. Integer indicator of distinct events. Must be only positive integers, where 0 indicates censoring.
• event_of_interest – integer indicator for the event of interest. All other integers are considered competing events. Ex) event_observed contains 0, 1, 2 where 0: censored, 1: lung cancer, and 2: death. If event_of_interest=1, then death (2) is considered a competing event. The returned cumulative incidence function corresponds to the risk of lung cancer.
• timeline – return the best estimate at the values in timeline (positively increasing).
• entry – an array, or pd.Series, of length n: relative time when a subject entered the study. This is useful for left-truncated (not left-censored) observations. If None, all members of the population were born at time 0.
• label – a string to name the column of the estimate.
• alpha – the alpha value in the confidence intervals. Overrides the initializing alpha for this call to fit only.
• ci_labels – add custom column names to the generated confidence intervals as a length-2 list.
fit_right_censoring(*args, **kwargs)

Alias for fit

fit

hazard_at_times(times, label=None)
median_survival_time_

Return the unique time point, t, such that S(t) = 0.5. This is the “half-life” of the population, and a robust summary statistic for the population, if it exists.

percentile(p: float) → float

Return the unique time point, t, such that S(t) = p.

Parameters: p (float)

Note

For known parametric models, this should be overwritten by something more accurate.

plot(**kwargs)

Plots a pretty figure of the model

Matplotlib plot arguments can be passed in inside the kwargs, plus

Parameters:
• show_censors (bool) – place markers at censorship events. Default: False
• censor_styles (dict) – if show_censors, this dictionary will be passed into the plot call.
• ci_alpha (float) – the transparency level of the confidence interval. Default: 0.3
• ci_force_lines (bool) – force the confidence intervals to be line plots (versus default shaded areas). Default: False
• ci_show (bool) – show confidence intervals. Default: True
• ci_legend (bool) – if ci_force_lines is True, this is a boolean flag to add the lines’ labels to the legend. Default: False
• at_risk_counts (bool) – show group sizes at time points. See function add_at_risk_counts for details. Default: False
• loc (slice) – specify a time-based subsection of the curves to plot, ex: model.plot(loc=slice(0., 10.)) will plot the time values between t=0 and t=10.
• iloc (slice) – specify a location-based subsection of the curves to plot, ex: model.plot(iloc=slice(0, 10)) will plot the first 10 time points.
• ax – a pyplot axis object
plot_cumulative_density(**kwargs)
plot_cumulative_hazard(**kwargs)
plot_hazard(**kwargs)
plot_survival_function(**kwargs)
predict(times: Union[Iterable[float], float], interpolate=False) → pandas.core.series.Series

Predict the {0} at certain point in time. Uses a linear interpolation if points in time are not in the index.

Parameters:
• times (scalar, or array) – a scalar or an array of times to predict the value of {0} at.
• interpolate (bool, optional (default=False)) – for methods that produce a stepwise solution (Kaplan-Meier, Nelson-Aalen, etc.), setting this to True will use a linear interpolation method to provide a smoother answer.
subtract(other) → pandas.core.frame.DataFrame

Subtract the {0} of two {1} objects.

Parameters: other (same object as self)
survival_function_at_times(times, label=None)

#### BreslowFlemingHarringtonFitter¶

class lifelines.fitters.breslow_fleming_harrington_fitter.BreslowFlemingHarringtonFitter(alpha: float = 0.05, label: str = None)

Bases: lifelines.fitters.UnivariateFitter

Class for fitting the Breslow-Fleming-Harrington estimate for the survival function. This estimator is a biased estimator of the survival function, but it is more stable when the population is small and there are too few early truncation times: it may happen that the number of patients at risk and the number of deaths are the same.

Mathematically, the Nelson-Aalen estimator is the negative logarithm of the Breslow-Fleming-Harrington estimator.

Parameters: alpha (float, optional (default=0.05)) – The alpha value associated with the confidence intervals.
conditional_time_to_event_

Return a DataFrame, with index equal to survival_function_, that estimates the median duration remaining until the death event, given survival up until time t. For example, if an individual exists until age 1, their expected life remaining given they lived to time 1 might be 9 years.

cumulative_density_at_times(times, label=None)
cumulative_hazard_at_times(times, label=None)
divide(other) → pandas.core.frame.DataFrame

Divide the {0} of two {1} objects.

Parameters: other (same object as self)
fit(durations, event_observed=None, timeline=None, entry=None, label=None, alpha=None, ci_labels=None, weights=None)
Parameters:
• durations (an array, or pd.Series, of length n) – duration the subject was observed for.
• timeline – return the best estimate at the values in timeline (positively increasing).
• event_observed (an array, or pd.Series, of length n) – True if the death was observed, False if the event was lost (right-censored). Defaults to all True if event_observed is None.
• entry (an array, or pd.Series, of length n) – relative time when a subject entered the study. This is useful for left-truncated observations, i.e. the birth event was not observed. If None, defaults to all 0 (all birth events observed).
• label (string) – a string to name the column of the estimate.
• alpha (float, optional (default=0.05)) – the alpha value in the confidence intervals. Overrides the initializing alpha for this call to fit only.
• ci_labels (iterable) – add custom column names to the generated confidence intervals as a length-2 list.
fit_right_censoring(*args, **kwargs)

Alias for fit

fit

hazard_at_times(times, label=None)
median_survival_time_

Return the unique time point, t, such that S(t) = 0.5. This is the “half-life” of the population, and a robust summary statistic for the population, if it exists.

percentile(p: float) → float

Return the unique time point, t, such that S(t) = p.

Parameters: p (float)

Note

For known parametric models, this should be overwritten by something more accurate.

plot(**kwargs)

Plots a pretty figure of the model

Matplotlib plot arguments can be passed in inside the kwargs, plus

Parameters:
• show_censors (bool) – place markers at censorship events. Default: False
• censor_styles (dict) – if show_censors, this dictionary will be passed into the plot call.
• ci_alpha (float) – the transparency level of the confidence interval. Default: 0.3
• ci_force_lines (bool) – force the confidence intervals to be line plots (versus default shaded areas). Default: False
• ci_show (bool) – show confidence intervals. Default: True
• ci_legend (bool) – if ci_force_lines is True, this is a boolean flag to add the lines’ labels to the legend. Default: False
• at_risk_counts (bool) – show group sizes at time points. See function add_at_risk_counts for details. Default: False
• loc (slice) – specify a time-based subsection of the curves to plot, ex: model.plot(loc=slice(0., 10.)) will plot the time values between t=0 and t=10.
• iloc (slice) – specify a location-based subsection of the curves to plot, ex: model.plot(iloc=slice(0, 10)) will plot the first 10 time points.
• ax – a pyplot axis object
plot_cumulative_density(**kwargs)
plot_cumulative_hazard(**kwargs)
plot_hazard(**kwargs)
plot_survival_function(**kwargs)
predict(times: Union[Iterable[float], float], interpolate=False) → pandas.core.series.Series

Predict the {0} at certain point in time. Uses a linear interpolation if points in time are not in the index.

Parameters:
• times (scalar, or array) – a scalar or an array of times to predict the value of {0} at.
• interpolate (bool, optional (default=False)) – for methods that produce a stepwise solution (Kaplan-Meier, Nelson-Aalen, etc.), setting this to True will use a linear interpolation method to provide a smoother answer.
subtract(other) → pandas.core.frame.DataFrame

Subtract the {0} of two {1} objects.

Parameters: other (same object as self)
survival_function_at_times(times, label=None) → pandas.core.series.Series

Return a Pandas series of the predicted survival value at specific times

Parameters: times (iterable or float) label (str)

#### ExponentialFitter¶

class lifelines.fitters.exponential_fitter.ExponentialFitter(*args, **kwargs)

Bases: lifelines.fitters.KnownModelParametricUnivariateFitter

This class implements an Exponential model for univariate data. The model has parameterized form:

$S(t) = \exp\left(\frac{-t}{\lambda}\right), \lambda >0$

which implies the cumulative hazard rate is

$H(t) = \frac{t}{\lambda}$

and the hazard rate is:

$h(t) = \frac{1}{\lambda}$

After calling the .fit method, you have access to properties like: survival_function_, lambda_, cumulative_hazard_ A summary of the fit is available with the method print_summary()

Parameters: alpha (float, optional (default=0.05)) – the level in the confidence intervals.

Important

The parameterization of this model changed in lifelines 0.19.0. Previously, the cumulative hazard looked like $$\lambda t$$. The parameterization is now the reciprocal of $$\lambda$$.

cumulative_hazard_

The estimated cumulative hazard (with custom timeline if provided)

Type: DataFrame
confidence_interval_cumulative_hazard_

The lower and upper confidence intervals for the cumulative hazard

Type: DataFrame
hazard_

The estimated hazard (with custom timeline if provided)

Type: DataFrame
confidence_interval_hazard_

The lower and upper confidence intervals for the hazard

Type: DataFrame
survival_function_

The estimated survival function (with custom timeline if provided)

Type: DataFrame
confidence_interval_survival_function_

The lower and upper confidence intervals for the survival function

Type: DataFrame
variance_matrix_

The variance matrix of the coefficients

Type: numpy array
median_survival_time_

The median time to event

Type: float
lambda_

The fitted parameter in the model

Type: float
durations

The durations provided

Type: array
event_observed

The event_observed variable provided

Type: array
timeline

The time line to use for plotting and indexing

Type: array
entry

The entry array provided, or None

Type: array or None
cumulative_density_

The estimated cumulative density function (with custom timeline if provided)

Type: DataFrame
confidence_interval_cumulative_density_

The lower and upper confidence intervals for the cumulative density

Type: DataFrame
conditional_time_to_event_

Return a DataFrame, with index equal to survival_function_, that estimates the median duration remaining until the death event, given survival up until time t. For example, if an individual survives to time 1, their estimated median remaining lifetime might be 9 years.
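For the Exponential model specifically, memorylessness implies the conditional median remaining lifetime is constant in t. This can be checked with a standalone sketch (lam is an illustrative value, and S is written out by hand rather than taken from a fitted object):

```python
import math

lam = 4.0  # illustrative lambda value

def S(t):
    return math.exp(-t / lam)  # exponential survival function

def median_remaining(t):
    """Solve S(t + s) = 0.5 * S(t) for s: the median remaining lifetime at time t."""
    return -lam * math.log(0.5 * S(t)) - t

# Memorylessness: the answer is lam * ln 2 no matter how long the subject survived.
for t in (0.0, 1.0, 10.0):
    assert abs(median_remaining(t) - lam * math.log(2)) < 1e-9
```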

confidence_interval_

The confidence interval of the cumulative hazard. This is an alias for confidence_interval_cumulative_hazard_.

confidence_interval_cumulative_density_

The lower and upper confidence intervals for the cumulative density

confidence_interval_cumulative_hazard_

The confidence interval of the cumulative hazard. This is an alias for confidence_interval_.

confidence_interval_hazard_

The confidence interval of the hazard.

confidence_interval_survival_function_

The lower and upper confidence intervals for the survival function

cumulative_density_at_times(times, label=None) → pandas.core.series.Series

Return a Pandas series of the predicted cumulative density function (1-survival function) at specific times.

Parameters: times (iterable or float) – values to return the cumulative density at. label (string, optional) – Rename the series returned. Useful for plotting.
cumulative_hazard_at_times(times, label=None) → pandas.core.series.Series

Return a Pandas series of the predicted cumulative hazard value at specific times.

Parameters: times (iterable or float) – values to return the cumulative hazard at. label (string, optional) – Rename the series returned. Useful for plotting.
divide(other) → pandas.core.frame.DataFrame

Divide the estimates of two fitter objects of the same type.

Parameters: other (same object as self)
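Conceptually, divide (like subtract) aligns the two fitters' estimates on their time index and combines them element-wise. A small pandas sketch with made-up cumulative hazards (not real fitter output) illustrates the idea:

```python
import pandas as pd

# Made-up cumulative-hazard estimates on a shared timeline, standing in for
# fitter_a.cumulative_hazard_ and fitter_b.cumulative_hazard_ (not real output).
timeline = [1, 2, 3]
ch_a = pd.DataFrame({"A": [0.1, 0.3, 0.6]}, index=timeline)
ch_b = pd.DataFrame({"B": [0.2, 0.3, 0.4]}, index=timeline)

# Element-wise ratio and difference over the aligned index -- conceptually
# what divide() and subtract() compute for two fitted objects.
ratio = ch_a.values / ch_b.values
diff = ch_a.values - ch_b.values
```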
event_table
fit(durations, event_observed=None, timeline=None, label=None, alpha=None, ci_labels=None, show_progress=False, entry=None, weights=None, initial_point=None) → ParametricUnivariateFitter

Fit the model to a right-censored dataset.

Parameters: durations (an array, or pd.Series) – length n, duration subject was observed for. event_observed (numpy array or pd.Series, optional) – length n, True if the death was observed, False if the event was lost (right-censored). Defaults to all True if event_observed is None. timeline (list, optional) – return the estimate at the values in timeline (positively increasing). label (string, optional) – a string to name the column of the estimate. alpha (float, optional) – the alpha value in the confidence intervals. Overrides the initializing alpha for this call to fit only. ci_labels (list, optional) – add custom column names to the generated confidence intervals as a length-2 list: [lower-bound name, upper-bound name].
fit_interval_censoring(lower_bound, upper_bound, event_observed=None, timeline=None, label=None, alpha=None, ci_labels=None, show_progress=False, entry=None, weights=None, initial_point=None) → ParametricUnivariateFitter

Fit the model to an interval censored dataset.

Parameters: lower_bound (an array, or pd.Series) – length n, the start of the period the subject experienced the event in. upper_bound (an array, or pd.Series) – length n, the end of the period the subject experienced the event in. If the value is equal to the corresponding value in lower_bound, then the individual’s event was observed (not censored). event_observed (numpy array or pd.Series, optional) – length n; if not provided, inferred from lower_bound and upper_bound (if lower_bound == upper_bound then the event was observed; if lower_bound < upper_bound, then the event is censored). timeline (list, optional) – return the estimate at the values in timeline (positively increasing). label (string, optional) – a string to name the column of the estimate. alpha (float, optional) – the alpha value in the confidence intervals. Overrides the initializing alpha for this call to fit only. ci_labels (list, optional) – add custom column names to the generated confidence intervals as a length-2 list: [lower-bound name, upper-bound name].
fit_left_censoring(durations, event_observed=None, timeline=None, label=None, alpha=None, ci_labels=None, show_progress=False, entry=None, weights=None, initial_point=None) → ParametricUnivariateFitter

Fit the model to a left-censored dataset

Parameters: durations (an array, or pd.Series) – length n, duration subject was observed for. event_observed (numpy array or pd.Series, optional) – length n, True if the death was observed, False if the event was censored. Defaults to all True if event_observed is None. timeline (list, optional) – return the estimate at the values in timeline (positively increasing). label (string, optional) – a string to name the column of the estimate. alpha (float, optional) – the alpha value in the confidence intervals. Overrides the initializing alpha for this call to fit only. ci_labels (list, optional) – add custom column names to the generated confidence intervals as a length-2 list: [lower-bound name, upper-bound name].
fit_right_censoring(*args, **kwargs)

Alias for fit

See also: fit

hazard_at_times(times, label=None) → pandas.core.series.Series

Return a Pandas series of the predicted hazard at specific times.

Parameters: times (iterable or float) – values to return the hazard at. label (string, optional) – Rename the series returned. Useful for plotting.
median_survival_time_

Return the unique time point, t, such that S(t) = 0.5. This is the “half-life” of the population, and a robust summary statistic for the population, if it exists.

percentile(p)

Return the unique time point, t, such that S(t) = p.

Parameters: p (float)

Note

For known parametric models, this should be overwritten by something more accurate.
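For the Exponential model, percentile has a closed form: solving $$S(t) = p$$ gives $$t = -\lambda \log p$$, so the median is $$\lambda \log 2$$. A standalone check (lam is an illustrative value, and percentile here is hand-written, not the fitter's method):

```python
import math

lam = 3.0  # illustrative lambda value

def percentile(p):
    """Solve S(t) = p for t, where S(t) = exp(-t / lam)."""
    return -lam * math.log(p)

median = percentile(0.5)
assert abs(median - lam * math.log(2)) < 1e-12  # median_survival_time_ = lam * ln 2
assert abs(math.exp(-percentile(0.25) / lam) - 0.25) < 1e-12  # round-trips through S
```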

plot(**kwargs)

Produce a pretty-plot of the estimate.

plot_cumulative_density(**kwargs)
plot_cumulative_hazard(**kwargs)
plot_hazard(**kwargs)
plot_survival_function(**kwargs)
predict(times: Union[Iterable[float], float], interpolate=False) → pandas.core.series.Series

Predict the estimate at certain points in time. Uses linear interpolation if the given times are not in the index.

Parameters: times (scalar, or array) – a scalar or an array of times to predict the value of the estimate at. interpolate (bool, optional (default=False)) – for methods that produce a stepwise solution (Kaplan-Meier, Nelson-Aalen, etc.), setting this to True uses linear interpolation to provide a smoother answer.
print_summary(decimals=2, style=None, **kwargs)

Print summary statistics describing the fit, the coefficients, and the error bounds.

Parameters: decimals (int, optional (default=2)) – specify the number of decimal places to show style (string) – {html, ascii, latex} kwargs – print additional metadata in the output (useful to provide model names, dataset names, etc.) when comparing multiple outputs.
subtract(other) → pandas.core.frame.DataFrame

Subtract the estimates of two fitter objects of the same type.

Parameters: other (same object as self)
summary

Summary statistics describing the fit.

See also: print_summary

survival_function_at_times(times, label=None) → pandas.core.series.Series

Return a Pandas series of the predicted survival value at specific times.

Parameters: times (iterable or float) – values to return the survival function at. label (string, optional) – Rename the series returned. Useful for plotting.

#### GeneralizedGammaFitter¶

class lifelines.fitters.generalized_gamma_fitter.GeneralizedGammaFitter(*args, **kwargs)

Bases: lifelines.fitters.KnownModelParametricUnivariateFitter

This class implements a Generalized Gamma model for univariate data. The survival function has parameterized form:

$\begin{split}S(t) = \begin{cases} 1 - \Gamma_{RL}\left(\frac{1}{\lambda^2}; \frac{e^{\lambda\left(\frac{\log(t)-\mu}{\sigma}\right)}}{\lambda^2}\right) & \text{if } \lambda > 0 \\ \Gamma_{RL}\left(\frac{1}{\lambda^2}; \frac{e^{\lambda\left(\frac{\log(t)-\mu}{\sigma}\right)}}{\lambda^2}\right) & \text{if } \lambda \le 0 \end{cases}\end{split}$

where $$\Gamma_{RL}$$ is the regularized lower incomplete Gamma function.

This model has the Exponential, Weibull, Gamma and Log-Normal as sub-models, and thus can be used as a way to test which model to use:

1. When $$\lambda = 1$$ and $$\sigma = 1$$, then the data is Exponential.
2. When $$\lambda = 1$$ then the data is Weibull.
3. When $$\sigma = \lambda$$ then the data is Gamma.
4. When $$\lambda = 0$$ then the data is Log-Normal.
5. When $$\lambda = -1$$ then the data is Inverse-Weibull.
6. When $$\sigma = -\lambda$$ then the data is Inverse-Gamma.

After calling the .fit method, you have access to properties like cumulative_hazard_ and survival_function_. A summary of the fit is available with the print_summary() method.

Important

The parameterization implemented has $$\log\sigma$$, thus there is a ln_sigma_ in the output. Exponentiate this parameter to recover $$\sigma$$.

Important

This model is experimental. Its API may change in the future. Also, its convergence is not very stable.

Parameters: alpha (float, optional (default=0.05)) – the level in the confidence intervals.

Examples

>>> from lifelines import GeneralizedGammaFitter
>>> ggf = GeneralizedGammaFitter()
>>> ggf.fit(waltons['T'], waltons['E'])
>>> ggf.plot()
>>> ggf.summary

cumulative_hazard_

The estimated cumulative hazard (with custom timeline if provided)

Type: DataFrame
hazard_

The estimated hazard (with custom timeline if provided)

Type: DataFrame
survival_function_

The estimated survival function (with custom timeline if provided)

Type: DataFrame
cumulative_density_

The estimated cumulative density function (with custom timeline if provided)

Type: DataFrame
variance_matrix_

The variance matrix of the coefficients

Type: numpy array
median_survival_time_

The median time to event

Type: float
mu_

The fitted parameter in the model

Type: float
ln_sigma_

The fitted parameter in the model

Type: float
lambda_

The fitted parameter in the model

Type: float
durations

The durations provided

Type: array
event_observed

The event_observed variable provided

Type: array
timeline

The time line to use for plotting and indexing

Type: array
entry

The entry array provided, or None

Type: array or None
conditional_time_to_event_

Return a DataFrame, with index equal to survival_function_, that estimates the median duration remaining until the death event, given survival up until time t. For example, if an individual survives to time 1, their estimated median remaining lifetime might be 9 years.

confidence_interval_

The confidence interval of the cumulative hazard. This is an alias for confidence_interval_cumulative_hazard_.

confidence_interval_cumulative_density_

The lower and upper confidence intervals for the cumulative density

confidence_interval_cumulative_hazard_

The confidence interval of the cumulative hazard. This is an alias for confidence_interval_.

confidence_interval_hazard_

The confidence interval of the hazard.

confidence_interval_survival_function_

The lower and upper confidence intervals for the survival function

cumulative_density_at_times(times, label=None) → pandas.core.series.Series

Return a Pandas series of the predicted cumulative density function (1-survival function) at specific times.

Parameters: times (iterable or float) – values to return the cumulative density at. label (string, optional) – Rename the series returned. Useful for plotting.
cumulative_hazard_at_times(times, label=None) → pandas.core.series.Series

Return a Pandas series of the predicted cumulative hazard value at specific times.

Parameters: times (iterable or float) – values to return the cumulative hazard at. label (string, optional) – Rename the series returned. Useful for plotting.
divide(other) → pandas.core.frame.DataFrame

Divide the estimates of two fitter objects of the same type.

Parameters: other (same object as self)
event_table
fit(durations, event_observed=None, timeline=None, label=None, alpha=None, ci_labels=None, show_progress=False, entry=None, weights=None, initial_point=None) → ParametricUnivariateFitter

Fit the model to a right-censored dataset.

Parameters: durations (an array, or pd.Series) – length n, duration subject was observed for. event_observed (numpy array or pd.Series, optional) – length n, True if the death was observed, False if the event was lost (right-censored). Defaults to all True if event_observed is None. timeline (list, optional) – return the estimate at the values in timeline (positively increasing). label (string, optional) – a string to name the column of the estimate. alpha (float, optional) – the alpha value in the confidence intervals. Overrides the initializing alpha for this call to fit only. ci_labels (list, optional) – add custom column names to the generated confidence intervals as a length-2 list: [lower-bound name, upper-bound name].
fit_interval_censoring(lower_bound, upper_bound, event_observed=None, timeline=None, label=None, alpha=None, ci_labels=None, show_progress=False, entry=None, weights=None, initial_point=None) → ParametricUnivariateFitter

Fit the model to an interval censored dataset.

Parameters: lower_bound (an array, or pd.Series) – length n, the start of the period the subject experienced the event in. upper_bound (an array, or pd.Series) – length n, the end of the period the subject experienced the event in. If the value is equal to the corresponding value in lower_bound, then the individual’s event was observed (not censored). event_observed (numpy array or pd.Series, optional) – length n; if not provided, inferred from lower_bound and upper_bound (if lower_bound == upper_bound then the event was observed; if lower_bound < upper_bound, then the event is censored). timeline (list, optional) – return the estimate at the values in timeline (positively increasing). label (string, optional) – a string to name the column of the estimate. alpha (float, optional) – the alpha value in the confidence intervals. Overrides the initializing alpha for this call to fit only. ci_labels (list, optional) – add custom column names to the generated confidence intervals as a length-2 list: [lower-bound name, upper-bound name].
fit_left_censoring(durations, event_observed=None, timeline=None, label=None, alpha=None, ci_labels=None, show_progress=False, entry=None, weights=None, initial_point=None) → ParametricUnivariateFitter

Fit the model to a left-censored dataset

Parameters: durations (an array, or pd.Series) – length n, duration subject was observed for. event_observed (numpy array or pd.Series, optional) – length n, True if the death was observed, False if the event was censored. Defaults to all True if event_observed is None. timeline (list, optional) – return the estimate at the values in timeline (positively increasing). label (string, optional) – a string to name the column of the estimate. alpha (float, optional) – the alpha value in the confidence intervals. Overrides the initializing alpha for this call to fit only. ci_labels (list, optional) – add custom column names to the generated confidence intervals as a length-2 list: [lower-bound name, upper-bound name].
fit_right_censoring(*args, **kwargs)

Alias for fit

See also: fit

hazard_at_times(times, label=None) → pandas.core.series.Series

Return a Pandas series of the predicted hazard at specific times.

Parameters: times (iterable or float) – values to return the hazard at. label (string, optional) – Rename the series returned. Useful for plotting.
median_survival_time_

Return the unique time point, t, such that S(t) = 0.5. This is the “half-life” of the population, and a robust summary statistic for the population, if it exists.

percentile(p)

Return the unique time point, t, such that S(t) = p.

Parameters: p (float)

Note

For known parametric models, this should be overwritten by something more accurate.

plot(**kwargs)

Produce a pretty-plot of the estimate.

plot_cumulative_density(**kwargs)
plot_cumulative_hazard(**kwargs)
plot_hazard(**kwargs)
plot_survival_function(**kwargs)
predict(times: Union[Iterable[float], float], interpolate=False) → pandas.core.series.Series

Predict the estimate at certain points in time. Uses linear interpolation if the given times are not in the index.

Parameters: times (scalar, or array) – a scalar or an array of times to predict the value of the estimate at. interpolate (bool, optional (default=False)) – for methods that produce a stepwise solution (Kaplan-Meier, Nelson-Aalen, etc.), setting this to True uses linear interpolation to provide a smoother answer.
print_summary(decimals=2, style=None, **kwargs)

Print summary statistics describing the fit, the coefficients, and the error bounds.

Parameters: decimals (int, optional (default=2)) – specify the number of decimal places to show style (string) – {html, ascii, latex} kwargs – print additional metadata in the output (useful to provide model names, dataset names, etc.) when comparing multiple outputs.
subtract(other) → pandas.core.frame.DataFrame

Subtract the estimates of two fitter objects of the same type.

Parameters: other (same object as self)
summary

Summary statistics describing the fit.

See also: print_summary

survival_function_at_times(times, label=None) → pandas.core.series.Series

Return a Pandas series of the predicted survival value at specific times.

Parameters: times (iterable or float) – values to return the survival function at. label (string, optional) – Rename the series returned. Useful for plotting.

#### KaplanMeierFitter¶

class lifelines.fitters.kaplan_meier_fitter.KaplanMeierFitter(alpha: float = 0.05, label: str = None)

Bases: lifelines.fitters.UnivariateFitter

Class for fitting the Kaplan-Meier estimate for the survival function.

Parameters: alpha (float, optional (default=0.05)) – The alpha value associated with the confidence intervals. label (string, optional) – Provide a new label for the estimate - useful if looking at many groups.

Examples

>>> from lifelines import KaplanMeierFitter
>>> kmf = KaplanMeierFitter(label="waltons_data")
>>> kmf.fit(waltons['T'], waltons['E'])
>>> kmf.plot()

survival_function_

The estimated survival function (with custom timeline if provided)

Type: DataFrame
median_survival_time_

The estimated median time to event. np.inf if it doesn’t exist.

Type: float
confidence_interval_

The lower and upper confidence intervals for the survival function. An alias of confidence_interval_survival_function_. Uses Greenwood’s Exponential formula (“log-log” in R).

Type: DataFrame
confidence_interval_survival_function_

The lower and upper confidence intervals for the survival function. An alias of confidence_interval_. Uses Greenwood’s Exponential formula (“log-log” in R).

Type: DataFrame
cumulative_density_

The estimated cumulative density function (with custom timeline if provided)

Type: DataFrame
confidence_interval_cumulative_density_

The lower and upper confidence intervals for the cumulative density.

Type: DataFrame
durations

The durations provided

Type: array
event_observed

The event_observed variable provided

Type: array
timeline

The time line to use for plotting and indexing

Type: array
entry

The entry array provided, or None

Type: array or None
event_table

A summary of the life table

Type: DataFrame
conditional_time_to_event_

Return a DataFrame, with index equal to survival_function_, that estimates the median duration remaining until the death event, given survival up until time t. For example, if an individual survives to time 1, their estimated median remaining lifetime might be 9 years.

cumulative_density_at_times(times, label=None) → pandas.core.series.Series

Return a Pandas series of the predicted cumulative density at specific times

Parameters: times (iterable or float). Returns: pd.Series.
cumulative_hazard_at_times(times, label=None)
divide(other) → pandas.core.frame.DataFrame

Divide the estimates of two fitter objects of the same type.

Parameters: other (same object as self)
fit(durations, event_observed=None, timeline=None, entry=None, label=None, alpha=None, ci_labels=None, weights=None)

Fit the model to a right-censored dataset

Parameters: durations (an array, list, pd.DataFrame or pd.Series) – length n – duration subject was observed for. event_observed (an array, list, pd.DataFrame, or pd.Series, optional) – True if the death was observed, False if the event was lost (right-censored). Defaults to all True if event_observed is None. timeline (an array, list, pd.DataFrame, or pd.Series, optional) – return the best estimate at the values in timeline (positively increasing). entry (an array, list, pd.DataFrame, or pd.Series, optional) – relative time when a subject entered the study. This is useful for left-truncated (not left-censored) observations. If None, all members of the population entered the study when they were “born”. label (string, optional) – a string to name the column of the estimate. alpha (float, optional) – the alpha value in the confidence intervals. Overrides the initializing alpha for this call to fit only. ci_labels (tuple, optional) – add custom column names to the generated confidence intervals as a length-2 list: [lower-bound name, upper-bound name].
fit_left_censoring(durations, event_observed=None, timeline=None, entry=None, label=None, alpha=None, ci_labels=None, weights=None)

Fit the model to a left-censored dataset

Parameters: durations (an array, list, pd.DataFrame or pd.Series) – length n – duration subject was observed for. event_observed (an array, list, pd.DataFrame, or pd.Series, optional) – True if the death was observed, False if the event was censored. Defaults to all True if event_observed is None. timeline (an array, list, pd.DataFrame, or pd.Series, optional) – return the best estimate at the values in timeline (positively increasing). entry (an array, list, pd.DataFrame, or pd.Series, optional) – relative time when a subject entered the study. This is useful for left-truncated (not left-censored) observations. If None, all members of the population entered the study when they were “born”. label (string, optional) – a string to name the column of the estimate. alpha (float, optional) – the alpha value in the confidence intervals. Overrides the initializing alpha for this call to fit only. ci_labels (tuple, optional) – add custom column names to the generated confidence intervals as a length-2 list: [lower-bound name, upper-bound name].
fit_right_censoring(*args, **kwargs)

Alias for fit

See also: fit

hazard_at_times(times, label=None)
median_survival_time_

Return the unique time point, t, such that S(t) = 0.5. This is the “half-life” of the population, and a robust summary statistic for the population, if it exists.

percentile(p: float) → float

Return the unique time point, t, such that S(t) = p.

Parameters: p (float)

Note

For known parametric models, this should be overwritten by something more accurate.

plot(**kwargs)

Plots a pretty figure of the model

Matplotlib plot arguments can be passed in inside the kwargs, plus

Parameters:

• show_censors (bool) – place markers at censorship events. Default: False
• censor_styles (dict) – if show_censors, this dictionary will be passed into the plot call.
• ci_alpha (float) – the transparency level of the confidence interval. Default: 0.3
• ci_force_lines (bool) – force the confidence intervals to be line plots (versus default shaded areas). Default: False
• ci_show (bool) – show confidence intervals. Default: True
• ci_legend (bool) – if ci_force_lines is True, add the lines’ labels to the legend. Default: False
• at_risk_counts (bool) – show group sizes at time points. See function add_at_risk_counts for details. Default: False
• loc (slice) – specify a time-based subsection of the curves to plot, e.g. model.plot(loc=slice(0., 10.)) will plot the time values between t=0 and t=10.
• iloc (slice) – specify a location-based subsection of the curves to plot, e.g. model.plot(iloc=slice(0, 10)) will plot the first 10 time points.
• ax – a pyplot axis object to draw on
plot_cumulative_density(**kwargs)

Plots a pretty figure of the cumulative density function.

Matplotlib plot arguments can be passed in inside the kwargs.

Parameters:

• show_censors (bool) – place markers at censorship events. Default: False
• censor_styles (dict) – if show_censors, this dictionary will be passed into the plot call.
• ci_alpha (float) – the transparency level of the confidence interval. Default: 0.3
• ci_force_lines (bool) – force the confidence intervals to be line plots (versus default shaded areas). Default: False
• ci_show (bool) – show confidence intervals. Default: True
• ci_legend (bool) – if ci_force_lines is True, add the lines’ labels to the legend. Default: False
• at_risk_counts (bool) – show group sizes at time points. See function add_at_risk_counts for details. Default: False
• loc (slice) – specify a time-based subsection of the curves to plot, e.g. model.plot(loc=slice(0., 10.)) will plot the time values between t=0 and t=10.
• iloc (slice) – specify a location-based subsection of the curves to plot, e.g. model.plot(iloc=slice(0, 10)) will plot the first 10 time points.
• ax – a pyplot axis object to draw on
plot_cumulative_hazard(**kwargs)
plot_hazard(**kwargs)
plot_loglogs(*args, **kwargs)

Plot $$\log(-\log(S(t)))$$ against $$\log(t)$$. Same arguments as .plot.

plot_survival_function(**kwargs)

Alias of plot

predict(times: Union[Iterable[float], float], interpolate=False) → pandas.core.series.Series

Predict the estimate at certain points in time. Uses linear interpolation if the given times are not in the index.

Parameters: times (scalar, or array) – a scalar or an array of times to predict the value of the estimate at. interpolate (bool, optional (default=False)) – for methods that produce a stepwise solution (Kaplan-Meier, Nelson-Aalen, etc.), setting this to True uses linear interpolation to provide a smoother answer.
subtract(other) → pandas.core.frame.DataFrame

Subtract the estimates of two fitter objects of the same type.

Parameters: other (same object as self)
survival_function_at_times(times, label=None) → pandas.core.series.Series

Return a Pandas series of the predicted survival value at specific times

Parameters: times (iterable or float). Returns: pd.Series.

#### LogLogisticFitter¶

class lifelines.fitters.log_logistic_fitter.LogLogisticFitter(*args, **kwargs)

Bases: lifelines.fitters.KnownModelParametricUnivariateFitter

This class implements a Log-Logistic model for univariate data. The model has parameterized form:

$S(t) = \left(1 + \left(\frac{t}{\alpha}\right)^{\beta}\right)^{-1}, \alpha > 0, \beta > 0,$

The $$\alpha$$ (scale) parameter has an interpretation as being equal to the median lifetime of the population. The $$\beta$$ parameter influences the shape of the hazard. The hazard rate is:

$h(t) = \frac{\left(\frac{\beta}{\alpha}\right)\left(\frac{t}{\alpha}\right) ^ {\beta-1}}{\left(1 + \left(\frac{t}{\alpha}\right)^{\beta}\right)}$

and the cumulative hazard is:

$H(t) = \log\left(\left(\frac{t}{\alpha}\right) ^ {\beta} + 1\right)$
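These formulas can be sanity-checked numerically: $$S(\alpha) = 1/2$$ confirms the median interpretation of $$\alpha$$, and $$H(t) = -\log S(t)$$ holds identically. A standalone sketch with illustrative parameter values (not fitted output):

```python
import math

alpha, beta = 2.0, 1.5  # illustrative scale and shape parameters

def S(t):
    return 1.0 / (1.0 + (t / alpha) ** beta)    # log-logistic survival function

def H(t):
    return math.log((t / alpha) ** beta + 1.0)  # log-logistic cumulative hazard

# alpha is the median lifetime: S(alpha) = 1 / (1 + 1) = 0.5
assert abs(S(alpha) - 0.5) < 1e-12

# H(t) agrees with -log S(t) at several points
for t in (0.5, 2.0, 7.0):
    assert abs(H(t) + math.log(S(t))) < 1e-12
```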

After calling the .fit method, you have access to properties like cumulative_hazard_, survival_function_, alpha_ and beta_, and methods like plot(). A summary of the fit is available with the print_summary() method.

Parameters: alpha (float, optional (default=0.05)) – the level in the confidence intervals.

Examples

>>> from lifelines import LogLogisticFitter
>>> llf = LogLogisticFitter()
>>> llf.fit(waltons['T'], waltons['E'])
>>> llf.plot()
>>> print(llf.alpha_)

cumulative_hazard_

The estimated cumulative hazard (with custom timeline if provided)

Type: DataFrame
hazard_

The estimated hazard (with custom timeline if provided)

Type: DataFrame
survival_function_

The estimated survival function (with custom timeline if provided)

Type: DataFrame
cumulative_density_

The estimated cumulative density function (with custom timeline if provided)

Type: DataFrame
variance_matrix_

The variance matrix of the coefficients

Type: numpy array
median_survival_time_

The median time to event

Type: float
alpha_

The fitted parameter in the model

Type: float
beta_

The fitted parameter in the model

Type: float
durations

The durations provided

Type: array
event_observed

The event_observed variable provided

Type: array
timeline

The time line to use for plotting and indexing

Type: array
entry

The entry array provided, or None

Type: array or None
conditional_time_to_event_

Return a DataFrame, with index equal to survival_function_, that estimates the median duration remaining until the death event, given survival up until time t. For example, if an individual survives to time 1, their estimated median remaining lifetime might be 9 years.

confidence_interval_

The confidence interval of the cumulative hazard. This is an alias for confidence_interval_cumulative_hazard_.

confidence_interval_cumulative_density_

The lower and upper confidence intervals for the cumulative density

confidence_interval_cumulative_hazard_

The confidence interval of the cumulative hazard. This is an alias for confidence_interval_.

confidence_interval_hazard_

The confidence interval of the hazard.

confidence_interval_survival_function_

The lower and upper confidence intervals for the survival function

cumulative_density_at_times(times, label=None) → pandas.core.series.Series

Return a Pandas series of the predicted cumulative density function (1-survival function) at specific times.

Parameters: times (iterable or float) – values to return the cumulative density at. label (string, optional) – Rename the series returned. Useful for plotting.
cumulative_hazard_at_times(times, label=None) → pandas.core.series.Series

Return a Pandas series of the predicted cumulative hazard value at specific times.

Parameters: times (iterable or float) – values to return the cumulative hazard at. label (string, optional) – Rename the series returned. Useful for plotting.
divide(other) → pandas.core.frame.DataFrame

Divide the estimates of two fitter objects of the same type.

Parameters: other (same object as self)
event_table
fit(durations, event_observed=None, timeline=None, label=None, alpha=None, ci_labels=None, show_progress=False, entry=None, weights=None, initial_point=None) → ParametricUnivariateFitter

Fit the model to a right-censored dataset.

Parameters: durations (an array, or pd.Series) – length n, duration subject was observed for. event_observed (numpy array or pd.Series, optional) – length n, True if the death was observed, False if the event was lost (right-censored). Defaults to all True if event_observed is None. timeline (list, optional) – return the estimate at the values in timeline (positively increasing). label (string, optional) – a string to name the column of the estimate. alpha (float, optional) – the alpha value in the confidence intervals. Overrides the initializing alpha for this call to fit only. ci_labels (list, optional) – add custom column names to the generated confidence intervals as a length-2 list: [lower-bound name, upper-bound name].
fit_interval_censoring(lower_bound, upper_bound, event_observed=None, timeline=None, label=None, alpha=None, ci_labels=None, show_progress=False, entry=None, weights=None, initial_point=None) → ParametricUnivariateFitter

Fit the model to an interval censored dataset.

Parameters: lower_bound (an array, or pd.Series) – length n, the start of the period the subject experienced the event in. upper_bound (an array, or pd.Series) – length n, the end of the period the subject experienced the event in. If the value is equal to the corresponding value in lower_bound, then the individual’s event was observed (not censored). event_observed (numpy array or pd.Series, optional) – length n; if not provided, inferred from lower_bound and upper_bound (if lower_bound == upper_bound then the event was observed; if lower_bound < upper_bound, then the event is censored). timeline (list, optional) – return the estimate at the values in timeline (positively increasing). label (string, optional) – a string to name the column of the estimate. alpha (float, optional) – the alpha value in the confidence intervals. Overrides the initializing alpha for this call to fit only. ci_labels (list, optional) – add custom column names to the generated confidence intervals as a length-2 list: [lower-bound name, upper-bound name].
fit_left_censoring(durations, event_observed=None, timeline=None, label=None, alpha=None, ci_labels=None, show_progress=False, entry=None, weights=None, initial_point=None) → ParametricUnivariateFitter

Fit the model to a left-censored dataset.

Parameters: durations (an array, or pd.Series) – length n, duration the subject was observed for. event_observed (numpy array or pd.Series, optional) – length n, True if the death was observed, False if the event was lost (right-censored). Defaults to all True if event_observed is None. timeline (list, optional) – return the estimate at the values in timeline (positively increasing). label (string, optional) – a string to name the column of the estimate. alpha (float, optional) – the alpha value in the confidence intervals. Overrides the initializing alpha for this call to fit only. ci_labels (list, optional) – add custom column names to the generated confidence intervals as a length-2 list: [<lower-bound name>, <upper-bound name>]. Default: <label>_lower_<1-alpha/2>
fit_right_censoring(*args, **kwargs)

Alias for fit

See also: fit

hazard_at_times(times, label=None) → pandas.core.series.Series

Return a Pandas series of the predicted hazard at specific times.

Parameters: times (iterable or float) – values to return the hazard at. label (string, optional) – Rename the series returned. Useful for plotting.
median_survival_time_

Return the unique time point, t, such that S(t) = 0.5. This is the “half-life” of the population, and a robust summary statistic for the population, if it exists.

percentile(p)

Return the unique time point, t, such that S(t) = p.

Parameters: p (float)

Note

For known parametric models, this should be overwritten by something more accurate.

plot(**kwargs)

Produce a pretty-plot of the estimate.

plot_cumulative_density(**kwargs)
plot_cumulative_hazard(**kwargs)
plot_hazard(**kwargs)
plot_survival_function(**kwargs)
predict(times: Union[Iterable[float], float], interpolate=False) → pandas.core.series.Series

Predict the fitter's primary estimate (e.g. the survival function or cumulative hazard) at certain points in time. Uses a linear interpolation if the points in time are not in the index.

Parameters: times (scalar, or array) – a scalar or an array of times to predict the value of the estimate at. interpolate (bool, optional (default=False)) – for methods that produce a stepwise solution (Kaplan-Meier, Nelson-Aalen, etc.), setting this to True will use a linear interpolation to provide a smoother answer.
print_summary(decimals=2, style=None, **kwargs)

Print summary statistics describing the fit, the coefficients, and the error bounds.

Parameters: decimals (int, optional (default=2)) – specify the number of decimal places to show style (string) – {html, ascii, latex} kwargs – print additional metadata in the output (useful to provide model names, dataset names, etc.) when comparing multiple outputs.
subtract(other) → pandas.core.frame.DataFrame

Subtract the primary estimates of two fitters of the same type.

Parameters: other (same object as self)
summary

Summary statistics describing the fit.

See also: print_summary

survival_function_at_times(times, label=None) → pandas.core.series.Series

Return a Pandas series of the predicted survival value at specific times.

Parameters: times (iterable or float) – values to return the survival function at. label (string, optional) – Rename the series returned. Useful for plotting.

#### LogNormalFitter¶

class lifelines.fitters.log_normal_fitter.LogNormalFitter(*args, **kwargs)

Bases: lifelines.fitters.KnownModelParametricUnivariateFitter

This class implements a log-normal model for univariate data. The model has the parameterized form:

$S(t) = 1 - \Phi\left(\frac{\log(t) - \mu}{\sigma}\right), \;\; \sigma >0$

where $$\Phi$$ is the CDF of a standard normal random variable. This implies the cumulative hazard is

$H(t) = -\log\left(1 - \Phi\left(\frac{\log(t) - \mu}{\sigma}\right)\right)$

After calling the .fit method, you have access to properties like survival_function_, mu_ and sigma_. A summary of the fit is available with the method print_summary().

Parameters: alpha (float, optional (default=0.05)) – the level in the confidence intervals.
cumulative_hazard_

The estimated cumulative hazard (with custom timeline if provided)

Type: DataFrame
hazard_

The estimated hazard (with custom timeline if provided)

Type: DataFrame
survival_function_

The estimated survival function (with custom timeline if provided)

Type: DataFrame
cumulative_density_

The estimated cumulative density function (with custom timeline if provided)

Type: DataFrame
variance_matrix_

The variance matrix of the coefficients

Type: numpy array
median_survival_time_

The median time to event

Type: float
mu_

The fitted parameter in the model

Type: float
sigma_

The fitted parameter in the model

Type: float
durations

The durations provided

Type: array
event_observed

The event_observed variable provided

Type: array
timeline

The time line to use for plotting and indexing

Type: array
entry

The entry array provided, or None

Type: array or None
conditional_time_to_event_

Return a DataFrame, with index equal to survival_function_, that estimates the median duration remaining until the death event, given survival up until time t. For example, if an individual survives until age 1, their expected remaining lifetime, given they lived to time 1, might be 9 years.

confidence_interval_

The confidence interval of the cumulative hazard. This is an alias for confidence_interval_cumulative_hazard_.

confidence_interval_cumulative_density_

The lower and upper confidence intervals for the cumulative density

confidence_interval_cumulative_hazard_

The confidence interval of the cumulative hazard. This is an alias for confidence_interval_.

confidence_interval_hazard_

The confidence interval of the hazard.

confidence_interval_survival_function_

The lower and upper confidence intervals for the survival function

cumulative_density_at_times(times, label=None) → pandas.core.series.Series

Return a Pandas series of the predicted cumulative density function (1-survival function) at specific times.

Parameters: times (iterable or float) – values to return the survival function at. label (string, optional) – Rename the series returned. Useful for plotting.
cumulative_hazard_at_times(times, label=None) → pandas.core.series.Series

Return a Pandas series of the predicted cumulative hazard value at specific times.

Parameters: times (iterable or float) – values to return the cumulative hazard at. label (string, optional) – Rename the series returned. Useful for plotting.
divide(other) → pandas.core.frame.DataFrame

Divide the primary estimates of two fitters of the same type.

Parameters: other (same object as self)
event_table
fit(durations, event_observed=None, timeline=None, label=None, alpha=None, ci_labels=None, show_progress=False, entry=None, weights=None, initial_point=None) → ParametricUnivariateFitter
Parameters: durations (an array, or pd.Series) – length n, duration the subject was observed for. event_observed (numpy array or pd.Series, optional) – length n, True if the death was observed, False if the event was lost (right-censored). Defaults to all True if event_observed is None. timeline (list, optional) – return the estimate at the values in timeline (positively increasing). label (string, optional) – a string to name the column of the estimate. alpha (float, optional) – the alpha value in the confidence intervals. Overrides the initializing alpha for this call to fit only. ci_labels (list, optional) – add custom column names to the generated confidence intervals as a length-2 list: [<lower-bound name>, <upper-bound name>]. Default: <label>_lower_<1-alpha/2>
fit_interval_censoring(lower_bound, upper_bound, event_observed=None, timeline=None, label=None, alpha=None, ci_labels=None, show_progress=False, entry=None, weights=None, initial_point=None) → ParametricUnivariateFitter

Fit the model to an interval censored dataset.

Parameters: lower_bound (an array, or pd.Series) – length n, the start of the period during which the subject experienced the event. upper_bound (an array, or pd.Series) – length n, the end of the period during which the subject experienced the event. If a value is equal to the corresponding value in lower_bound, then that individual's event was observed (not censored). event_observed (numpy array or pd.Series, optional) – length n; if omitted, inferred from lower_bound and upper_bound (if lower_bound == upper_bound the event was observed; if lower_bound < upper_bound the event was censored). timeline (list, optional) – return the estimate at the values in timeline (positively increasing). label (string, optional) – a string to name the column of the estimate. alpha (float, optional) – the alpha value in the confidence intervals. Overrides the initializing alpha for this call to fit only. ci_labels (list, optional) – add custom column names to the generated confidence intervals as a length-2 list: [<lower-bound name>, <upper-bound name>]. Default: <label>_lower_<1-alpha/2>
fit_left_censoring(durations, event_observed=None, timeline=None, label=None, alpha=None, ci_labels=None, show_progress=False, entry=None, weights=None, initial_point=None) → ParametricUnivariateFitter

Fit the model to a left-censored dataset.

Parameters: durations (an array, or pd.Series) – length n, duration the subject was observed for. event_observed (numpy array or pd.Series, optional) – length n, True if the death was observed, False if the event was lost (right-censored). Defaults to all True if event_observed is None. timeline (list, optional) – return the estimate at the values in timeline (positively increasing). label (string, optional) – a string to name the column of the estimate. alpha (float, optional) – the alpha value in the confidence intervals. Overrides the initializing alpha for this call to fit only. ci_labels (list, optional) – add custom column names to the generated confidence intervals as a length-2 list: [<lower-bound name>, <upper-bound name>]. Default: <label>_lower_<1-alpha/2>
fit_right_censoring(*args, **kwargs)

Alias for fit

See also: fit

hazard_at_times(times, label=None) → pandas.core.series.Series

Return a Pandas series of the predicted hazard at specific times.

Parameters: times (iterable or float) – values to return the hazard at. label (string, optional) – Rename the series returned. Useful for plotting.
median_survival_time_

Return the unique time point, t, such that S(t) = 0.5. This is the “half-life” of the population, and a robust summary statistic for the population, if it exists.

percentile(p)

Return the unique time point, t, such that S(t) = p.

Parameters: p (float)

Note

For known parametric models, this should be overwritten by something more accurate.

plot(**kwargs)

Produce a pretty-plot of the estimate.

plot_cumulative_density(**kwargs)
plot_cumulative_hazard(**kwargs)
plot_hazard(**kwargs)
plot_survival_function(**kwargs)
predict(times: Union[Iterable[float], float], interpolate=False) → pandas.core.series.Series

Predict the fitter's primary estimate (e.g. the survival function or cumulative hazard) at certain points in time. Uses a linear interpolation if the points in time are not in the index.

Parameters: times (scalar, or array) – a scalar or an array of times to predict the value of the estimate at. interpolate (bool, optional (default=False)) – for methods that produce a stepwise solution (Kaplan-Meier, Nelson-Aalen, etc.), setting this to True will use a linear interpolation to provide a smoother answer.
print_summary(decimals=2, style=None, **kwargs)

Print summary statistics describing the fit, the coefficients, and the error bounds.

Parameters: decimals (int, optional (default=2)) – specify the number of decimal places to show style (string) – {html, ascii, latex} kwargs – print additional metadata in the output (useful to provide model names, dataset names, etc.) when comparing multiple outputs.
subtract(other) → pandas.core.frame.DataFrame

Subtract the primary estimates of two fitters of the same type.

Parameters: other (same object as self)
summary

Summary statistics describing the fit.

See also: print_summary

survival_function_at_times(times, label=None) → pandas.core.series.Series

Return a Pandas series of the predicted survival value at specific times.

Parameters: times (iterable or float) – values to return the survival function at. label (string, optional) – Rename the series returned. Useful for plotting.

#### NelsonAalenFitter¶

class lifelines.fitters.nelson_aalen_fitter.NelsonAalenFitter(alpha=0.05, nelson_aalen_smoothing=True, **kwargs)

Bases: lifelines.fitters.UnivariateFitter

Class for fitting the Nelson-Aalen estimate for the cumulative hazard.

NelsonAalenFitter(alpha=0.05, nelson_aalen_smoothing=True)

Parameters: alpha (float, optional (default=0.05)) – The alpha value associated with the confidence intervals. nelson_aalen_smoothing (bool, optional) – If the event times are naturally discrete (like discrete years, minutes, etc.) then it is advisable to set this parameter to False. See Aalen et al. (2008), pg. 84.

Notes

 Aalen, O., Borgan, O., Gjessing, H., 2008. Survival and Event History Analysis

cumulative_hazard_

The estimated cumulative hazard (with custom timeline if provided)

Type: DataFrame
confidence_interval_

The lower and upper confidence intervals for the cumulative hazard

Type: DataFrame
durations

The durations provided

Type: array
event_observed

The event_observed variable provided

Type: array
timeline

The time line to use for plotting and indexing

Type: array
entry

The entry array provided, or None

Type: array or None
event_table

A summary of the life table

Type: DataFrame
conditional_time_to_event_

Return a DataFrame, with index equal to survival_function_, that estimates the median duration remaining until the death event, given survival up until time t. For example, if an individual survives until age 1, their expected remaining lifetime, given they lived to time 1, might be 9 years.

cumulative_density_at_times(times, label=None)
cumulative_hazard_at_times(times, label=None)
divide(other) → pandas.core.frame.DataFrame

Divide the primary estimates of two fitters of the same type.

Parameters: other (same object as self)
fit(durations, event_observed=None, timeline=None, entry=None, label=None, alpha=None, ci_labels=None, weights=None)
Parameters: durations (an array, or pd.Series, of length n) – duration the subject was observed for. timeline (iterable) – return the best estimate at the values in timeline (positively increasing). event_observed (an array, or pd.Series, of length n) – True if the death was observed, False if the event was lost (right-censored). Defaults to all True if event_observed is None. entry (an array, or pd.Series, of length n) – relative time when a subject entered the study. This is useful for left-truncated observations, i.e. the birth event was not observed. If None, defaults to all 0 (all birth events observed). label (string) – a string to name the column of the estimate. alpha (float) – the alpha value in the confidence intervals. Overrides the initializing alpha for this call to fit only. ci_labels (iterable) – add custom column names to the generated confidence intervals as a length-2 list: [<lower-bound name>, <upper-bound name>]. Default: <label>_lower_<1-alpha/2>
fit_right_censoring(*args, **kwargs)

Alias for fit

See also: fit

hazard_at_times(times, label=None)
median_survival_time_

Return the unique time point, t, such that S(t) = 0.5. This is the “half-life” of the population, and a robust summary statistic for the population, if it exists.

percentile(p)

Return the unique time point, t, such that S(t) = p.

Parameters: p (float)

Note

For known parametric models, this should be overwritten by something more accurate.

plot(**kwargs)

Plots a pretty figure of the model

Matplotlib plot arguments can be passed in inside the kwargs, plus

Parameters: show_censors (bool) – place markers at censorship events. Default: False. censor_styles (dict) – if show_censors, this dictionary will be passed into the plot call. ci_alpha (float) – the transparency level of the confidence interval. Default: 0.3. ci_force_lines (bool) – force the confidence intervals to be line plots (versus default shaded areas). Default: False. ci_show (bool) – show confidence intervals. Default: True. ci_legend (bool) – if ci_force_lines is True, this is a boolean flag to add the lines' labels to the legend. Default: False. at_risk_counts (bool) – show group sizes at time points. See the function add_at_risk_counts for details. Default: False. loc (slice) – specify a time-based subsection of the curves to plot, e.g. model.plot(loc=slice(0., 10.)) will plot the time values between t=0 and t=10. iloc (slice) – specify a location-based subsection of the curves to plot, e.g. model.plot(iloc=slice(0, 10)) will plot the first 10 time points. ax (pyplot axis object) – a matplotlib axis to draw on.
plot_cumulative_density(**kwargs)
plot_cumulative_hazard(**kwargs)
plot_hazard(**kwargs)
plot_survival_function(**kwargs)
predict(times: Union[Iterable[float], float], interpolate=False) → pandas.core.series.Series

Predict the fitter's primary estimate (e.g. the survival function or cumulative hazard) at certain points in time. Uses a linear interpolation if the points in time are not in the index.

Parameters: times (scalar, or array) – a scalar or an array of times to predict the value of the estimate at. interpolate (bool, optional (default=False)) – for methods that produce a stepwise solution (Kaplan-Meier, Nelson-Aalen, etc.), setting this to True will use a linear interpolation to provide a smoother answer.
smoothed_hazard_(bandwidth)
Parameters: bandwidth (float) – the bandwidth used in the Epanechnikov kernel. Returns: a DataFrame of the smoothed hazard.
smoothed_hazard_confidence_intervals_(bandwidth, hazard_=None)
Parameters: bandwidth (float) – the bandwidth to use in the Epanechnikov kernel; > 0. hazard_ (numpy array) – a computed (n,) numpy array of estimated hazard rates. If None, uses smoothed_hazard_.
subtract(other) → pandas.core.frame.DataFrame

Subtract the primary estimates of two fitters of the same type.

Parameters: other (same object as self)
survival_function_at_times(times, label=None)

#### PiecewiseExponentialFitter¶

class lifelines.fitters.piecewise_exponential_fitter.PiecewiseExponentialFitter(breakpoints, *args, **kwargs)

Bases: lifelines.fitters.KnownModelParametricUnivariateFitter

This class implements a piecewise exponential model for univariate data. The model has the parameterized hazard rate:

$\begin{split}h(t) = \begin{cases} 1/\lambda_0 & \text{if } t \le \tau_0 \\ 1/\lambda_1 & \text{if } \tau_0 < t \le \tau_1 \\ 1/\lambda_2 & \text{if } \tau_1 < t \le \tau_2 \\ \vdots \end{cases}\end{split}$

You specify the breakpoints, $$\tau_i$$, and lifelines will find the optimal values of the parameters.

After calling the .fit method, you have access to properties like survival_function_, plot and cumulative_hazard_. A summary of the fit is available with the method print_summary().

Parameters: breakpoints (list) – a list of times when a new exponential model is constructed. alpha (float, optional (default=0.05)) – the level in the confidence intervals.

Important

The parameterization of this model changed in lifelines 0.19.1. Previously, the cumulative hazard looked like $$\lambda_i t$$. The parameterization is now the reciprocal of $$\lambda_i$$.

cumulative_hazard_

The estimated cumulative hazard (with custom timeline if provided)

Type: DataFrame
hazard_

The estimated hazard (with custom timeline if provided)

Type: DataFrame
survival_function_

The estimated survival function (with custom timeline if provided)

Type: DataFrame
cumulative_density_

The estimated cumulative density function (with custom timeline if provided)

Type: DataFrame
variance_matrix_

The variance matrix of the coefficients

Type: numpy array
median_survival_time_

The median time to event

Type: float
lambda_i_

The fitted parameter in the model, for i = 0, 1 … n-1 breakpoints

Type: float
durations

The durations provided

Type: array
event_observed

The event_observed variable provided

Type: array
timeline

The time line to use for plotting and indexing

Type: array
entry

The entry array provided, or None

Type: array or None
breakpoints

The provided breakpoints

Type: array
conditional_time_to_event_

Return a DataFrame, with index equal to survival_function_, that estimates the median duration remaining until the death event, given survival up until time t. For example, if an individual survives until age 1, their expected remaining lifetime, given they lived to time 1, might be 9 years.

confidence_interval_

The confidence interval of the cumulative hazard. This is an alias for confidence_interval_cumulative_hazard_.

confidence_interval_cumulative_density_

The lower and upper confidence intervals for the cumulative density

confidence_interval_cumulative_hazard_

The confidence interval of the cumulative hazard. This is an alias for confidence_interval_.

confidence_interval_hazard_

The confidence interval of the hazard.

confidence_interval_survival_function_

The lower and upper confidence intervals for the survival function

cumulative_density_at_times(times, label=None) → pandas.core.series.Series

Return a Pandas series of the predicted cumulative density function (1-survival function) at specific times.

Parameters: times (iterable or float) – values to return the survival function at. label (string, optional) – Rename the series returned. Useful for plotting.
cumulative_hazard_at_times(times, label=None) → pandas.core.series.Series

Return a Pandas series of the predicted cumulative hazard value at specific times.

Parameters: times (iterable or float) – values to return the cumulative hazard at. label (string, optional) – Rename the series returned. Useful for plotting.
divide(other) → pandas.core.frame.DataFrame

Divide the primary estimates of two fitters of the same type.

Parameters: other (same object as self)
event_table
fit(durations, event_observed=None, timeline=None, label=None, alpha=None, ci_labels=None, show_progress=False, entry=None, weights=None, initial_point=None) → ParametricUnivariateFitter
Parameters: durations (an array, or pd.Series) – length n, duration the subject was observed for. event_observed (numpy array or pd.Series, optional) – length n, True if the death was observed, False if the event was lost (right-censored). Defaults to all True if event_observed is None. timeline (list, optional) – return the estimate at the values in timeline (positively increasing). label (string, optional) – a string to name the column of the estimate. alpha (float, optional) – the alpha value in the confidence intervals. Overrides the initializing alpha for this call to fit only. ci_labels (list, optional) – add custom column names to the generated confidence intervals as a length-2 list: [<lower-bound name>, <upper-bound name>]. Default: <label>_lower_<1-alpha/2>
fit_interval_censoring(lower_bound, upper_bound, event_observed=None, timeline=None, label=None, alpha=None, ci_labels=None, show_progress=False, entry=None, weights=None, initial_point=None) → ParametricUnivariateFitter

Fit the model to an interval censored dataset.

Parameters: lower_bound (an array, or pd.Series) – length n, the start of the period during which the subject experienced the event. upper_bound (an array, or pd.Series) – length n, the end of the period during which the subject experienced the event. If a value is equal to the corresponding value in lower_bound, then that individual's event was observed (not censored). event_observed (numpy array or pd.Series, optional) – length n; if omitted, inferred from lower_bound and upper_bound (if lower_bound == upper_bound the event was observed; if lower_bound < upper_bound the event was censored). timeline (list, optional) – return the estimate at the values in timeline (positively increasing). label (string, optional) – a string to name the column of the estimate. alpha (float, optional) – the alpha value in the confidence intervals. Overrides the initializing alpha for this call to fit only. ci_labels (list, optional) – add custom column names to the generated confidence intervals as a length-2 list: [<lower-bound name>, <upper-bound name>]. Default: <label>_lower_<1-alpha/2>
fit_left_censoring(durations, event_observed=None, timeline=None, label=None, alpha=None, ci_labels=None, show_progress=False, entry=None, weights=None, initial_point=None) → ParametricUnivariateFitter

Fit the model to a left-censored dataset.

Parameters: durations (an array, or pd.Series) – length n, duration the subject was observed for. event_observed (numpy array or pd.Series, optional) – length n, True if the death was observed, False if the event was lost (right-censored). Defaults to all True if event_observed is None. timeline (list, optional) – return the estimate at the values in timeline (positively increasing). label (string, optional) – a string to name the column of the estimate. alpha (float, optional) – the alpha value in the confidence intervals. Overrides the initializing alpha for this call to fit only. ci_labels (list, optional) – add custom column names to the generated confidence intervals as a length-2 list: [<lower-bound name>, <upper-bound name>]. Default: <label>_lower_<1-alpha/2>
fit_right_censoring(*args, **kwargs)

Alias for fit

See also: fit

hazard_at_times(times, label=None) → pandas.core.series.Series

Return a Pandas series of the predicted hazard at specific times.

Parameters: times (iterable or float) – values to return the hazard at. label (string, optional) – Rename the series returned. Useful for plotting.
median_survival_time_

Return the unique time point, t, such that S(t) = 0.5. This is the “half-life” of the population, and a robust summary statistic for the population, if it exists.

percentile(p: float) → float

Return the unique time point, t, such that S(t) = p.

Parameters: p (float)

Note

For known parametric models, this should be overwritten by something more accurate.

plot(**kwargs)

Produce a pretty-plot of the estimate.

plot_cumulative_density(**kwargs)
plot_cumulative_hazard(**kwargs)
plot_hazard(**kwargs)
plot_survival_function(**kwargs)
predict(times: Union[Iterable[float], float], interpolate=False) → pandas.core.series.Series

Predict the fitter's primary estimate (e.g. the survival function or cumulative hazard) at certain points in time. Uses a linear interpolation if the points in time are not in the index.

Parameters: times (scalar, or array) – a scalar or an array of times to predict the value of the estimate at. interpolate (bool, optional (default=False)) – for methods that produce a stepwise solution (Kaplan-Meier, Nelson-Aalen, etc.), setting this to True will use a linear interpolation to provide a smoother answer.
print_summary(decimals=2, style=None, **kwargs)

Print summary statistics describing the fit, the coefficients, and the error bounds.

Parameters: decimals (int, optional (default=2)) – specify the number of decimal places to show style (string) – {html, ascii, latex} kwargs – print additional metadata in the output (useful to provide model names, dataset names, etc.) when comparing multiple outputs.
subtract(other) → pandas.core.frame.DataFrame

Subtract the primary estimates of two fitters of the same type.

Parameters: other (same object as self)
summary

Summary statistics describing the fit.

See also: print_summary

survival_function_at_times(times, label=None) → pandas.core.series.Series

Return a Pandas series of the predicted survival value at specific times.

Parameters: times (iterable or float) – values to return the survival function at. label (string, optional) – Rename the series returned. Useful for plotting.

#### SplineFitter¶

class lifelines.fitters.spline_fitter.SplineFitter(knot_locations: numpy.ndarray, *args, **kwargs)

Bases: lifelines.fitters.spline_fitter.SplineFitterMixin, lifelines.fitters.KnownModelParametricUnivariateFitter

Model the cumulative hazard using cubic splines. This offers great flexibility and smoothness of the cumulative hazard.

$H(t) = \exp\left(\phi_0 + \phi_1\log{t} + \sum_{j=2}^N \phi_j v_j(\log{t})\right)$

where $$v_j$$ are our cubic basis functions at predetermined knots. See references for exact definition.

Parameters: knot_locations (list, np.array) – The locations of the cubic breakpoints. Typically, the first knot is the minimum observed death, the last knot is the maximum observed death, and the knots in between are the centiles of observed data (ex: if one additional knot, choose the 50th percentile, the median. If two additional knots, choose the 33rd and 66th percentiles).

References

Royston, P., & Parmar, M. K. B. (2002). Flexible parametric proportional-hazards and proportional-odds models for censored survival data, with application to prognostic modelling and estimation of treatment effects. Statistics in Medicine, 21(15), 2175–2197. doi:10.1002/sim.1203

Examples

>>> import numpy as np
>>> from lifelines import SplineFitter
>>> from lifelines.datasets import load_waltons
>>> waltons = load_waltons()
>>> T, E = waltons['T'], waltons['E']
>>> knots = np.percentile(T.loc[E.astype(bool)], [0, 50, 100])
>>> sf = SplineFitter(knots)
>>> sf.fit(T, E)
>>> sf.plot()
>>> print(sf.knot_locations)

cumulative_hazard_

The estimated cumulative hazard (with custom timeline if provided)

Type: DataFrame
hazard_

The estimated hazard (with custom timeline if provided)

Type: DataFrame
survival_function_

The estimated survival function (with custom timeline if provided)

Type: DataFrame
cumulative_density_

The estimated cumulative density function (with custom timeline if provided)

Type: DataFrame
variance_matrix_

The variance matrix of the coefficients

Type: numpy array
median_survival_time_

The median time to event

Type: float
lambda_

The fitted parameter in the model

Type: float
rho_

The fitted parameter in the model

Type: float
durations

The durations provided

Type: array
event_observed

The event_observed variable provided

Type: array
timeline

The time line to use for plotting and indexing

Type: array
entry

The entry array provided, or None

Type: array or None
knot_locations

The locations of the cubic breakpoints.

basis(x, knot, min_knot, max_knot)
conditional_time_to_event_

Return a DataFrame, with index equal to survival_function_, that estimates the median duration remaining until the death event, given survival up until time t. For example, if an individual survives until age 1, their expected remaining lifetime, given they lived to time 1, might be 9 years.

confidence_interval_

The confidence interval of the cumulative hazard. This is an alias for confidence_interval_cumulative_hazard_.

confidence_interval_cumulative_density_

The lower and upper confidence intervals for the cumulative density

confidence_interval_cumulative_hazard_

The confidence interval of the cumulative hazard. This is an alias for confidence_interval_.

confidence_interval_hazard_

The confidence interval of the hazard.

confidence_interval_survival_function_

The lower and upper confidence intervals for the survival function

cumulative_density_at_times(times, label=None) → pandas.core.series.Series

Return a Pandas series of the predicted cumulative density function (1-survival function) at specific times.

Parameters: times (iterable or float) – values to return the survival function at. label (string, optional) – Rename the series returned. Useful for plotting.
cumulative_hazard_at_times(times, label=None) → pandas.core.series.Series

Return a Pandas series of the predicted cumulative hazard value at specific times.

Parameters: times (iterable or float) – values to return the cumulative hazard at. label (string, optional) – Rename the series returned. Useful for plotting.
divide(other) → pandas.core.frame.DataFrame

Divide the primary estimates of two fitters of the same type.

Parameters: other (same object as self)
event_table
fit(durations, event_observed=None, timeline=None, label=None, alpha=None, ci_labels=None, show_progress=False, entry=None, weights=None, initial_point=None) → ParametricUnivariateFitter
Parameters: durations (an array, or pd.Series) – length n, duration the subject was observed for. event_observed (numpy array or pd.Series, optional) – length n, True if the death was observed, False if the event was lost (right-censored). Defaults to all True if event_observed is None. timeline (list, optional) – return the estimate at the values in timeline (positively increasing). label (string, optional) – a string to name the column of the estimate. alpha (float, optional) – the alpha value in the confidence intervals. Overrides the initializing alpha for this call to fit only. ci_labels (list, optional) – add custom column names to the generated confidence intervals as a length-2 list: [<lower-bound name>, <upper-bound name>]. Default: <label>_lower_<1-alpha/2>
fit_interval_censoring(lower_bound, upper_bound, event_observed=None, timeline=None, label=None, alpha=None, ci_labels=None, show_progress=False, entry=None, weights=None, initial_point=None) → ParametricUnivariateFitter

Fit the model to an interval censored dataset.

Parameters:

• lower_bound (an array, or pd.Series) – length n, the start of the period the subject experienced the event in.
• upper_bound (an array, or pd.Series) – length n, the end of the period the subject experienced the event in. If the value is equal to the corresponding value in lower_bound, then the individual’s event was observed (not censored).
• event_observed (numpy array or pd.Series, optional) – length n. If left as None, inferred from lower_bound and upper_bound (if lower_bound == upper_bound then the event was observed; if lower_bound < upper_bound, then the event is censored).
• timeline (list, optional) – return the estimate at the values in timeline (positively increasing).
• label (string, optional) – a string to name the column of the estimate.
• alpha (float, optional) – the alpha value in the confidence intervals. Overrides the initializing alpha for this call to fit only.
• ci_labels (list, optional) – add custom column names to the generated confidence intervals as a length-2 list: [, ]. Default:
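The inference rule for event_observed described above (equal bounds mean the event was observed, strictly ordered bounds mean interval censoring) can be sketched in plain Python; the arrays here are invented for illustration:

```python
lower_bound = [3.0, 5.0, 5.0, 7.0]
upper_bound = [3.0, 8.0, 5.0, float("inf")]

# lower == upper  -> event observed exactly at that time
# lower <  upper  -> event only known to fall in the interval: censored
event_observed = [lo == hi for lo, hi in zip(lower_bound, upper_bound)]
print(event_observed)  # [True, False, True, False]
```

An infinite upper bound, as in the last subject, expresses ordinary right censoring within the interval-censoring framework.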
fit_left_censoring(durations, event_observed=None, timeline=None, label=None, alpha=None, ci_labels=None, show_progress=False, entry=None, weights=None, initial_point=None) → ParametricUnivariateFitter

Fit the model to a left-censored dataset

Parameters: durations (an array, or pd.Series) – length n, duration subject was observed for event_observed (numpy array or pd.Series, optional) – length n, True if the the death was observed, False if the event was lost (right-censored). Defaults all True if event_observed==None timeline (list, optional) – return the estimate at the values in timeline (positively increasing) label (string, optional) – a string to name the column of the estimate. alpha (float, optional) – the alpha value in the confidence intervals. Overrides the initializing alpha for this call to fit only. ci_labels (list, optional) – add custom column names to the generated confidence intervals as a length-2 list: [, ]. Default:
fit_right_censoring(*args, **kwargs)

Alias for fit().

hazard_at_times(times, label=None) → pandas.core.series.Series

Return a Pandas series of the predicted hazard at specific times.

Parameters: times (iterable or float) – values to return the hazard at. label (string, optional) – Rename the series returned. Useful for plotting.
median_survival_time_

Return the unique time point, t, such that S(t) = 0.5. This is the “half-life” of the population, and a robust summary statistic for the population, if it exists.

percentile(p: float) → float

Return the unique time point, t, such that S(t) = p.

Parameters: p (float)

Note

For known parametric models, this should be overwritten by something more accurate.
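For a parametric model such as the Weibull, the inversion behind percentile() is available in closed form: solving S(t) = exp(-(t/λ)^ρ) = p gives t = λ(-ln p)^{1/ρ}. A sketch of that inversion (not lifelines code; the parameter values are arbitrary):

```python
import math

def weibull_survival(t, lam=10.0, rho=2.0):
    # S(t) = exp(-(t/lam)**rho)
    return math.exp(-((t / lam) ** rho))

def weibull_percentile(p, lam=10.0, rho=2.0):
    # Invert S(t) = p  =>  t = lam * (-ln p)**(1/rho)
    return lam * (-math.log(p)) ** (1.0 / rho)

# p = 0.5 recovers the median survival time
t_median = weibull_percentile(0.5)
assert abs(weibull_survival(t_median) - 0.5) < 1e-12
```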

plot(**kwargs)

Produce a pretty-plot of the estimate.

plot_cumulative_density(**kwargs)
plot_cumulative_hazard(**kwargs)
plot_hazard(**kwargs)
plot_survival_function(**kwargs)
predict(times: Union[Iterable[float], float], interpolate=False) → pandas.core.series.Series

Predict the estimate at certain points in time. Uses a linear interpolation if the points in time are not in the index.

Parameters:

• times (scalar, or array) – a scalar or an array of times to predict the value of the estimate at.
• interpolate (bool, optional (default=False)) – for methods that produce a stepwise solution (Kaplan-Meier, Nelson-Aalen, etc.), setting this to True will use a linear interpolation to provide a smoother answer.
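A sketch of the two behaviors: a stepwise (Kaplan-Meier-like) lookup returns the value at the last index point at or before t, while interpolation draws a straight line between the surrounding index points. The index and values below are invented for illustration:

```python
import bisect

index = [0.0, 2.0, 5.0, 9.0]   # times at which the estimate is known
values = [1.0, 0.8, 0.5, 0.2]  # a made-up survival estimate

def predict_step(t):
    # Stepwise: value at the last known time <= t
    i = bisect.bisect_right(index, t) - 1
    return values[max(i, 0)]

def predict_interpolated(t):
    # Linear interpolation between the two surrounding index points
    i = bisect.bisect_right(index, t) - 1
    if i >= len(index) - 1:
        return values[-1]
    i = max(i, 0)
    frac = (t - index[i]) / (index[i + 1] - index[i])
    return values[i] + frac * (values[i + 1] - values[i])

print(predict_step(3.5))          # 0.8 (the step stays flat until the next jump)
print(predict_interpolated(3.5))  # 0.65 (halfway between 0.8 and 0.5)
```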
print_summary(decimals=2, style=None, **kwargs)

Print summary statistics describing the fit, the coefficients, and the error bounds.

Parameters:

• decimals (int, optional (default=2)) – specify the number of decimal places to show.
• style (string) – one of {html, ascii, latex}.
• kwargs – print additional metadata in the output (useful to provide model names, dataset names, etc. when comparing multiple outputs).
static relu(x)
subtract(other) → pandas.core.frame.DataFrame

Subtract the estimates of two fitter objects of the same type.

Parameters: other (same object as self)
summary

Summary statistics describing the fit.

See also: print_summary().

survival_function_at_times(times, label=None) → pandas.core.series.Series

Return a Pandas series of the predicted survival value at specific times.

Parameters: times (iterable or float) – values to return the survival function at. label (string, optional) – Rename the series returned. Useful for plotting.

#### WeibullFitter¶

class lifelines.fitters.weibull_fitter.WeibullFitter(*args, **kwargs)

Bases: lifelines.fitters.KnownModelParametricUnivariateFitter

This class implements a Weibull model for univariate data. The model has parameterized form:

$S(t) = \exp\left(-\left(\frac{t}{\lambda}\right)^\rho\right), \lambda > 0, \rho > 0,$

The $$\lambda$$ (scale) parameter has a useful interpretation: it represents the time by which approximately 63.2% of the population has died, since $$S(\lambda) = e^{-1} \approx 0.368$$. The $$\rho$$ (shape) parameter controls whether the cumulative hazard (see below) is convex or concave, representing accelerating or decelerating hazards. The cumulative hazard rate is

$H(t) = \left(\frac{t}{\lambda}\right)^\rho,$

and the hazard rate is:

$h(t) = \frac{\rho}{\lambda}\left(\frac{t}{\lambda}\right)^{\rho-1}$
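The three formulas above are mutually consistent: $$S(t) = \exp(-H(t))$$ and $$h(t)$$ is the derivative of $$H(t)$$. A quick numerical check (the parameter values are arbitrary):

```python
import math

lam, rho = 10.0, 2.0

def S(t): return math.exp(-((t / lam) ** rho))          # survival function
def H(t): return (t / lam) ** rho                        # cumulative hazard
def h(t): return (rho / lam) * (t / lam) ** (rho - 1)    # hazard rate

t = 4.0
# S(t) = exp(-H(t))
assert abs(S(t) - math.exp(-H(t))) < 1e-12
# h(t) ≈ dH/dt via a central difference
eps = 1e-6
dH = (H(t + eps) - H(t - eps)) / (2 * eps)
assert abs(h(t) - dH) < 1e-6
```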

After calling the .fit method, you have access to properties like: cumulative_hazard_, survival_function_, lambda_ and rho_. A summary of the fit is available with the method print_summary().

Parameters: alpha (float, optional (default=0.05)) – the level in the confidence intervals.

Important

The parameterization of this model changed in lifelines 0.19.0. Previously, the cumulative hazard looked like $$(\lambda t)^\rho$$. The parameterization is now the reciprocal of $$\lambda$$.
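To see the reparameterization concretely: if λ_old denotes the pre-0.19.0 scale, the same cumulative hazard is obtained in the current parameterization with λ_new = 1/λ_old. A quick check with arbitrary numbers:

```python
lam_old = 0.1            # pre-0.19.0 scale parameter
lam_new = 1.0 / lam_old  # current parameterization: reciprocal of the old one
rho = 2.0
t = 4.0

H_old = (lam_old * t) ** rho   # old form: (lambda * t) ** rho
H_new = (t / lam_new) ** rho   # new form: (t / lambda) ** rho
assert abs(H_old - H_new) < 1e-12
```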

Examples

>>> from lifelines import WeibullFitter
>>> from lifelines.datasets import load_waltons
>>> waltons = load_waltons()
>>> wbf = WeibullFitter()
>>> wbf.fit(waltons['T'], waltons['E'])
>>> wbf.plot()
>>> print(wbf.lambda_)

cumulative_hazard_

The estimated cumulative hazard (with custom timeline if provided)

Type: DataFrame
hazard_

The estimated hazard (with custom timeline if provided)

Type: DataFrame
survival_function_

The estimated survival function (with custom timeline if provided)

Type: DataFrame
cumulative_density_

The estimated cumulative density function (with custom timeline if provided)

Type: DataFrame
variance_matrix_

The variance matrix of the coefficients

Type: numpy array
median_survival_time_

The median time to event

Type: float
lambda_

The fitted parameter in the model

Type: float
rho_

The fitted parameter in the model

Type: float
durations

The durations provided

Type: array
event_observed

The event_observed variable provided

Type: array
timeline

The time line to use for plotting and indexing

Type: array
entry

The entry array provided, or None

Type: array or None

Notes

Looking for a 3-parameter Weibull model? See notes here.

conditional_time_to_event_

Return a DataFrame, with index equal to survival_function_, that estimates the median duration remaining until the death event, given survival up until time t. For example, if an individual survives until age 1, their median remaining lifetime given survival to time 1 might be 9 years.
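The quantity behind conditional_time_to_event_ is the t' solving S(t + t')/S(t) = 0.5. For the Weibull form this can be inverted in closed form; a sketch (not lifelines code, with arbitrary parameter values):

```python
import math

lam, rho = 10.0, 1.5

def S(t):
    # Weibull-form survival function
    return math.exp(-((t / lam) ** rho))

def median_time_remaining(t):
    # Solve S(t + dt) / S(t) = 0.5 for dt:
    # ((t + dt)/lam)**rho = (t/lam)**rho + ln 2
    target = (t / lam) ** rho + math.log(2.0)
    return lam * target ** (1.0 / rho) - t

dt = median_time_remaining(1.0)
# Check: the conditional survival at t + dt is exactly one half
assert abs(S(1.0 + dt) / S(1.0) - 0.5) < 1e-12
```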

confidence_interval_

The confidence interval of the cumulative hazard. This is an alias for confidence_interval_cumulative_hazard_.

confidence_interval_cumulative_density_

The lower and upper confidence intervals for the cumulative density

confidence_interval_cumulative_hazard_

The confidence interval of the cumulative hazard. This is an alias for confidence_interval_.

confidence_interval_hazard_

The confidence interval of the hazard.

confidence_interval_survival_function_

The lower and upper confidence intervals for the survival function

cumulative_density_at_times(times, label=None) → pandas.core.series.Series

Return a Pandas series of the predicted cumulative density function (1-survival function) at specific times.

Parameters: times (iterable or float) – values to return the survival function at. label (string, optional) – Rename the series returned. Useful for plotting.
cumulative_hazard_at_times(times, label=None) → pandas.core.series.Series

Return a Pandas series of the predicted cumulative hazard value at specific times.

Parameters: times (iterable or float) – values to return the cumulative hazard at. label (string, optional) – Rename the series returned. Useful for plotting.
divide(other) → pandas.core.frame.DataFrame

Divide the estimates of two fitter objects of the same type.

Parameters: other (same object as self)
event_table
fit(durations, event_observed=None, timeline=None, label=None, alpha=None, ci_labels=None, show_progress=False, entry=None, weights=None, initial_point=None) → ParametricUnivariateFitter
Parameters:

• durations (an array, or pd.Series) – length n, duration each subject was observed for.
• event_observed (numpy array or pd.Series, optional) – length n, True if the death was observed, False if the event was lost (right-censored). Defaults to all True if event_observed is None.
• timeline (list, optional) – return the estimate at the values in timeline (positively increasing).
• label (string, optional) – a string to name the column of the estimate.
• alpha (float, optional) – the alpha value in the confidence intervals. Overrides the initializing alpha for this call to fit only.
• ci_labels (list, optional) – add custom column names to the generated confidence intervals as a length-2 list: [, ]. Default:
fit_interval_censoring(lower_bound, upper_bound, event_observed=None, timeline=None, label=None, alpha=None, ci_labels=None, show_progress=False, entry=None, weights=None, initial_point=None) → ParametricUnivariateFitter

Fit the model to an interval censored dataset.

Parameters:

• lower_bound (an array, or pd.Series) – length n, the start of the period the subject experienced the event in.
• upper_bound (an array, or pd.Series) – length n, the end of the period the subject experienced the event in. If the value is equal to the corresponding value in lower_bound, then the individual’s event was observed (not censored).
• event_observed (numpy array or pd.Series, optional) – length n. If left as None, inferred from lower_bound and upper_bound (if lower_bound == upper_bound then the event was observed; if lower_bound < upper_bound, then the event is censored).
• timeline (list, optional) – return the estimate at the values in timeline (positively increasing).
• label (string, optional) – a string to name the column of the estimate.
• alpha (float, optional) – the alpha value in the confidence intervals. Overrides the initializing alpha for this call to fit only.
• ci_labels (list, optional) – add custom column names to the generated confidence intervals as a length-2 list: [, ]. Default:
fit_left_censoring(durations, event_observed=None, timeline=None, label=None, alpha=None, ci_labels=None, show_progress=False, entry=None, weights=None, initial_point=None) → ParametricUnivariateFitter

Fit the model to a left-censored dataset

Parameters:

• durations (an array, or pd.Series) – length n, duration each subject was observed for.
• event_observed (numpy array or pd.Series, optional) – length n, True if the death was observed, False if the event was lost (right-censored). Defaults to all True if event_observed is None.
• timeline (list, optional) – return the estimate at the values in timeline (positively increasing).
• label (string, optional) – a string to name the column of the estimate.
• alpha (float, optional) – the alpha value in the confidence intervals. Overrides the initializing alpha for this call to fit only.
• ci_labels (list, optional) – add custom column names to the generated confidence intervals as a length-2 list: [, ]. Default:
fit_right_censoring(*args, **kwargs)

Alias for fit().

hazard_at_times(times, label=None) → pandas.core.series.Series

Return a Pandas series of the predicted hazard at specific times.

Parameters: times (iterable or float) – values to return the hazard at. label (string, optional) – Rename the series returned. Useful for plotting.
median_survival_time_

Return the unique time point, t, such that S(t) = 0.5. This is the “half-life” of the population, and a robust summary statistic for the population, if it exists.

percentile(p) → float

Return the unique time point, t, such that S(t) = p.

Parameters: p (float)

Note

For known parametric models, this should be overwritten by something more accurate.

plot(**kwargs)

Produce a pretty-plot of the estimate.

plot_cumulative_density(**kwargs)
plot_cumulative_hazard(**kwargs)
plot_hazard(**kwargs)
plot_survival_function(**kwargs)
predict(times: Union[Iterable[float], float], interpolate=False) → pandas.core.series.Series

Predict the estimate at certain points in time. Uses a linear interpolation if the points in time are not in the index.

Parameters:

• times (scalar, or array) – a scalar or an array of times to predict the value of the estimate at.
• interpolate (bool, optional (default=False)) – for methods that produce a stepwise solution (Kaplan-Meier, Nelson-Aalen, etc.), setting this to True will use a linear interpolation to provide a smoother answer.
print_summary(decimals=2, style=None, **kwargs)

Print summary statistics describing the fit, the coefficients, and the error bounds.

Parameters:

• decimals (int, optional (default=2)) – specify the number of decimal places to show.
• style (string) – one of {html, ascii, latex}.
• kwargs – print additional metadata in the output (useful to provide model names, dataset names, etc. when comparing multiple outputs).
subtract(other) → pandas.core.frame.DataFrame

Subtract the estimates of two fitter objects of the same type.

Parameters: other (same object as self)
summary

Summary statistics describing the fit.

See also: print_summary().

survival_function_at_times(times, label=None) → pandas.core.series.Series

Return a Pandas series of the predicted survival value at specific times.

Parameters: times (iterable or float) – values to return the survival function at. label (string, optional) – Rename the series returned. Useful for plotting.

### Regression models¶

class lifelines.fitters.aalen_additive_fitter.AalenAdditiveFitter(fit_intercept=True, alpha=0.05, coef_penalizer=0.0, smoothing_penalizer=0.0)

Bases: lifelines.fitters.BaseFitter

This class fits the regression model:

$h(t|x) = b_0(t) + b_1(t) x_1 + ... + b_N(t) x_N$

that is, the hazard rate is a linear function of the covariates with time-varying coefficients. This implementation assumes non-time-varying covariates; see TODO: name
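The regression above evaluates a time-indexed dot product: at each time t, the hazard is the intercept b_0(t) plus the covariate-weighted sum of the time-varying coefficients. A minimal sketch with made-up coefficient paths (none of these values come from lifelines):

```python
# Made-up time-varying coefficients b_i(t) at three time points
times = [1.0, 2.0, 3.0]
b0 = [0.10, 0.12, 0.15]   # baseline hazard (intercept)
b1 = [0.02, 0.03, 0.03]   # coefficient path for covariate x1
b2 = [-0.01, 0.00, 0.01]  # coefficient path for covariate x2

x1, x2 = 2.0, 1.0  # one subject's static covariates

# h(t | x) = b0(t) + b1(t) * x1 + b2(t) * x2, evaluated at each time point
hazard = [b0[i] + b1[i] * x1 + b2[i] * x2 for i in range(len(times))]
print(hazard)  # approximately [0.13, 0.18, 0.22]
```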

Note

This class was rewritten in lifelines 0.17.0 to focus solely on static datasets. There is no guarantee of backwards compatibility.

Parameters:

• fit_intercept (bool, optional (default=True)) – If False, do not attach an intercept (column of ones) to the covariate matrix. The intercept, $$b_0(t)$$, acts as a baseline hazard.
• alpha (float, optional (default=0.05)) – the level in the confidence intervals.
• coef_penalizer (float, optional (default=0)) – Attach an L2 penalizer to the size of the coefficients during regression. This improves stability of the estimates and controls for high correlation between covariates. For example, this shrinks the magnitude of $$c_{i,t}$$.
• smoothing_penalizer (float, optional (default=0)) – Attach an L2 penalizer to the difference between adjacent (over time) coefficients. For example, this shrinks the magnitude of $$c_{i,t} - c_{i,t+1}$$.
cumulative_hazards_

The estimated cumulative hazard

Type: DataFrame
hazards_

The estimated hazards

Type: DataFrame
confidence_intervals_

The lower and upper confidence intervals for the cumulative hazard

Type: DataFrame
durations

The durations provided

Type: array
event_observed

The event_observed variable provided

Type: array
weights

The weights provided

Type: array
fit(df, duration_col, event_col=None, weights_col=None, show_progress=False)
Fit the Aalen Additive model to a dataset.

Parameters:

• df (DataFrame) – a Pandas DataFrame with necessary columns duration_col and event_col (see below), covariate columns, and special columns (weights). duration_col refers to the lifetimes of the subjects. event_col refers to whether the ‘death’ event was observed: 1 if observed, 0 else (censored).
• duration_col (string) – the name of the column in DataFrame that contains the subjects’ lifetimes.
• event_col (string, optional) – the name of the column in DataFrame that contains the subjects’ death observation. If left as None, assume all individuals are uncensored.
• weights_col (string, optional) – an optional column in the DataFrame, df, that denotes the weight per subject. This column is expelled and not used as a covariate, but as a weight in the final regression. Default weight is 1. For example, a weight of 2 means there were two subjects with identical observations. This can be used for case weights or sampling weights.
• show_progress (bool, optional (default=False)) – Since the fitter is iterative, show the iteration number.

Returns: self (AalenAdditiveFitter) – self with additional new properties: cumulative_hazards_, etc.

Examples

>>> from lifelines import AalenAdditiveFitter
>>> import pandas as pd
>>>
>>> df = pd.DataFrame({
>>>     'T': [5, 3, 9, 8, 7, 4, 4, 3, 2, 5, 6, 7],
>>>     'E': [1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0],
>>>     'var': [0, 0, 0, 0, 1, 1, 1, 1, 1, 2, 2, 2],
>>>     'age': [4, 3, 9, 8, 7, 4, 4, 3, 2, 5, 6, 7],
>>> })
>>>
>>> aaf = AalenAdditiveFitter()
>>> aaf.fit(df, 'T', 'E')
>>> aaf.predict_median(df)
>>> aaf.print_summary()

fit_right_censoring(*args, **kwargs)

Alias for fit

fit

plot(columns=None, loc=None, iloc=None, ax=None, **kwargs)

A wrapper around plotting. Matplotlib plot arguments can be passed in, plus:

Parameters:

• columns (string or list-like, optional) – If not empty, plot a subset of columns from cumulative_hazards_. Default: all.
• loc, iloc (slice, optional) – specify a location-based subsection of the curves to plot, e.g. .plot(iloc=slice(0,10)) will plot the first 10 time points.
predict_cumulative_hazard(X)

Returns the cumulative hazard of the individuals.

Parameters: X (a (n,d) covariate numpy array or DataFrame) – If a DataFrame, columns can be in any order. If a numpy array, columns must be in the same order as the training data.
predict_expectation(X)

Compute the expected lifetime, E[T], using covariates X.

Parameters: X (a (n,d) covariate numpy array or DataFrame) – If a DataFrame, columns can be in any order. If a numpy array, columns must be in the same order as the training data.

Returns: the expected lifetimes for the individuals.
predict_median(X)
Parameters: X (a (n,d) covariate numpy array or DataFrame) – If a DataFrame, columns can be in any order. If a numpy array, columns must be in the same order as the training data.

Returns: the median lifetimes for the individuals.
predict_percentile(X, p=0.5)

Returns the p-th percentile of lifetimes for the individuals (the median when p=0.5). See http://stats.stackexchange.com/questions/102986/percentile-loss-functions

Parameters:

• X (a (n,d) covariate numpy array or DataFrame) – If a DataFrame, columns can be in any order. If a numpy array, columns must be in the same order as the training data.
• p (float) – default: 0.5
predict_survival_function(X, times=None)

Returns the survival functions for the individuals

Parameters: X (a (n,d) covariate numpy array or DataFrame) – If a DataFrame, columns can be in any order. If a numpy array, columns must be in the same order as the training data. times – Not implemented yet
print_summary(decimals=2, style=None, **kwargs)

Print summary statistics describing the fit, the coefficients, and the error bounds.

Parameters:

• decimals (int, optional (default=2)) – specify the number of decimal places to show.
• style (string) – one of {html, ascii, latex}.