Survive

Survival Analysis in Python

Python Version PyPI Package Version Travis CI Build Status Code Coverage Documentation Status GitHub License

Survive is a Python 3 package built on top of NumPy and pandas that provides statistical tools for the analysis of survival, lifetime, and event data.

Installation

The latest version of Survive can be installed directly after cloning from GitHub.

git clone https://github.com/artemmavrin/survive.git
cd survive
make install

Moreover, Survive is on the Python Package Index (PyPI), so a recent version of it can be installed with the pip tool.

python -m pip install survive

Dependencies

Survive relies on the following scientific computing packages.

Examples Using Survive

This page lists some small case studies using the Survive package.

Open an interactive online version of this notebook: Binder badge

Leukemia Remission Times

These data are the times of remission (in weeks) of leukemia patients. Out of the 42 total patients, 21 were in a control group, and the other 21 were in a treatment group. Patients were observed until their leukemia symptoms relapsed or until the study ended, whichever occurred first. Each patient in the control group experienced relapse before the study ended, while 12 patients in the treatment group did not come out of remission during the study. Thus, there is heavy right-censoring in the treatment group and no right-censoring in the control group.

One of the questions to ask about this dataset is whether the treatment prolonged the time until relapse. Formally, we are interested in whether there is a statistical difference between the time-to-relapse distributions of the control and treatment groups. In this notebook we will use the survive package to investigate this question.

In [1]:
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style="darkgrid", palette="colorblind", color_codes=True)

from survive import datasets
from survive import SurvivalData
from survive import KaplanMeier, Breslow, NelsonAalen
Loading the dataset

The leukemia() function in the survive.datasets module loads a pandas DataFrame containing the leukemia data. The columns of this DataFrame are

  • time - The patients’ observed leukemia remission times (in weeks).
  • status - Event/censoring indicator: 1 indicates that the patient’s leukemia relapsed, and 0 indicates that the study ended before relapse.
  • group - Indicates whether a patient is from the control or treatment group.
In [2]:
leukemia = datasets.leukemia()
display(leukemia.head())
time status group
patient
0 1 1 control
1 1 1 control
2 2 1 control
3 2 1 control
4 3 1 control
Exploratory data analysis with SurvivalData

The SurvivalData class is a fundamental class for storing and dealing with survival/lifetime data. It is aware of groups within the data and allows quick access to various important quantities (like the number of events or the number of individuals at risk at a certain time).

If your survival data is stored in a pandas DataFrame (like the leukemia data is), then a SurvivalData object can be created by specifying the DataFrame and the names of the columns corresponding to the observed times, censoring indicators, and group labels.

In [3]:
surv = SurvivalData(time="time", status="status", group="group", data=leukemia)

Alternatively, you may specify one-dimensional arrays of observed times, censoring indicators, and group labels directly. This is so that your can use SurvivalData even if your data aren’t stored in a DataFrame.

In [4]:
# Equivalent to the constructor call above
surv = SurvivalData(time=leukemia.time, status=leukemia.status,
                    group=leukemia.group)
Describing the data

Printing a SurvivalData object shows the observed survival times within each group. Censored times are marked by a plus by default (indicating that the true survival time for that individual might be longer).

In [5]:
print(surv)
control

 1  1  2  2  3  4  4  5  5  8  8  8  8 11 11 12 12 15 17 22 23

treatment

 6   6   6   6+  7   9+ 10  10+ 11+ 13  16  17+ 19+ 20+ 22  23  25+ 32+ 32+
34+ 35+

The describe property of a SurvivalData object is a pandas DataFrame containing simple descriptive statistics of the survival data.

In [6]:
display(surv.describe)
total events censored
group
control 21 21 0
treatment 21 9 12
Visualizing the survival data

The plot_lifetimes() method of a SurvivalData object plots the observed lifetimes of all the individuals in the data. Censored individuals are marked at the end of their lifespan.

In [7]:
plt.figure(figsize=(10, 6))
surv.plot_lifetimes()
plt.show()
plt.close()
_images/examples_Leukemia_Remission_Time_Dataset_14_0.png

There are many longer remission times observed in the treatment group. However, while this observation is encouraging, it is is not enough evidence to guarantee a statistically significance treatment effect.

Computing the number of events and number of individuals at risk

You can compute the number of events that occured at a given time within each group using the n_events() method, which returns a pandas DataFrame.

In [8]:
display(surv.n_events([1, 2, 3, 4, 5, 6, 7, 8, 9, 10]))
group control treatment
time
1 2 0
2 2 0
3 1 0
4 2 0
5 2 0
6 0 3
7 0 1
8 4 0
9 0 0
10 0 1

In a survival study, the number of individuals “at risk” at any given time is defined to be the number of individuals who have entered the study by that time and have not yet experienced an event or censoring immediately before that time. This number over time is called the at-risk process.

You can compute the number of individuals at risk within each group at a given time using the n_at_risk() method. Like n_events(), this method also returns a DataFrame.

In [9]:
display(surv.n_at_risk([0, 5, 10, 20, 25, 30, 35]))
group control treatment
time
0 0 0
5 14 21
10 8 15
20 2 8
25 0 5
30 0 4
35 0 1
Plotting the at-risk process

You can plot the at-risk process using the plot_at_risk() method of a SurvivalData object.

In [10]:
plt.figure(figsize=(10, 6))
surv.plot_at_risk()
plt.show()
plt.close()
_images/examples_Leukemia_Remission_Time_Dataset_21_0.png
Estimating the survival function with KaplanMeier
Kaplan-Meier estimator

The Kaplan-Meier estimator (AKA product limit estimator) is a nonparametric estimator of the survival function of the time-to-event distribution that can be used even in the presence of right-censoring.

The KaplanMeier class implements the Kaplan-Meier estimator.

Initializing the estimator

You can initialize a KaplanMeier object with no parameters.

In [11]:
# Kaplan-Meier estimator to be used for the leukemia data
km = KaplanMeier()

Now km is a Kaplan-Meier estimator waiting to be fitted to survival data. We didn’t pass any parameters to the initializer of the Kaplan-Meier estimator, but we could have. Printing a KaplanMeier object shows what initializer parameter values were used for that object (and default values for parameters that weren’t specified explicitly).

In [12]:
print(km)
KaplanMeier(conf_level=0.95, conf_type='log-log', n_boot=500,
            random_state=None, tie_break='discrete', var_type='greenwood')

We’ll use these default parameters.

Fitting the estimator to the leukemia data

We can fit our Kaplan-Meier estimator to the leukemia data using the fit() method. There are a few ways of doing this, but the easiest is to pass it an existing SurvivalData instance.

In [13]:
km.fit(surv)
Out[13]:
KaplanMeier(conf_level=0.95, conf_type='log-log', n_boot=500,
            random_state=None, tie_break='discrete', var_type='greenwood')

The other ways to call fit() are described in the method’s docstring. Note that fit() fits the estimator in-place and returns the estimator itself.

Summarizing the fit

Once the estimator is fitted, the summary property of a KaplanMeier object tabulates the survival probability estimates and thier standard error and confidence intervals for the event times within each group. It can be printed to display all the information at once.

In [14]:
print(km.summary)
Kaplan-Meier estimator

control

total  events  censored

   21      21         0

time  events  at risk  estimate  std. error  95% c.i. lower  95% c.i. upper
   1       2       21  0.904762    0.064056        0.670046        0.975294
   2       2       19  0.809524    0.085689        0.568905        0.923889
   3       1       17  0.761905    0.092943        0.519391        0.893257
   4       2       16  0.666667    0.102869        0.425350        0.825044
   5       2       14  0.571429    0.107990        0.337977        0.749241
   8       4       12  0.380952    0.105971        0.183067        0.577789
  11       2        8  0.285714    0.098581        0.116561        0.481820
  12       2        6  0.190476    0.085689        0.059482        0.377435
  15       1        4  0.142857    0.076360        0.035657        0.321162
  17       1        3  0.095238    0.064056        0.016259        0.261250
  22       1        2  0.047619    0.046471        0.003324        0.197045
  23       1        1  0.000000         NaN             NaN             NaN

treatment

total  events  censored

   21       9        12

time  events  at risk  estimate  std. error  95% c.i. lower  95% c.i. upper
   6       3       21  0.857143    0.076360        0.619718        0.951552
   7       1       17  0.806723    0.086935        0.563147        0.922809
  10       1       15  0.752941    0.096350        0.503200        0.889362
  13       1       12  0.690196    0.106815        0.431610        0.849066
  16       1       11  0.627451    0.114054        0.367511        0.804912
  22       1        7  0.537815    0.128234        0.267779        0.746791
  23       1        6  0.448179    0.134591        0.188052        0.680143

Note: The NaNs (not a number) appearing in the summary are caused by the standard error estimates not being defined when the survival function estimate is indentically zero. This is expected behavior.

Visualizing the fit

The estimated survival curves for the two groups can be drawn using the plot() method of the KaplanMeier object. By default, censored times in the sample are indicated by plus signs on the curve.

In [15]:
plt.figure(figsize=(10, 6))
km.plot()
plt.show()
plt.close()
_images/examples_Leukemia_Remission_Time_Dataset_36_0.png

If you prefer R-style confidence intervals (the kind drawn by plot.survfit in the survival package, for example), then you can set ci_style="lines" to get similar dashed-line confidence interval curves.

In [16]:
_, axes = plt.subplots(nrows=1, ncols=2, figsize=(10, 6))
for ax, group, color in zip(axes.flat, surv.group_labels, ("b", "g")):
    km.plot(group, ci_style="lines", color=color, ax=ax)
    ax.set(title=f"Kaplan-Meier estimator ({group} group)")
plt.tight_layout()
plt.show()
plt.close()
_images/examples_Leukemia_Remission_Time_Dataset_38_0.png

The plot seems to indicate that the patients in the treatment group are more likely to have longer remission times than patients in the control group.

Estimating survival probabilities

The predict() method of a KaplanMeier object returns a pandas DataFrame of estimated probabiltiies for surviving past a certain time for each group.

In [17]:
estimate = km.predict([5, 10, 15, 20, 25])
display(estimate)
group control treatment
time
5 0.571429 1.000000
10 0.380952 0.752941
15 0.142857 0.690196
20 0.095238 0.627451
25 0.000000 0.448179
Estimating time-to-event distribution quantiles

The quantile() function of a KaplanMeier object returns a pandas DataFrame of empirical quantile estimates for the time-to-relapse distribution

In [18]:
quantiles = km.quantile([0.25, 0.5, 0.75])
display(quantiles)
group control treatment
prob
0.25 4.0 13.0
0.50 8.0 23.0
0.75 12.0 NaN
Alternative: the Breslow estimator

The Breslow estimator is another nonparametric survival function estimator, defined as the exponential of the negative of the Nelson-Aalen cumulative hazard function estimator. It is implemented in the Breslow class, and its API is the same as the KaplanMeier API.

In [19]:
breslow = Breslow()
breslow.fit(surv)
Out[19]:
Breslow(conf_level=0.95, conf_type='log', tie_break='discrete',
        var_type='aalen')
In [20]:
print(breslow.summary)
Breslow estimator

control

total  events  censored

   21      21         0

time  events  at risk  estimate  std. error  95% c.i. lower  95% c.i. upper
   1       2       21  0.909156    0.061226        0.683312        0.976463
   2       2       19  0.818320    0.082140        0.585744        0.927595
   3       1       17  0.771572    0.089766        0.535382        0.897953
   4       2       16  0.680910    0.099488        0.445010        0.833243
   5       2       14  0.590266    0.104848        0.360455        0.761574
   8       4       12  0.422944    0.103020        0.223432        0.610118
  11       2        8  0.329389    0.099135        0.151236        0.520541
  12       2        6  0.236018    0.090224        0.088391        0.423449
  15       1        4  0.183811    0.083959        0.056501        0.368440
  17       1        3  0.131706    0.074475        0.030135        0.309303
  22       1        2  0.079884    0.060298        0.010694        0.244792
  23       1        1  0.029388    0.036820        0.000845        0.172353

treatment

total  events  censored

   21       9        12

time  events  at risk  estimate  std. error  95% c.i. lower  95% c.i. upper
   6       3       21  0.866878    0.071499        0.642147        0.954971
   7       1       17  0.817356    0.082803        0.582866        0.927417
  10       1       15  0.764642    0.092731        0.521681        0.895238
  13       1       12  0.703505    0.103518        0.449985        0.856517
  16       1       11  0.642371    0.111107        0.385958        0.814031
  22       1        7  0.556857    0.124920        0.289195        0.758612
  23       1        6  0.471369    0.131732        0.210555        0.695534
In [21]:
plt.figure(figsize=(10, 6))
breslow.plot()
plt.show()
plt.close()
_images/examples_Leukemia_Remission_Time_Dataset_47_0.png
Estimating the cumulative hazard with NelsonAalen
Nelson-Aalen estimator

The Nelson-Aalen estimator is a nonparametric estimator of the cumulative hazard of the time-to-event distribution. The NelsonAalen class implements this estimator. Its API is nearly identical to the KaplanMeier API described above.

Initializing and fitting the estimator

You can initialize a NelsonAalen object with no parameters and call the fit() methon on the leukemia data.

In [22]:
na = NelsonAalen()
na.fit(surv)
Out[22]:
NelsonAalen(conf_level=0.95, conf_type='log', tie_break='discrete',
            var_type='aalen')
Visualizing the estimated cumulative hazard

The estimated cumulative hazard for the two groups can be drawn using the plot() method of the NelsonAalen object. By default, censored times in the sample are indicated by plus signs on the curve.

In [23]:
plt.figure(figsize=(10, 6))
na.plot()
plt.show()
plt.close()
_images/examples_Leukemia_Remission_Time_Dataset_53_0.png
Open an interactive online version of this notebook: Binder badge

Channing House Data

This is the channing dataset in the R package boot. From the package description:

Channing House is a retirement centre in Palo Alto, California. These data were collected between the opening of the house in 1964 until July 1, 1975. In that time 97 men and 365 women passed through the centre. For each of these, their age on entry and also on leaving or death was recorded. A large number of the observations were censored mainly due to the resident being alive on July 1, 1975 when the data was collected. Over the time of the study 130 women and 46 men died at Channing House. Differences between the survival of the sexes, taking age into account, was one of the primary concerns of this study.”

These data feature left truncation because residents entered Channing House at different ages, and their lifetimes were not observed before entry. This is a biased sampling problem since there are no observations on individuals who died before potentially entering Channing House.

In [1]:
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style="darkgrid", palette="colorblind", color_codes=True)

from survive import datasets
from survive import SurvivalData
from survive import KaplanMeier, NelsonAalen
Loading the Dataset

The channing() function in the survive.datasets module loads a pandas.DataFrame containing the Channing House data. The columns of this DataFrame are

  • sex - Sex of each resident (male or female).
  • entry - The resident’s age (in months) on entry to the centre.
  • exit - The age (in months) of the resident on death, leaving the centre or July 1, 1975 whichever event occurred first.
  • time - The length of time (in months) that the resident spent at Channing House (this is exit - entry).
  • status - Right-censoring indicator. 1 indicates that the resident died at Channing House, 0 indicates that they left the house prior to July 1, 1975 or that they were still alive and living in the centre at that date.
In [2]:
channing = datasets.channing()
channing.head()
Out[2]:
sex entry exit time status
resident
0 male 782 909 127 1
1 male 1020 1128 108 1
2 male 856 969 113 1
3 male 915 957 42 1
4 male 863 983 120 1
Exploratory Data Analysis

We use the Channing House data to create a SurvivalData object.

In [3]:
surv = SurvivalData(time="exit", entry="entry", status="status", group="sex",
                    data=channing)
print(surv)
female

 798+  804   804+  812+  819+  821+  822   824+  825+  829+  830   836+
 840   845   848+  848+  854+  857+  860+  861   861+  868   870+  870+
 872+  873   874+  875+  876+  882+  883   885   888+  891+  891+  892+
 893+  895   895+  897   897+  898+  899+  901   904+  905   905   905+
 905+  905+  906+  908   908   911   912+  912+  912+  913+  914+  915
 916+  917+  918+  919   919+  922+  923   924+  925+  926   926+  926+
 926+  927+  927+  927+  928   928+  929+  930   930+  931   932   932+
 932+  932+  933+  934   934+  936   938+  938+  938+  939+  939+  940
 940+  941   942+  943+  944   944   944+  944+  945+  946+  947+  948
 948+  948+  950+  950+  952+  952+  953+  953+  954   954+  955+  955+
 955+  957+  957+  958+  958+  959   959+  959+  960+  961+  961+  962+
 963   963+  964+  965+  965+  966   967+  969   969   970   970+  971+
 971+  973+  973+  975   975+  975+  976   976+  976+  976+  977+  977+
 978   978+  979+  979+  979+  981+  982   982   982+  983   983+  984+
 985+  985+  985+  985+  986   986+  987+  988+  988+  989   989+  989+
 990   990   990   990   991   991+  992   992+  992+  993+  993+  994
 994   994+  994+  995   995   995+  996   996   996   996+  996+  997+
 998   998+  999  1000  1000+ 1001  1001+ 1001+ 1003  1003+ 1004  1004+
1005  1005+ 1005+ 1006  1006  1006+ 1006+ 1006+ 1008+ 1008+ 1009+ 1010
1010+ 1010+ 1010+ 1011  1011+ 1011+ 1012  1012  1012+ 1012+ 1012+ 1013
1013+ 1014  1014+ 1014+ 1014+ 1015  1015+ 1015+ 1016+ 1017  1018  1018
1019  1019  1019+ 1019+ 1019+ 1019+ 1020  1020+ 1020+ 1021  1022+ 1023
1023  1023+ 1023+ 1024  1024+ 1027  1028+ 1029  1029+ 1030  1030+ 1031+
1031+ 1032+ 1033  1033+ 1034+ 1035+ 1036+ 1036+ 1037+ 1038+ 1040  1040
1040  1040  1041  1041  1041  1042+ 1042+ 1043  1043+ 1044  1044+ 1047
1047+ 1049+ 1050+ 1051+ 1053+ 1054+ 1054+ 1055+ 1056  1056  1057+ 1059+
1061+ 1062+ 1063  1063+ 1064  1065+ 1068  1068  1070  1071+ 1071+ 1072
1072+ 1073  1073+ 1074  1080+ 1083  1084  1085  1085  1085+ 1086  1088+
1089  1091+ 1093  1097  1102+ 1105+ 1105+ 1109+ 1114+ 1115  1119+ 1122
1131  1132  1134+ 1142  1147+ 1152  1172  1172+ 1186+ 1192  1200  1200
1207+

male

 777   781   843+  866+  869   872   876   893   894   895+  898   906+
 907   909   911   911+  914+  927   932   936+  940+  943+  943+  945
 945+  948   951+  956+  957   957+  959+  960+  966   966+  969   970+
 971   972+  973+  977+  983   984+  985   989   993   993   996+  998
1001+ 1002+ 1005+ 1006+ 1009  1012  1012  1012+ 1013+ 1015+ 1016+ 1018+
1022  1023+ 1025  1027+ 1029  1031  1031+ 1031+ 1033  1036  1043  1043+
1044  1044+ 1045+ 1047+ 1053  1055  1058+ 1059  1060  1060+ 1064+ 1070+
1073+ 1080  1085  1093+ 1094  1094  1106+ 1107+ 1118+ 1128  1139  1153+
/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/survive-0.3-py3.7.egg/survive/survival_data.py:183: RuntimeWarning: Ignoring 5 observations where entry >= time.
  RuntimeWarning)

The warning here is due to the fact that five of the observations in the data have entry times that are the same or later than the corresponding final times. These observations consequently cannot be used.

In [4]:
# The five observations responsible for the warning above
display(channing.loc[channing.exit <= channing.entry])
sex entry exit time status
resident
56 male 953 953 0 0
351 female 957 957 0 0
372 female 944 944 0 0
373 female 935 935 0 0
433 female 959 912 53 1

The last line above doesn’t even make sense since the entry time (959) is greater than the exit time (912). The resident’s duration is 53, which suggests that the entry time is likely supposed to be 859 (after all, 859+53=912). Nevertheless, we will ignore this observation.

In [5]:
display(surv.describe)
total events censored
group
female 361 129 232
male 96 46 50
Plotting the At-Risk Process

Due to the left-truncation, the risk set size initially increases as residents enter Channing House and then decreases as residents die or are censored.

In [6]:
plt.figure(figsize=(10, 6))
colors = dict(female="b", male="g")
surv.plot_at_risk(colors=colors)
plt.show()
plt.close()
_images/examples_Channing_House_Dataset_11_0.png
Estimating the survival function

The out-of-the-box Kaplan-Meier estimator does a poor job estimating the survival function for the Channing House data because risk set of the male residents is zero very early on before growing. This causes the survival function estimate to be zero for most of the observed times, which is clearly wrong. We will show the problem graphically in this case and then discuss how to fix it.

In [7]:
km = KaplanMeier().fit(surv)

plt.figure(figsize=(10, 6))
km.plot(colors=colors)
plt.show()
plt.close()
_images/examples_Channing_House_Dataset_14_0.png

One way to address this issue is to condition on survival up to a later time, say 68 or 80 years (816 or 960 months).

In [8]:
km68 = KaplanMeier()
km68.fit(time="exit", entry="entry", status="status", group="sex",
         data=channing, min_time=816, warn=False)
Out[8]:
KaplanMeier(conf_level=0.95, conf_type='log-log', n_boot=500,
            random_state=None, tie_break='discrete', var_type='greenwood')
In [9]:
_, ax = plt.subplots(nrows=2, ncols=1, figsize=(10, 12))

km68.plot("female", color=colors["female"], ax=ax[0])
ax[0].set(title="Female Channing House residents (68 years or older)")
ax[0].set(xlabel="Age (months)")

km68.plot("male", color=colors["male"], ax=ax[1])
ax[1].set(title="Male Channing House residents (68 years or older)")
ax[1].set(xlabel="Age (months)")

plt.show()
plt.close()
_images/examples_Channing_House_Dataset_17_0.png
In [10]:
km80 = KaplanMeier()
km80.fit(time="exit", entry="entry", status="status", group="sex",
         data=channing, min_time=960)
Out[10]:
KaplanMeier(conf_level=0.95, conf_type='log-log', n_boot=500,
            random_state=None, tie_break='discrete', var_type='greenwood')
In [11]:
_, ax = plt.subplots(nrows=2, ncols=1, figsize=(10, 12))

km80.plot("female", color=colors["female"], ax=ax[0])
ax[0].set(title="Female Channing House residents (80 years or older)")
ax[0].set(xlabel="Age (months)")

km80.plot("male", color=colors["male"], ax=ax[1])
ax[1].set(title="Male Channing House residents (80 years or older)")
ax[1].set(xlabel="Age (months)")

plt.show()
plt.close()
_images/examples_Channing_House_Dataset_19_0.png
Estimating the cumulative hazard
In [12]:
na = NelsonAalen(var_type="greenwood")
na.fit(surv)
Out[12]:
NelsonAalen(conf_level=0.95, conf_type='log', tie_break='discrete',
            var_type='greenwood')
In [13]:
plt.figure(figsize=(10, 6))
na.plot(colors=colors)
plt.show()
plt.close()
_images/examples_Channing_House_Dataset_22_0.png

Leukemia Remission Times

This is a small dataset (42 total observations) separated into two groups with heavy right-censoring in one and none in the other.

Channing House Data

This is a slightly larger dataset (462 total observations) separated into two groups with right-censoring and left truncation in both.

Background and Theory

This page aims to quickly cover some basic survival analysis.

Note

This page is under construction.

Survival Analysis Setup

Suppose we are interested in a particular time-to-event distribution (e.g., time until death, time until disease occurrence or recovery, or time until failure in a mechanical system). The phrase “time-to-event” can mean nearly any positive quantity being measured, not necessarily time.

The time-to-event distribution is completely determined by its survival function \(S(t) = \Pr(X > t)\), where \(X\) is the time-to-event (a positive random variable), and \(t\) is a positive time. The survival function might be of interest itself since it answers questions like, “what is the probability that a cancer patient will survive at least five more years?” or “what is the probability that this machine part won’t break in the next six months?”

Suppose we have \(n\) individuals with independent and identically distributed event \(X_1, \ldots, X_n\) with survival function \(S\). If we could observe \(X_1, \ldots, X_n\), then one obvious candidate for estimating \(S(t)\) is the empirical survival function:

\[\widehat{S}(t) = \frac{1}{n} \sum_{i=1}^n \mathbf{1}_{\{X_i > t\}}.\]

This is just the number of individuals observed to survive past time \(t\) divided by the sample size \(n\). (Here \(\mathbf{1}_A\) denotes the indicator of an event \(A\).) This unbiased, consistent, and asymptotically normal estimator is nevertheless not suitable for many situations of interest in survival analysis. The main problem is that it relies on all the event times \(X_1,\ldots,X_n\) being observable.

Censoring and Truncation

It is common in practice that not all the times \(X_1, \ldots, X_n\) are observed. We list some possible examples.

  • A clinical trial investigating a disease treatment might end before a patient recovers from the disease, in which case the patient’s true recovery time is not known.
  • A patient whose time until death from cancer is being monitored might die from a different disease, so the patient’s death-from-cancer time will not be known.
  • In engineering, a reliability experiment might be stopped after a predetermined number of parts fail. A part that is still operational after this time will not have its failure time observed.

In these cases, the true time-to-event is not known, but a lower bound for it is available. This situation is called right-censoring.

Another source of incomplete information is left-truncation (also known as delayed entry), in which an individual’s time-to-event may only be observed if it exceeds a certain time. Some examples:

  • Patients for a certain disease might only be observed after the disease has been diagnosed. A patient who died from the disease before diagnosis is unknown to investigators.
  • If we are observing the age at death of residents in a retirement home, then we do not observe the ages at death of individuals who died before becoming residents.
  • If we are measuring the diameters of particles with a microscope, then only particles large enough to be detected by the microscope will be observed.

There are other types of censoring and truncation, but for now we will focus on right-censoring and left-truncation, the most common variants.

When there is right-censoring or left-truncation, the empirical survival function is not a suitable estimator of the survival function \(S\) because the event times \(X_1, \ldots, X_n\) are not known. A popular alternative is the Kaplan-Meier estimator, which we will define below.

Let us first formalize the sampling situation with right-censoring and left-truncation. Suppose as before that we have \(n\) individuals with with independent and identically distributed times-to-event \(X_1, \ldots, X_n\) with survival function \(S\). Moreover, suppose we observe a sample \((L_1, X_1^\prime, \delta_1), \ldots, (L_n, X_n^\prime, \delta_n)\), where \(L_i\) is the entry time of the \(i\)-th individual, \(X_i^\prime\) is the observed final time for the \(i\)-th individual (so \(X_i^\prime > L_i\)), and \(\delta_i\) is zero or one according to whether \(X_i^\prime < X_i\) (the \(i\)-th individual is censored) or \(X_i^\prime = X_i\) (the \(i\)-th individual’s time-to-event is observed).

The Counting Process Formulation

We define two stochastic processes associated with this sample: the event counting process

\[N(t) = \sum_{i=1}^n \mathbf{1}_{\{X_i^\prime \leq t, \delta_i = 1\}},\]

and the at-risk process

\[Y(t) = \sum_{i=1}^n \mathbf{1}_{\{L_i < t \leq X_i^\prime\}}.\]

The event counting process counts the number of events that have been observed to occur up to a certain time, and the at-risk process counts the number of individuals who are “at risk” (those who are already being observed but have not yet been censored or experienced the event of interest) at a certain time.

The trajectories (i.e., sample paths) of the event counting process are right-continuous with left limits (commonly called càdlàg). In fact, the event counting process is non-decreasing and piecewise constant with at most \(n\) jumps (the jumps happen at observed event times). Let \(0 < T_1 < T_2 < \cdots\) denote the ordered jump times. The size of the jump at time \(T_j\) is

\[\Delta N(T_j) = N(T_j) - N(T_j-),\]

where for \(t > 0\),

\[N(t-) = \lim_{s \uparrow t} N(s)\]

is a limit from the left. In other words, \(\Delta N(T_j)\) is the number of observed events at time \(T_j\). For convenience, also define \(T_0 = 0\).

Stochastic integrals with respect to \(N\) can be easily computed as sums:

\[\int_0^t H(s) \, dN(s) = \sum_{j : T_j \leq t} H(T_j) \Delta N(T_j).\]

The Kaplan-Meier Estimator

We will now give a derivation of the Kaplan-Meier survival function estimator. This will be an estimator of the survival function \(S(t) = \Pr(X > t)\), where \(X\) is the time-to-event, based on the right-censored and left-truncated sample \((L_1, X_1^\prime, \delta_1), \ldots, (L_n, X_n^\prime, \delta_n)\) described above.

First, observe that if we have times \(s < t\), then

\[\begin{split}S(t) &= \Pr(X > t) \\ &= \Pr(X > s) \Pr(X > t \mid X > s) \\ &= S(s) \Pr(X > t \mid X > s).\end{split}\]

Thus, it suffices to estimate the conditional probability \(\lambda(s, t) = \Pr(X > t \mid X > s)\) of surviving the time interval \((s, t]\) given survival up to time \(s\).

Todo

Finish this section.

API Reference

This page lists all the public classes and functions of Survive.

survive: Top-Level Module

Survival analysis in Python.

Classes
Breslow(*[, conf_type, conf_level, …]) Breslow nonparametric survival function estimator.
KaplanMeier(*[, conf_type, conf_level, …]) Kaplan-Meier nonparametric survival function estimator.
NelsonAalen(*[, conf_type, conf_level, …]) Nelson-Aalen nonparametric cumulative hazard estimator.
SurvivalData(time, *[, status, entry, …]) Class representing right-censored and left-truncated survival data.

survive.base: Base Classes

Base classes for survival analysis objects.

Classes
base.Fittable Abstract mixin class for models implementing a fit() method.
base.Model Abstract base class for survival models.
base.Predictor Abstract mixin class for models implementing a predict() method.
base.Summary(model) Base class for summaries of survival models (intended for subclassing).

survive.datasets: Survival Datasets

Functions for loading some survival datasets.

Functions
datasets.channing() Load the Channing House dataset.
datasets.leukemia() Load the leukemia dataset.

survive.nonparametric: Nonparametric Estimation

Nonparametric estimators.

Classes
nonparametric.NonparametricEstimator Abstract base class for nonparametric estimators.
nonparametric.NonparametricSurvival Abstract base class for nonparametric survival function estimators.
nonparametric.NonparametricEstimatorSummary(model) Summaries for nonparametric estimators.

survive.utils: Utility Functions

Miscellaneous utility functions.

Functions
utils.add_legend(ax[, legend_kwargs]) Add a legend to a plot.
utils.check_bool(tf, *[, allow_none]) Validate boolean function arguments.
utils.check_colors(colors, *[, n_colors, …]) Validate colors for plotting.
utils.check_data_1d(data, *[, numeric, …]) Preprocess and validate a one-dimensional array.
utils.check_data_2d(data, *[, numeric, …]) Preprocess and validate a two-dimensional array (i.e., a matrix).
utils.check_float(num, *[, positive, …]) Validate floating-point number function arguments.
utils.check_int(num, *[, minimum, maximum, …]) Validate integer function arguments.
utils.check_random_state(random_state) Validate random number generators and seeds.

Indices and tables