Survive¶
Survival Analysis in Python¶
Survive is a Python 3 package built on top of NumPy and pandas that provides statistical tools for the analysis of survival, lifetime, and event data.
Installation¶
The latest version of Survive can be installed directly after cloning from GitHub.
git clone https://github.com/artemmavrin/survive.git
cd survive
make install
Moreover, Survive is on the Python Package Index (PyPI), so a recent version of it can be installed with the pip tool.
python -m pip install survive
Dependencies¶
Survive relies on the following scientific computing packages.
Examples Using Survive¶
This page lists some small case studies using the Survive package.
Leukemia Remission Times¶
These data are the times of remission (in weeks) of leukemia patients. Out of the 42 total patients, 21 were in a control group, and the other 21 were in a treatment group. Patients were observed until their leukemia symptoms relapsed or until the study ended, whichever occurred first. Each patient in the control group experienced relapse before the study ended, while 12 patients in the treatment group did not come out of remission during the study. Thus, there is heavy right-censoring in the treatment group and no right-censoring in the control group.
One of the questions to ask about this dataset is whether the treatment
prolonged the time until relapse. Formally, we are interested in whether
there is a statistical difference between the time-to-relapse
distributions of the control and treatment groups. In this notebook we
will use the survive
package to investigate this question.
In [1]:
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style="darkgrid", palette="colorblind", color_codes=True)
from survive import datasets
from survive import SurvivalData
from survive import KaplanMeier, Breslow, NelsonAalen
Loading the dataset¶
The leukemia()
function in the survive.datasets
module loads a
pandas DataFrame containing the leukemia data. The columns of this
DataFrame are
time
- The patients’ observed leukemia remission times (in weeks).status
- Event/censoring indicator: 1 indicates that the patient’s leukemia relapsed, and 0 indicates that the study ended before relapse.group
- Indicates whether a patient is from the control or treatment group.
In [2]:
leukemia = datasets.leukemia()
display(leukemia.head())
time | status | group | |
---|---|---|---|
patient | |||
0 | 1 | 1 | control |
1 | 1 | 1 | control |
2 | 2 | 1 | control |
3 | 2 | 1 | control |
4 | 3 | 1 | control |
Exploratory data analysis with SurvivalData
¶
The SurvivalData
class is a fundamental class for storing and
dealing with survival/lifetime data. It is aware of groups within the
data and allows quick access to various important quantities (like the
number of events or the number of individuals at risk at a certain
time).
If your survival data is stored in a pandas DataFrame (like the leukemia
data is), then a SurvivalData
object can be created by specifying
the DataFrame and the names of the columns corresponding to the observed
times, censoring indicators, and group labels.
In [3]:
surv = SurvivalData(time="time", status="status", group="group", data=leukemia)
Alternatively, you may specify one-dimensional arrays of observed times,
censoring indicators, and group labels directly. This is so that your
can use SurvivalData
even if your data aren’t stored in a DataFrame.
In [4]:
# Equivalent to the constructor call above
surv = SurvivalData(time=leukemia.time, status=leukemia.status,
group=leukemia.group)
Describing the data¶
Printing a SurvivalData
object shows the observed survival times
within each group. Censored times are marked by a plus by default
(indicating that the true survival time for that individual might be
longer).
In [5]:
print(surv)
control
1 1 2 2 3 4 4 5 5 8 8 8 8 11 11 12 12 15 17 22 23
treatment
6 6 6 6+ 7 9+ 10 10+ 11+ 13 16 17+ 19+ 20+ 22 23 25+ 32+ 32+
34+ 35+
The describe
property of a SurvivalData
object is a pandas
DataFrame containing simple descriptive statistics of the survival data.
In [6]:
display(surv.describe)
total | events | censored | |
---|---|---|---|
group | |||
control | 21 | 21 | 0 |
treatment | 21 | 9 | 12 |
Visualizing the survival data¶
The plot_lifetimes()
method of a SurvivalData
object plots the
observed lifetimes of all the individuals in the data. Censored
individuals are marked at the end of their lifespan.
In [7]:
plt.figure(figsize=(10, 6))
surv.plot_lifetimes()
plt.show()
plt.close()

There are many longer remission times observed in the treatment group. However, while this observation is encouraging, it is is not enough evidence to guarantee a statistically significance treatment effect.
Computing the number of events and number of individuals at risk¶
You can compute the number of events that occured at a given time within
each group using the n_events()
method, which returns a pandas
DataFrame.
In [8]:
display(surv.n_events([1, 2, 3, 4, 5, 6, 7, 8, 9, 10]))
group | control | treatment |
---|---|---|
time | ||
1 | 2 | 0 |
2 | 2 | 0 |
3 | 1 | 0 |
4 | 2 | 0 |
5 | 2 | 0 |
6 | 0 | 3 |
7 | 0 | 1 |
8 | 4 | 0 |
9 | 0 | 0 |
10 | 0 | 1 |
In a survival study, the number of individuals “at risk” at any given time is defined to be the number of individuals who have entered the study by that time and have not yet experienced an event or censoring immediately before that time. This number over time is called the at-risk process.
You can compute the number of individuals at risk within each group at a
given time using the n_at_risk()
method. Like n_events()
, this
method also returns a DataFrame.
In [9]:
display(surv.n_at_risk([0, 5, 10, 20, 25, 30, 35]))
group | control | treatment |
---|---|---|
time | ||
0 | 0 | 0 |
5 | 14 | 21 |
10 | 8 | 15 |
20 | 2 | 8 |
25 | 0 | 5 |
30 | 0 | 4 |
35 | 0 | 1 |
Plotting the at-risk process¶
You can plot the at-risk process using the plot_at_risk()
method of
a SurvivalData
object.
In [10]:
plt.figure(figsize=(10, 6))
surv.plot_at_risk()
plt.show()
plt.close()

Estimating the survival function with KaplanMeier
¶
Kaplan-Meier estimator¶
The Kaplan-Meier estimator (AKA product limit estimator) is a nonparametric estimator of the survival function of the time-to-event distribution that can be used even in the presence of right-censoring.
The KaplanMeier
class implements the Kaplan-Meier estimator.
Initializing the estimator¶
You can initialize a KaplanMeier
object with no parameters.
In [11]:
# Kaplan-Meier estimator to be used for the leukemia data
km = KaplanMeier()
Now km
is a Kaplan-Meier estimator waiting to be fitted to survival
data. We didn’t pass any parameters to the initializer of the
Kaplan-Meier estimator, but we could have. Printing a KaplanMeier
object shows what initializer parameter values were used for that object
(and default values for parameters that weren’t specified explicitly).
In [12]:
print(km)
KaplanMeier(conf_level=0.95, conf_type='log-log', n_boot=500,
random_state=None, tie_break='discrete', var_type='greenwood')
We’ll use these default parameters.
Fitting the estimator to the leukemia data¶
We can fit our Kaplan-Meier estimator to the leukemia data using the
fit()
method. There are a few ways of doing this, but the easiest is
to pass it an existing SurvivalData
instance.
In [13]:
km.fit(surv)
Out[13]:
KaplanMeier(conf_level=0.95, conf_type='log-log', n_boot=500,
random_state=None, tie_break='discrete', var_type='greenwood')
The other ways to call fit()
are described in the method’s
docstring. Note that fit()
fits the estimator in-place and returns
the estimator itself.
Summarizing the fit¶
Once the estimator is fitted, the summary
property of a
KaplanMeier
object tabulates the survival probability estimates and
thier standard error and confidence intervals for the event times within
each group. It can be printed to display all the information at once.
In [14]:
print(km.summary)
Kaplan-Meier estimator
control
total events censored
21 21 0
time events at risk estimate std. error 95% c.i. lower 95% c.i. upper
1 2 21 0.904762 0.064056 0.670046 0.975294
2 2 19 0.809524 0.085689 0.568905 0.923889
3 1 17 0.761905 0.092943 0.519391 0.893257
4 2 16 0.666667 0.102869 0.425350 0.825044
5 2 14 0.571429 0.107990 0.337977 0.749241
8 4 12 0.380952 0.105971 0.183067 0.577789
11 2 8 0.285714 0.098581 0.116561 0.481820
12 2 6 0.190476 0.085689 0.059482 0.377435
15 1 4 0.142857 0.076360 0.035657 0.321162
17 1 3 0.095238 0.064056 0.016259 0.261250
22 1 2 0.047619 0.046471 0.003324 0.197045
23 1 1 0.000000 NaN NaN NaN
treatment
total events censored
21 9 12
time events at risk estimate std. error 95% c.i. lower 95% c.i. upper
6 3 21 0.857143 0.076360 0.619718 0.951552
7 1 17 0.806723 0.086935 0.563147 0.922809
10 1 15 0.752941 0.096350 0.503200 0.889362
13 1 12 0.690196 0.106815 0.431610 0.849066
16 1 11 0.627451 0.114054 0.367511 0.804912
22 1 7 0.537815 0.128234 0.267779 0.746791
23 1 6 0.448179 0.134591 0.188052 0.680143
Note: The NaNs (not a number) appearing in the summary are caused by the standard error estimates not being defined when the survival function estimate is indentically zero. This is expected behavior.
Visualizing the fit¶
The estimated survival curves for the two groups can be drawn using the
plot()
method of the KaplanMeier
object. By default, censored
times in the sample are indicated by plus signs on the curve.
In [15]:
plt.figure(figsize=(10, 6))
km.plot()
plt.show()
plt.close()

If you prefer R-style confidence intervals (the kind drawn by
plot.survfit
in the survival
package, for example), then you can
set ci_style="lines"
to get similar dashed-line confidence interval
curves.
In [16]:
_, axes = plt.subplots(nrows=1, ncols=2, figsize=(10, 6))
for ax, group, color in zip(axes.flat, surv.group_labels, ("b", "g")):
km.plot(group, ci_style="lines", color=color, ax=ax)
ax.set(title=f"Kaplan-Meier estimator ({group} group)")
plt.tight_layout()
plt.show()
plt.close()

The plot seems to indicate that the patients in the treatment group are more likely to have longer remission times than patients in the control group.
Estimating survival probabilities¶
The predict()
method of a KaplanMeier
object returns a pandas
DataFrame of estimated probabiltiies for surviving past a certain time
for each group.
In [17]:
estimate = km.predict([5, 10, 15, 20, 25])
display(estimate)
group | control | treatment |
---|---|---|
time | ||
5 | 0.571429 | 1.000000 |
10 | 0.380952 | 0.752941 |
15 | 0.142857 | 0.690196 |
20 | 0.095238 | 0.627451 |
25 | 0.000000 | 0.448179 |
Estimating time-to-event distribution quantiles¶
The quantile()
function of a KaplanMeier
object returns a pandas
DataFrame of empirical quantile estimates for the time-to-relapse
distribution
In [18]:
quantiles = km.quantile([0.25, 0.5, 0.75])
display(quantiles)
group | control | treatment |
---|---|---|
prob | ||
0.25 | 4.0 | 13.0 |
0.50 | 8.0 | 23.0 |
0.75 | 12.0 | NaN |
Alternative: the Breslow estimator¶
The Breslow estimator is another nonparametric survival function
estimator, defined as the exponential of the negative of the
Nelson-Aalen cumulative hazard function estimator. It is implemented in
the Breslow
class, and its API is the same as the KaplanMeier
API.
In [19]:
breslow = Breslow()
breslow.fit(surv)
Out[19]:
Breslow(conf_level=0.95, conf_type='log', tie_break='discrete',
var_type='aalen')
In [20]:
print(breslow.summary)
Breslow estimator
control
total events censored
21 21 0
time events at risk estimate std. error 95% c.i. lower 95% c.i. upper
1 2 21 0.909156 0.061226 0.683312 0.976463
2 2 19 0.818320 0.082140 0.585744 0.927595
3 1 17 0.771572 0.089766 0.535382 0.897953
4 2 16 0.680910 0.099488 0.445010 0.833243
5 2 14 0.590266 0.104848 0.360455 0.761574
8 4 12 0.422944 0.103020 0.223432 0.610118
11 2 8 0.329389 0.099135 0.151236 0.520541
12 2 6 0.236018 0.090224 0.088391 0.423449
15 1 4 0.183811 0.083959 0.056501 0.368440
17 1 3 0.131706 0.074475 0.030135 0.309303
22 1 2 0.079884 0.060298 0.010694 0.244792
23 1 1 0.029388 0.036820 0.000845 0.172353
treatment
total events censored
21 9 12
time events at risk estimate std. error 95% c.i. lower 95% c.i. upper
6 3 21 0.866878 0.071499 0.642147 0.954971
7 1 17 0.817356 0.082803 0.582866 0.927417
10 1 15 0.764642 0.092731 0.521681 0.895238
13 1 12 0.703505 0.103518 0.449985 0.856517
16 1 11 0.642371 0.111107 0.385958 0.814031
22 1 7 0.556857 0.124920 0.289195 0.758612
23 1 6 0.471369 0.131732 0.210555 0.695534
In [21]:
plt.figure(figsize=(10, 6))
breslow.plot()
plt.show()
plt.close()

Estimating the cumulative hazard with NelsonAalen
¶
Nelson-Aalen estimator¶
The Nelson-Aalen estimator is a nonparametric estimator of the
cumulative hazard of the time-to-event distribution. The NelsonAalen
class implements this estimator. Its API is nearly identical to the
KaplanMeier
API described above.
Initializing and fitting the estimator¶
You can initialize a NelsonAalen
object with no parameters and call
the fit()
methon on the leukemia data.
In [22]:
na = NelsonAalen()
na.fit(surv)
Out[22]:
NelsonAalen(conf_level=0.95, conf_type='log', tie_break='discrete',
var_type='aalen')
Visualizing the estimated cumulative hazard¶
The estimated cumulative hazard for the two groups can be drawn using
the plot()
method of the NelsonAalen
object. By default,
censored times in the sample are indicated by plus signs on the curve.
In [23]:
plt.figure(figsize=(10, 6))
na.plot()
plt.show()
plt.close()

Channing House Data¶
This is the channing
dataset in the R package
boot. From
the package description:
Channing House is a retirement centre in Palo Alto, California. These data were collected between the opening of the house in 1964 until July 1, 1975. In that time 97 men and 365 women passed through the centre. For each of these, their age on entry and also on leaving or death was recorded. A large number of the observations were censored mainly due to the resident being alive on July 1, 1975 when the data was collected. Over the time of the study 130 women and 46 men died at Channing House. Differences between the survival of the sexes, taking age into account, was one of the primary concerns of this study.”
These data feature left truncation because residents entered Channing House at different ages, and their lifetimes were not observed before entry. This is a biased sampling problem since there are no observations on individuals who died before potentially entering Channing House.
In [1]:
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style="darkgrid", palette="colorblind", color_codes=True)
from survive import datasets
from survive import SurvivalData
from survive import KaplanMeier, NelsonAalen
Loading the Dataset¶
The channing()
function in the survive.datasets
module loads a
pandas.DataFrame
containing the Channing House data. The columns of
this DataFrame
are
sex
- Sex of each resident (male or female).entry
- The resident’s age (in months) on entry to the centre.exit
- The age (in months) of the resident on death, leaving the centre or July 1, 1975 whichever event occurred first.time
- The length of time (in months) that the resident spent at Channing House (this isexit - entry
).status
- Right-censoring indicator. 1 indicates that the resident died at Channing House, 0 indicates that they left the house prior to July 1, 1975 or that they were still alive and living in the centre at that date.
In [2]:
channing = datasets.channing()
channing.head()
Out[2]:
sex | entry | exit | time | status | |
---|---|---|---|---|---|
resident | |||||
0 | male | 782 | 909 | 127 | 1 |
1 | male | 1020 | 1128 | 108 | 1 |
2 | male | 856 | 969 | 113 | 1 |
3 | male | 915 | 957 | 42 | 1 |
4 | male | 863 | 983 | 120 | 1 |
Exploratory Data Analysis¶
We use the Channing House data to create a SurvivalData
object.
In [3]:
surv = SurvivalData(time="exit", entry="entry", status="status", group="sex",
data=channing)
print(surv)
female
798+ 804 804+ 812+ 819+ 821+ 822 824+ 825+ 829+ 830 836+
840 845 848+ 848+ 854+ 857+ 860+ 861 861+ 868 870+ 870+
872+ 873 874+ 875+ 876+ 882+ 883 885 888+ 891+ 891+ 892+
893+ 895 895+ 897 897+ 898+ 899+ 901 904+ 905 905 905+
905+ 905+ 906+ 908 908 911 912+ 912+ 912+ 913+ 914+ 915
916+ 917+ 918+ 919 919+ 922+ 923 924+ 925+ 926 926+ 926+
926+ 927+ 927+ 927+ 928 928+ 929+ 930 930+ 931 932 932+
932+ 932+ 933+ 934 934+ 936 938+ 938+ 938+ 939+ 939+ 940
940+ 941 942+ 943+ 944 944 944+ 944+ 945+ 946+ 947+ 948
948+ 948+ 950+ 950+ 952+ 952+ 953+ 953+ 954 954+ 955+ 955+
955+ 957+ 957+ 958+ 958+ 959 959+ 959+ 960+ 961+ 961+ 962+
963 963+ 964+ 965+ 965+ 966 967+ 969 969 970 970+ 971+
971+ 973+ 973+ 975 975+ 975+ 976 976+ 976+ 976+ 977+ 977+
978 978+ 979+ 979+ 979+ 981+ 982 982 982+ 983 983+ 984+
985+ 985+ 985+ 985+ 986 986+ 987+ 988+ 988+ 989 989+ 989+
990 990 990 990 991 991+ 992 992+ 992+ 993+ 993+ 994
994 994+ 994+ 995 995 995+ 996 996 996 996+ 996+ 997+
998 998+ 999 1000 1000+ 1001 1001+ 1001+ 1003 1003+ 1004 1004+
1005 1005+ 1005+ 1006 1006 1006+ 1006+ 1006+ 1008+ 1008+ 1009+ 1010
1010+ 1010+ 1010+ 1011 1011+ 1011+ 1012 1012 1012+ 1012+ 1012+ 1013
1013+ 1014 1014+ 1014+ 1014+ 1015 1015+ 1015+ 1016+ 1017 1018 1018
1019 1019 1019+ 1019+ 1019+ 1019+ 1020 1020+ 1020+ 1021 1022+ 1023
1023 1023+ 1023+ 1024 1024+ 1027 1028+ 1029 1029+ 1030 1030+ 1031+
1031+ 1032+ 1033 1033+ 1034+ 1035+ 1036+ 1036+ 1037+ 1038+ 1040 1040
1040 1040 1041 1041 1041 1042+ 1042+ 1043 1043+ 1044 1044+ 1047
1047+ 1049+ 1050+ 1051+ 1053+ 1054+ 1054+ 1055+ 1056 1056 1057+ 1059+
1061+ 1062+ 1063 1063+ 1064 1065+ 1068 1068 1070 1071+ 1071+ 1072
1072+ 1073 1073+ 1074 1080+ 1083 1084 1085 1085 1085+ 1086 1088+
1089 1091+ 1093 1097 1102+ 1105+ 1105+ 1109+ 1114+ 1115 1119+ 1122
1131 1132 1134+ 1142 1147+ 1152 1172 1172+ 1186+ 1192 1200 1200
1207+
male
777 781 843+ 866+ 869 872 876 893 894 895+ 898 906+
907 909 911 911+ 914+ 927 932 936+ 940+ 943+ 943+ 945
945+ 948 951+ 956+ 957 957+ 959+ 960+ 966 966+ 969 970+
971 972+ 973+ 977+ 983 984+ 985 989 993 993 996+ 998
1001+ 1002+ 1005+ 1006+ 1009 1012 1012 1012+ 1013+ 1015+ 1016+ 1018+
1022 1023+ 1025 1027+ 1029 1031 1031+ 1031+ 1033 1036 1043 1043+
1044 1044+ 1045+ 1047+ 1053 1055 1058+ 1059 1060 1060+ 1064+ 1070+
1073+ 1080 1085 1093+ 1094 1094 1106+ 1107+ 1118+ 1128 1139 1153+
/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/survive-0.3-py3.7.egg/survive/survival_data.py:183: RuntimeWarning: Ignoring 5 observations where entry >= time.
RuntimeWarning)
The warning here is due to the fact that five of the observations in the data have entry times that are the same or later than the corresponding final times. These observations consequently cannot be used.
In [4]:
# The five observations responsible for the warning above
display(channing.loc[channing.exit <= channing.entry])
sex | entry | exit | time | status | |
---|---|---|---|---|---|
resident | |||||
56 | male | 953 | 953 | 0 | 0 |
351 | female | 957 | 957 | 0 | 0 |
372 | female | 944 | 944 | 0 | 0 |
373 | female | 935 | 935 | 0 | 0 |
433 | female | 959 | 912 | 53 | 1 |
The last line above doesn’t even make sense since the entry time (959) is greater than the exit time (912). The resident’s duration is 53, which suggests that the entry time is likely supposed to be 859 (after all, 859+53=912). Nevertheless, we will ignore this observation.
In [5]:
display(surv.describe)
total | events | censored | |
---|---|---|---|
group | |||
female | 361 | 129 | 232 |
male | 96 | 46 | 50 |
Plotting the At-Risk Process¶
Due to the left-truncation, the risk set size initially increases as residents enter Channing House and then decreases as residents die or are censored.
In [6]:
plt.figure(figsize=(10, 6))
colors = dict(female="b", male="g")
surv.plot_at_risk(colors=colors)
plt.show()
plt.close()

Estimating the survival function¶
The out-of-the-box Kaplan-Meier estimator does a poor job estimating the survival function for the Channing House data because risk set of the male residents is zero very early on before growing. This causes the survival function estimate to be zero for most of the observed times, which is clearly wrong. We will show the problem graphically in this case and then discuss how to fix it.
In [7]:
km = KaplanMeier().fit(surv)
plt.figure(figsize=(10, 6))
km.plot(colors=colors)
plt.show()
plt.close()

One way to address this issue is to condition on survival up to a later time, say 68 or 80 years (816 or 960 months).
In [8]:
km68 = KaplanMeier()
km68.fit(time="exit", entry="entry", status="status", group="sex",
data=channing, min_time=816, warn=False)
Out[8]:
KaplanMeier(conf_level=0.95, conf_type='log-log', n_boot=500,
random_state=None, tie_break='discrete', var_type='greenwood')
In [9]:
_, ax = plt.subplots(nrows=2, ncols=1, figsize=(10, 12))
km68.plot("female", color=colors["female"], ax=ax[0])
ax[0].set(title="Female Channing House residents (68 years or older)")
ax[0].set(xlabel="Age (months)")
km68.plot("male", color=colors["male"], ax=ax[1])
ax[1].set(title="Male Channing House residents (68 years or older)")
ax[1].set(xlabel="Age (months)")
plt.show()
plt.close()

In [10]:
km80 = KaplanMeier()
km80.fit(time="exit", entry="entry", status="status", group="sex",
data=channing, min_time=960)
Out[10]:
KaplanMeier(conf_level=0.95, conf_type='log-log', n_boot=500,
random_state=None, tie_break='discrete', var_type='greenwood')
In [11]:
_, ax = plt.subplots(nrows=2, ncols=1, figsize=(10, 12))
km80.plot("female", color=colors["female"], ax=ax[0])
ax[0].set(title="Female Channing House residents (80 years or older)")
ax[0].set(xlabel="Age (months)")
km80.plot("male", color=colors["male"], ax=ax[1])
ax[1].set(title="Male Channing House residents (80 years or older)")
ax[1].set(xlabel="Age (months)")
plt.show()
plt.close()

Estimating the cumulative hazard¶
In [12]:
na = NelsonAalen(var_type="greenwood")
na.fit(surv)
Out[12]:
NelsonAalen(conf_level=0.95, conf_type='log', tie_break='discrete',
var_type='greenwood')
In [13]:
plt.figure(figsize=(10, 6))
na.plot(colors=colors)
plt.show()
plt.close()

This is a small dataset (42 total observations) separated into two groups with heavy right-censoring in one and none in the other.
This is a slightly larger dataset (462 total observations) separated into two groups with right-censoring and left truncation in both.
Background and Theory¶
This page aims to quickly cover some basic survival analysis.
Note
This page is under construction.
Survival Analysis Setup¶
Suppose we are interested in a particular time-to-event distribution (e.g., time until death, time until disease occurrence or recovery, or time until failure in a mechanical system). The phrase “time-to-event” can mean nearly any positive quantity being measured, not necessarily time.
The time-to-event distribution is completely determined by its survival function \(S(t) = \Pr(X > t)\), where \(X\) is the time-to-event (a positive random variable), and \(t\) is a positive time. The survival function might be of interest itself since it answers questions like, “what is the probability that a cancer patient will survive at least five more years?” or “what is the probability that this machine part won’t break in the next six months?”
Suppose we have \(n\) individuals with independent and identically distributed event \(X_1, \ldots, X_n\) with survival function \(S\). If we could observe \(X_1, \ldots, X_n\), then one obvious candidate for estimating \(S(t)\) is the empirical survival function:
This is just the number of individuals observed to survive past time \(t\) divided by the sample size \(n\). (Here \(\mathbf{1}_A\) denotes the indicator of an event \(A\).) This unbiased, consistent, and asymptotically normal estimator is nevertheless not suitable for many situations of interest in survival analysis. The main problem is that it relies on all the event times \(X_1,\ldots,X_n\) being observable.
Censoring and Truncation¶
It is common in practice that not all the times \(X_1, \ldots, X_n\) are observed. We list some possible examples.
- A clinical trial investigating a disease treatment might end before a patient recovers from the disease, in which case the patient’s true recovery time is not known.
- A patient whose time until death from cancer is being monitored might die from a different disease, so the patient’s death-from-cancer time will not be known.
- In engineering, a reliability experiment might be stopped after a predetermined number of parts fail. A part that is still operational after this time will not have its failure time observed.
In these cases, the true time-to-event is not known, but a lower bound for it is available. This situation is called right-censoring.
Another source of incomplete information is left-truncation (also known as delayed entry), in which an individual’s time-to-event may only be observed if it exceeds a certain time. Some examples:
- Patients for a certain disease might only be observed after the disease has been diagnosed. A patient who died from the disease before diagnosis is unknown to investigators.
- If we are observing the age at death of residents in a retirement home, then we do not observe the ages at death of individuals who died before becoming residents.
- If we are measuring the diameters of particles with a microscope, then only particles large enough to be detected by the microscope will be observed.
There are other types of censoring and truncation, but for now we will focus on right-censoring and left-truncation, the most common variants.
When there is right-censoring or left-truncation, the empirical survival function is not a suitable estimator of the survival function \(S\) because the event times \(X_1, \ldots, X_n\) are not known. A popular alternative is the Kaplan-Meier estimator, which we will define below.
Let us first formalize the sampling situation with right-censoring and left-truncation. Suppose as before that we have \(n\) individuals with with independent and identically distributed times-to-event \(X_1, \ldots, X_n\) with survival function \(S\). Moreover, suppose we observe a sample \((L_1, X_1^\prime, \delta_1), \ldots, (L_n, X_n^\prime, \delta_n)\), where \(L_i\) is the entry time of the \(i\)-th individual, \(X_i^\prime\) is the observed final time for the \(i\)-th individual (so \(X_i^\prime > L_i\)), and \(\delta_i\) is zero or one according to whether \(X_i^\prime < X_i\) (the \(i\)-th individual is censored) or \(X_i^\prime = X_i\) (the \(i\)-th individual’s time-to-event is observed).
The Counting Process Formulation¶
We define two stochastic processes associated with this sample: the event counting process
and the at-risk process
The event counting process counts the number of events that have been observed to occur up to a certain time, and the at-risk process counts the number of individuals who are “at risk” (those who are already being observed but have not yet been censored or experienced the event of interest) at a certain time.
The trajectories (i.e., sample paths) of the event counting process are right-continuous with left limits (commonly called càdlàg). In fact, the event counting process is non-decreasing and piecewise constant with at most \(n\) jumps (the jumps happen at observed event times). Let \(0 < T_1 < T_2 < \cdots\) denote the ordered jump times. The size of the jump at time \(T_j\) is
where for \(t > 0\),
is a limit from the left. In other words, \(\Delta N(T_j)\) is the number of observed events at time \(T_j\). For convenience, also define \(T_0 = 0\).
Stochastic integrals with respect to \(N\) can be easily computed as sums:
The Kaplan-Meier Estimator¶
We will now give a derivation of the Kaplan-Meier survival function estimator. This will be an estimator of the survival function \(S(t) = \Pr(X > t)\), where \(X\) is the time-to-event, based on the right-censored and left-truncated sample \((L_1, X_1^\prime, \delta_1), \ldots, (L_n, X_n^\prime, \delta_n)\) described above.
First, observe that if we have times \(s < t\), then
Thus, it suffices to estimate the conditional probability \(\lambda(s, t) = \Pr(X > t \mid X > s)\) of surviving the time interval \((s, t]\) given survival up to time \(s\).
Todo
Finish this section.
API Reference¶
This page lists all the public classes and functions of Survive.
survive
: Top-Level Module¶
Survival analysis in Python.
Classes¶
Breslow (*[, conf_type, conf_level, …]) |
Breslow nonparametric survival function estimator. |
KaplanMeier (*[, conf_type, conf_level, …]) |
Kaplan-Meier nonparametric survival function estimator. |
NelsonAalen (*[, conf_type, conf_level, …]) |
Nelson-Aalen nonparametric cumulative hazard estimator. |
SurvivalData (time, *[, status, entry, …]) |
Class representing right-censored and left-truncated survival data. |
survive.base
: Base Classes¶
Base classes for survival analysis objects.
Classes¶
base.Fittable |
Abstract mixin class for models implementing a fit() method. |
base.Model |
Abstract base class for survival models. |
base.Predictor |
Abstract mixin class for models implementing a predict() method. |
base.Summary (model) |
Base class for summaries of survival models (intended for subclassing). |
survive.datasets
: Survival Datasets¶
Functions for loading some survival datasets.
Functions¶
datasets.channing () |
Load the Channing House dataset. |
datasets.leukemia () |
Load the leukemia dataset. |
survive.nonparametric
: Nonparametric Estimation¶
Nonparametric estimators.
Classes¶
nonparametric.NonparametricEstimator |
Abstract base class for nonparametric estimators. |
nonparametric.NonparametricSurvival |
Abstract base class for nonparametric survival function estimators. |
nonparametric.NonparametricEstimatorSummary (model) |
Summaries for nonparametric estimators. |
survive.utils
: Utility Functions¶
Miscellaneous utility functions.
Functions¶
utils.add_legend (ax[, legend_kwargs]) |
Add a legend to a plot. |
utils.check_bool (tf, *[, allow_none]) |
Validate boolean function arguments. |
utils.check_colors (colors, *[, n_colors, …]) |
Validate colors for plotting. |
utils.check_data_1d (data, *[, numeric, …]) |
Preprocess and validate a one-dimensional array. |
utils.check_data_2d (data, *[, numeric, …]) |
Preprocess and validate a two-dimensional array (i.e., a matrix). |
utils.check_float (num, *[, positive, …]) |
Validate floating-point number function arguments. |
utils.check_int (num, *[, minimum, maximum, …]) |
Validate integer function arguments. |
utils.check_random_state (random_state) |
Validate random number generators and seeds. |