Welcome to the healthcare.ai documentation!¶
The purpose of this Python package is to streamline healthcare machine learning. It does this by including functionality specific to healthcare, as well as by simplifying the workflow of creating and deploying models.
Do I want this package over the R package?¶
Choose this Python package if most of the following apply:
- You’re familiar with Python
- You’re familiar with machine learning
- You’re working with 5M+ rows (which is rare)
Otherwise, the R package is recommended, as it currently has more features and R is more newbie-friendly.
If you’re excited about Python, see these links to start:
Getting started with healthcare.ai¶
What can you do with this package?¶
- Fill in missing data via imputation
- Create and compare models based on your data
- Save a model to produce daily predictions
- Write predictions back to a database
- Learn what factor drives each prediction
How to install¶
Windows
- If you haven’t, install 64-bit Python 3.5 via the Anaconda distribution
- Open Spyder (which was installed with Anaconda)
- Run
conda install pyodbc
- To install the latest release, run
pip install https://github.com/HealthCatalystSLC/healthcareai-py/zipball/v0.1.7-beta
- If you know what you’re doing, and want the bleeding-edge version, run
pip install https://github.com/HealthCatalystSLC/healthcareai-py/zipball/master
Non-Windows
Frequently asked questions¶
Who is this project for?¶
While data scientists in healthcare will likely find this project valuable, the target audience for healthcareai is BI developers, data architects, and SQL developers who would love to create appropriate and accurate models with healthcare data. While existing machine learning packages are certainly irreplaceable, we think there is a set of data problems specific to healthcare that warrants new tools.
How does healthcareai focus on healthcare?¶
healthcareai differs from other machine learning packages in that it focuses on data issues specific to healthcare. This means we pay attention to longitudinal questions, offer an easy way to do risk-adjusted comparisons, and provide easy connections and deployment to databases.
Who started this project?¶
This project began in the data science group at Health Catalyst, a Salt Lake City-based company focused on improving healthcare outcomes.
Why was it open-sourced?¶
We believe that everyone benefits when healthcare is made more efficient and outcomes are improved. Machine learning is still surprisingly new to healthcare, and we want to help healthcare move quickly down the machine learning adoption path. We believe that making helpful, simple tools widely available is one small way to help healthcare organizations transform their data into actionable insight that can be used to improve outcomes.
How can I contact the authors?¶
We’d love to hear from you! We welcome complaints, suggestions, and contributions.
Twitter: @levithatcher Email: levi.thatcher@healthcatalyst.com
Hints and tips¶
Gathering the data¶
If you have interesting data in a CSV file, or even spread across several databases on a single server, you are in good shape. While it’s easiest to pull data into the package via a single table, one can also use joins to gather data from separate tables or databases. What’s most important is the following:
- You have a column you’re excited about predicting and some data that might be relevant
- If you’re predicting a binary outcome (ie, 0 or 1), you have to convert the column values to Y or N (see the sketch just below).
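If your source column arrived as 0/1, here’s a minimal pandas sketch of that conversion, using the ThirtyDayReadmitFLG column from the package’s sample data as an example:
import pandas as pd

# Hypothetical 0/1 version of the outcome column
df = pd.DataFrame({'ThirtyDayReadmitFLG': [0, 1, 0, 1]})

# Convert to the Y/N values the package expects for classification
df['ThirtyDayReadmitFLG'] = df['ThirtyDayReadmitFLG'].map({0: 'N', 1: 'Y'})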
Pre-processing¶
It’s almost always helpful to do some feature engineering before creating a model. Here are some practical examples of that:
- If you think the thing you’re predicting might have a seasonal pattern, you could convert a date-time column into columns representing DayOfWeek, DayOfMonth, WeekOfYear, etc. (see the sketch after this list)
- If you have rows with both a latitude and longitude, it may be beneficial to add a zip code column (for example)
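For instance, here’s a minimal pandas sketch of the date-time idea above; AdmitDTS is an illustrative column name, not one from the package’s sample data:
import pandas as pd

# Hypothetical admissions data with a datetime column
df = pd.DataFrame({'AdmitDTS': pd.to_datetime(
    ['2016-01-04', '2016-06-15', '2016-11-30'])})

# Break the datetime into columns that can capture seasonality
df['DayOfWeek'] = df['AdmitDTS'].dt.dayofweek  # 0=Monday ... 6=Sunday
df['DayOfMonth'] = df['AdmitDTS'].dt.day
df['MonthOfYear'] = df['AdmitDTS'].dt.month

# The raw datetime usually isn't useful to the model once these exist
df.drop('AdmitDTS', axis=1, inplace=True)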
Model building tips¶
- Start small. You can often get a good idea of model performance by starting with 10k rows instead of 1M.
- Don’t throw out rows with missing values. We’ll help you experiment with imputation, which may improve the model’s performance.
- Focus on new features. Rather than finding more rows of the same columns, finding better columns (ie, features) will give better results.
Developing and comparing models¶
What is DevelopSupervisedModel?¶
- This class lets one create and compare custom models on diverse datasets.
- One can do both classification (ie, predict Y/N) as well as regression (ie, predict a numeric field).
- To jump straight to an example notebook, see here
Am I ready for model creation?¶
Maybe. It’ll help if you follow these guidelines:
- Don’t use 0 or 1 for the predicted (dependent) variable when doing classification. Use Y/N instead. The IIF function in T-SQL may help here.
- Don’t pull in test data in this step. In other words, just pull in those rows where the target (ie, the predicted column) already has a value (a pandas sketch follows below).
Of course, feature engineering is always a good idea.
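If you’re filtering in pandas rather than T-SQL, here’s a minimal sketch of pulling only rows that already have a target, using the package’s sample CSV:
import pandas as pd

df = pd.read_csv('healthcareai/tests/fixtures/HCPyDiabetesClinical.csv',
                 na_values=['None'])

# Keep only rows whose target is already known for model development
df = df[df['ThirtyDayReadmitFLG'].notnull()]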
Step 1: Pull in the data¶
For SQL:
import pandas as pd
import pyodbc

cnxn = pyodbc.connect("""SERVER=localhost;
                         DRIVER={SQL Server Native Client 11.0};
                         Trusted_Connection=yes;
                         autocommit=True""")

df = pd.read_sql(
    sql="""SELECT *
           FROM [SAM].[dbo].[HCPyDiabetesClinical]""",
    con=cnxn)

# Handle missing data (if needed)
df.replace(['None'], [None], inplace=True)
For CSV:
df = pd.read_csv('healthcareai/tests/fixtures/HCPyDiabetesClinical.csv',
                 na_values=['None'])
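Either way, it’s worth a quick sanity check on what came back before moving on:
# Peek at the data and verify the column types look right
print(df.head())
print(df.dtypes)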
Step 2: Set your data-prep parameters¶
The DevelopSupervisedModel class cleans and prepares the data before model creation.
Return: an object.
- Arguments:
- modeltype: a string. This will either be ‘classification’ or ‘regression’.
- df: a data frame. The data your model will be based on.
- predictedcol: a string. Name of variable (or column) that you want to predict.
- graincol: a string, defaults to None. Name of possible GrainID column in your dataset. If specified, this column will be removed, as it won’t help the algorithm.
- impute: a boolean. Whether to impute by replacing NULLs with column mean (for numeric columns) or column mode (for categorical columns).
- debug: a boolean, defaults to False. If True, console output when comparing models is verbose for easier debugging.
Example code:
o = DevelopSupervisedModel(modeltype='classification',
                           df=df,
                           predictedcol='ThirtyDayReadmitFLG',
                           graincol='PatientEncounterID',  # OPTIONAL
                           impute=True,
                           debug=False)
Step 3: Create and compare models¶
Example code:
# Run the linear model
o.linear(cores=1)
# Run the random forest model
o.random_forest(cores=1)
Go further using utility methods¶
The plot_rffeature_importance method plots the input columns in order of importance to the model.
Return: a plot.
- Arguments:
- save: a boolean, defaults to False. If True, the plot is saved to the location displayed in the console.
Example code:
# Look at the feature importance rankings
o.plot_rffeature_importance(save=False)
The plot_roc method plots the ROC curve (with its AUC), for easier model comparison.
Return: a plot.
- Arguments:
- save: a boolean, defaults to False. If True, the plot is saved to the location displayed in the console.
- debug: a boolean. If True, console output is verbose for easier debugging.
Example code:
# Create ROC plot to compare the two models
o.plot_roc(debug=False,
           save=False)
Full example code¶
Note: you can run this out-of-the-box from the healthcareai-py folder (the SQL snippet is commented out, so the CSV sample data is used):
from healthcareai import DevelopSupervisedModel
import pandas as pd
import time


def main():

    t0 = time.time()

    # CSV snippet for reading data into dataframe
    df = pd.read_csv('healthcareai/tests/fixtures/HCPyDiabetesClinical.csv',
                     na_values=['None'])

    # SQL snippet for reading data into dataframe
    # import pyodbc
    # cnxn = pyodbc.connect("""SERVER=localhost;
    #                          DRIVER={SQL Server Native Client 11.0};
    #                          Trusted_Connection=yes;
    #                          autocommit=True""")
    #
    # df = pd.read_sql(
    #     sql="""SELECT *
    #            FROM [SAM].[dbo].[HCPyDiabetesClinical]
    #            -- In this step, just grab rows that have a target
    #            WHERE ThirtyDayReadmitFLG is not null""",
    #     con=cnxn)
    #
    # # Set None string to be None type
    # df.replace(['None'], [None], inplace=True)

    # Look at data that's been pulled in
    print(df.head())
    print(df.dtypes)

    # Drop columns that won't help machine learning
    df.drop(['PatientID', 'InTestWindowFLG'], axis=1, inplace=True)

    # Step 1: compare two models
    o = DevelopSupervisedModel(modeltype='classification',
                               df=df,
                               predictedcol='ThirtyDayReadmitFLG',
                               graincol='PatientEncounterID',  # OPTIONAL
                               impute=True,
                               debug=False)

    # Run the linear model
    o.linear(cores=1)

    # Run the random forest model
    o.random_forest(cores=1,
                    tune=True)

    # Look at the RF feature importance rankings
    o.plot_rffeature_importance(save=False)

    # Create ROC plot to compare the two models
    o.plot_roc(debug=False,
               save=False)

    print('\nTime:\n', time.time() - t0)


if __name__ == "__main__":
    main()
Deploying and saving a model¶
What is DeploySupervisedModel?¶
- This class lets one save a model (for recurrent use) and push predictions to a database.
- One can do both classification (ie, predict Y/N) as well as regression (ie, predict a numeric field).
Am I ready for model deployment?¶
Only if you’ve already completed these steps:
- You’ve found a model that works well on your data
- You’ve created a column called InTestWindowFLG (or something similar), where ‘Y’ denotes rows that need a prediction and ‘N’ denotes rows used to train the model (a pandas sketch follows this list).
- You’ve created the SQL table structure to receive predictions
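If you’re building that window column in pandas, here’s a minimal sketch, assuming rows that don’t yet have a target value are the ones needing predictions (LDLNBR matches the regression example below):
import numpy as np
import pandas as pd

# Hypothetical frame: rows with no target value yet are the ones to score
df = pd.DataFrame({'LDLNBR': [120.0, None, 98.0, None]})
df['InTestWindowFLG'] = np.where(df['LDLNBR'].isnull(), 'Y', 'N')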
For classification predictions:
CREATE TABLE [SAM].[dbo].[HCPyDeployClassificationBASE] (
    [BindingID] [int],
    [BindingNM] [varchar] (255),
    [LastLoadDTS] [datetime2] (7),
    [PatientEncounterID] [decimal] (38, 0), --< change to your grain col
    [PredictedProbNBR] [decimal] (38, 2),
    [Factor1TXT] [varchar] (255),
    [Factor2TXT] [varchar] (255),
    [Factor3TXT] [varchar] (255))
For regression predictions:
CREATE TABLE [SAM].[dbo].[HCPyDeployRegressionBASE] (
    [BindingID] [int],
    [BindingNM] [varchar] (255),
    [LastLoadDTS] [datetime2] (7),
    [PatientEncounterID] [decimal] (38, 0), --< change to your grain col
    [PredictedValueNBR] [decimal] (38, 2),
    [Factor1TXT] [varchar] (255),
    [Factor2TXT] [varchar] (255),
    [Factor3TXT] [varchar] (255))
Step 1: Pull in the data¶
For SQL:
import pandas as pd
import pyodbc

cnxn = pyodbc.connect("""SERVER=localhost;
                         DRIVER={SQL Server Native Client 11.0};
                         Trusted_Connection=yes;
                         autocommit=True""")

df = pd.read_sql(
    sql="""SELECT *
           FROM [SAM].[dbo].[HCPyDiabetesClinical]""",
    con=cnxn)

# Handle missing data (if needed)
df.replace(['None'], [None], inplace=True)
For CSV:
df = pd.read_csv('healthcareai/tests/fixtures/HCPyDiabetesClinical.csv',
                 na_values=['None'])
Step 2: Set your data-prep parameters¶
The DeploySupervisedModel class cleans and prepares the data prior to model creation.
Return: an object.
- Arguments:
- modeltype: a string. This will either be ‘classification’ or ‘regression’.
- df: a data frame. The data your model will be based on.
- predictedcol: a string. Name of variable (or column) that you want to predict.
- graincol: a string, defaults to None. Name of possible GrainID column in your dataset. If specified, this column will be removed, as it won’t help the algorithm.
- impute: a boolean. Whether to impute by replacing NULLs with column mean (for numeric columns) or column mode (for categorical columns).
- debug: a boolean, defaults to False. If True, console output when comparing models is verbose for easier debugging.
- windowcol: a string. Which column in the dataset denotes which rows are test (‘Y’) or training (‘N’).
Example code:
p = DeploySupervisedModel(modeltype='regression',
                          df=df,
                          graincol='PatientEncounterID',
                          windowcol='InTestWindowFLG',
                          predictedcol='LDLNBR',
                          impute=True,
                          debug=False)
Step 3: Create and save the model¶
The deploy method creates the model and makes predictions that are pushed to a database.
Return: an object.
- Arguments:
- method: a string. If you choose random forest, use ‘rf’. If you choose to deploy the linear model, use ‘linear’.
- cores: an integer. Denotes how many of your processors to use.
- server: a string. Which server are you pushing predictions to?
- dest_db_schema_table: a string. Which database.schema.table are you pushing predictions to?
- trees: an integer, defaults to 200. Use only if working with random forest. This denotes the number of trees in the forest.
- use_saved_model: a boolean, defaults to False. If True, the model saved on a previous run is used instead of training a new one (see the example below).
- debug: a boolean, defaults to False. If True, console output when comparing models is verbose for easier debugging.
Example code:
p.deploy(method='rf',
         cores=2,
         server='localhost',
         dest_db_schema_table='[SAM].[dbo].[HCPyDeployRegressionBASE]',
         use_saved_model=False,
         trees=200,
         debug=False)
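After deploy() runs, you can verify what was written; here’s a minimal sketch, assuming the same local server and destination table used above:
import pandas as pd
import pyodbc

cnxn = pyodbc.connect("""SERVER=localhost;
                         DRIVER={SQL Server Native Client 11.0};
                         Trusted_Connection=yes;
                         autocommit=True""")

# Peek at the most recent predictions pushed to the destination table
preds = pd.read_sql(
    sql="""SELECT TOP 10 *
           FROM [SAM].[dbo].[HCPyDeployRegressionBASE]
           ORDER BY LastLoadDTS DESC""",
    con=cnxn)
print(preds)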
Full example code¶
from healthcareai import DeploySupervisedModel
import pandas as pd
import time


def main():

    t0 = time.time()

    # Load in data
    # CSV snippet for reading data into dataframe
    df = pd.read_csv('healthcareai/tests/fixtures/HCPyDiabetesClinical.csv',
                     na_values=['None'])

    # SQL snippet for reading data into dataframe
    # import pyodbc
    # cnxn = pyodbc.connect("""SERVER=localhost;
    #                          DRIVER={SQL Server Native Client 11.0};
    #                          Trusted_Connection=yes;
    #                          autocommit=True""")
    #
    # df = pd.read_sql(
    #     sql="""SELECT *
    #            FROM [SAM].[dbo].[HCPyDiabetesClinical]""",
    #     con=cnxn)
    #
    # # Set None string to be None type
    # df.replace(['None'], [None], inplace=True)

    # Look at data that's been pulled in
    print(df.head())
    print(df.dtypes)

    # Drop columns that won't help machine learning
    df.drop('PatientID', axis=1, inplace=True)

    p = DeploySupervisedModel(modeltype='regression',
                              df=df,
                              graincol='PatientEncounterID',
                              windowcol='InTestWindowFLG',
                              predictedcol='LDLNBR',
                              impute=True,
                              debug=False)

    p.deploy(method='rf',
             cores=2,
             server='localhost',
             dest_db_schema_table='[SAM].[dbo].[HCPyDeployRegressionBASE]',
             use_saved_model=False,
             trees=200,
             debug=False)

    print('\nTime:\n', time.time() - t0)


if __name__ == "__main__":
    main()