Welcome to HoloClean’s documentation!

HoloClean
Noisy and erroneous data is a major bottleneck in analytics. Data cleaning and repairing account for about 60% of the work of data scientists. To address this bottleneck, we recently introduced HoloClean, a semi-automated data repairing framework that relies on statistical learning and inference to repair errors in structured data. In HoloClean, we build upon the paradigm of weak supervision and demonstrate how to leverage diverse signals, including user-defined heuristic rules (such as generalized data integrity constraints) to repair erroneous data.
HoloClean object
    class HoloClean()
    class Session("Session", holo_obj)
Ingesting the input file
    session.ingest_dataset(dataset)
Error detection
In the HoloClean pipeline, the user chooses how to separate the clean cells from the "don’t know" cells. See the tutorials for a more in-depth explanation.
    ErrorDetectors(session.Denial_constraints, holo_obj.dataengine, holo_obj.spark_session, session.dataset)
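To make the clean / "don’t know" split concrete, here is a minimal standalone sketch in plain Python. It does not call the HoloClean API: the toy table, the `violates` constraint (same zip code but different city), and the `split_cells` helper are all hypothetical names chosen for illustration. Every cell of a row that takes part in a violation is treated as "don’t know"; all other cells are treated as clean.

```python
from itertools import combinations

# Toy table (assumption): each row maps attribute name -> value.
rows = [
    {"zip": "60601", "city": "Chicago"},
    {"zip": "60601", "city": "Chicago"},
    {"zip": "60601", "city": "Cicago"},   # likely typo
    {"zip": "10001", "city": "New York"},
]


def violates(t1, t2):
    """Hypothetical denial constraint: same zip with different city is forbidden."""
    return t1["zip"] == t2["zip"] and t1["city"] != t2["city"]


def split_cells(rows, attrs=("zip", "city")):
    """Return (clean, dont_know) as sets of (row_index, attribute) cells."""
    dont_know = set()
    # Check every unordered pair of rows against the constraint.
    for (i, t1), (j, t2) in combinations(enumerate(rows), 2):
        if violates(t1, t2):
            # Any cell mentioned by the constraint could be the wrong one.
            for attr in attrs:
                dont_know.add((i, attr))
                dont_know.add((j, attr))
    all_cells = {(i, a) for i, row in enumerate(rows) for a in row}
    return all_cells - dont_know, dont_know


clean, dont_know = split_cells(rows)
```

In this toy table, rows 0 to 2 share a zip code but disagree on the city, so all of their cells land in `dont_know`, while both cells of row 3 stay in `clean`. The pairwise scan is quadratic in the number of rows and is only meant to show the idea.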
Featurization
In the HoloClean pipeline, the user chooses which signals to use to train the model.
Domain pruning
HoloClean lets the user prune the active domain by passing a pruning threshold.
    session.ds_domain_pruning(pruning_threshold)
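One plausible reading of threshold-based domain pruning is sketched below in plain Python. It is an illustration, not the code behind `ds_domain_pruning`: the `prune_domain` helper and the toy table are hypothetical, and the rule used here (keep a candidate value if it co-occurs with another value in the same row at a relative frequency of at least the threshold) is an assumption about how the threshold is applied.

```python
from collections import Counter

# Toy table (assumption): each row maps attribute name -> value.
rows = [
    {"zip": "60601", "city": "Chicago"},
    {"zip": "60601", "city": "Chicago"},
    {"zip": "60601", "city": "Cicago"},
    {"zip": "10001", "city": "New York"},
]


def prune_domain(rows, row_idx, attr, threshold):
    """Return the candidate values kept for the cell (row_idx, attr).

    A value v is kept when, for some other attribute value w in the same
    row, the fraction of rows containing w that also contain v is at least
    `threshold`. The cell's current value is always kept.
    """
    row = rows[row_idx]
    candidates = {row[attr]}
    for other, w in row.items():
        if other == attr:
            continue
        # Values of `attr` seen in rows that share the value w.
        matching = [r[attr] for r in rows if r[other] == w]
        for v, n in Counter(matching).items():
            if n / len(matching) >= threshold:
                candidates.add(v)
    return candidates


loose = prune_domain(rows, 2, "city", threshold=0.5)
strict = prune_domain(rows, 2, "city", threshold=0.9)
```

With a threshold of 0.5, the misspelled cell in row 2 keeps both its current value and "Chicago" as candidates. With a threshold of 0.9, only the current value survives. A higher threshold therefore gives smaller domains and less work for the model, at the risk of pruning away the correct repair.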
Signals
    SignalInit(session.Denial_constraints, holo_obj.dataengine, session.dataset)
    SignalCooccur(session.Denial_constraints, holo_obj.dataengine, session.dataset)
    SignalDC(session.Denial_constraints, holo_obj.dataengine, session.dataset, holo_obj.spark_session)
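The three signal names suggest features based on a cell's initial value, on value co-occurrence, and on denial constraints. The sketch below shows what such features could look like for one candidate repair. It is a guess at the general idea in plain Python, not the behaviour of `SignalInit`, `SignalCooccur`, or `SignalDC`: the `featurize` helper, the `violates` constraint, and the exact feature definitions are assumptions.

```python
# Toy table (assumption): each row maps attribute name -> value.
rows = [
    {"zip": "60601", "city": "Chicago"},
    {"zip": "60601", "city": "Chicago"},
    {"zip": "60601", "city": "Cicago"},
    {"zip": "10001", "city": "New York"},
]


def violates(t1, t2):
    """Hypothetical denial constraint: same zip with different city is forbidden."""
    return t1["zip"] == t2["zip"] and t1["city"] != t2["city"]


def featurize(rows, row_idx, attr, candidate):
    """Compute three illustrative features for assigning `candidate` to a cell."""
    row = rows[row_idx]
    others = [a for a in row if a != attr]

    # Initial-value feature: 1 if the candidate equals the observed value.
    init = 1.0 if row[attr] == candidate else 0.0

    # Co-occurrence feature: for each other attribute, the fraction of rows
    # sharing this row's value that also carry the candidate, then averaged.
    cooccur = 0.0
    for a in others:
        matching = [r for r in rows if r[a] == row[a]]
        cooccur += sum(r[attr] == candidate for r in matching) / len(matching)
    cooccur /= len(others)

    # Constraint feature: how many other rows would conflict with this row
    # if the candidate were written into the cell.
    repaired = {**row, attr: candidate}
    dc_violations = sum(
        violates(repaired, r) for i, r in enumerate(rows) if i != row_idx
    )
    return {"init": init, "cooccur": cooccur, "dc_violations": dc_violations}


keep = featurize(rows, 2, "city", "Cicago")
fix = featurize(rows, 2, "city", "Chicago")
```

For the misspelled cell, keeping "Cicago" scores high on the initial-value feature but creates two constraint violations, whereas "Chicago" co-occurs more often with the zip code and creates none. A learned model can then weigh these signals against each other.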
Learning
Currently we provide one basic model in PyTorch: logistic regression. See the tutorials for a more in-depth explanation.
    SoftMax(holo_obj.dataengine, session.dataset, holo_obj.spark_session, session.X_training)
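To illustrate the kind of model the `SoftMax` name points to, here is a minimal softmax (multinomial logistic) regression over candidate values. Note the substitution: the source model is written in PyTorch, while this sketch uses only the Python standard library so that it runs anywhere. The `train` and `predict` helpers, the two-feature layout (co-occurrence frequency, number of constraint violations), and the training data are all made up for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict(w, feats):
    """Probability of each candidate, given one feature vector per candidate."""
    return softmax([sum(wi * fi for wi, fi in zip(w, f)) for f in feats])


def train(examples, n_features, lr=0.5, epochs=200):
    """Fit shared weights by stochastic gradient ascent on the log-likelihood.

    `examples` is a list of (candidate_feature_vectors, correct_index) pairs.
    """
    w = [0.0] * n_features
    for _ in range(epochs):
        for feats, y in examples:
            probs = predict(w, feats)
            for j in range(n_features):
                # Gradient: observed feature minus its expectation under the model.
                expected = sum(p * f[j] for p, f in zip(probs, feats))
                w[j] += lr * (feats[y][j] - expected)
    return w


# Made-up training cells: features are [co-occurrence, constraint violations].
examples = [
    ([[0.9, 0.0], [0.1, 2.0]], 0),
    ([[0.2, 1.0], [0.8, 0.0]], 1),
]
w = train(examples, n_features=2)

# Score two candidates for an uncertain cell; the second one co-occurs more
# often and violates no constraint.
probs = predict(w, [[0.33, 2.0], [0.67, 0.0]])
```

After training, the weight on co-occurrence is positive and the weight on violations is negative, so the second candidate receives the higher probability. In a real pipeline the training examples would come from the cells marked clean, and the prediction would be made for the "don’t know" cells.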