Welcome to DL Data Validation Toolset’s documentation!¶
The DL Data Validation Toolset is a framework built around a series of tools meant to help establish if a generated HDF5 file for the DL in HEP effort contains valid data. Each test is meant to be run as a unit test on a generated file and establish what the file contains along with problems that might arise due to using this file for deep learning.
Goal¶
The goal of this project is to create comprehensive reports on data produced external to the deep learning framework. Reports created should contain information on the validity of using these files in Deep Learning Frameworks as well as detailed information on the contents of the files and diagnostically relevant calculations.
Part of the diagnostic tools should be some visualization of the contents of the files.
Installation¶
Prereqs¶
This framework depends heavily on HDF5. Users should have the HDF5 libraries with compression back installed.
For instance, on Ubuntu, this can be accomplished with
sudo apt-get install libhdf5-serial-dev
Similarly, on RHEL, this is done with dev
being replaced by devel
.
Installation with Pip¶
For non-development usage, one may install with:
pip install <options> git+https://github.com/HEP-DL/dl_data_validation_toolset
Where <options> typically contain --user
, --upgrade
or both.
Installation with dist-utils¶
For development usage, one may install with:
git clone https://github.com/HEP-DL/dl_data_validation_toolset
cd dl_data_validation_toolset
make install
Usage¶
Command Line Utility¶
The primary interface is the command line interface.
By default, the command line utility can be called with
generate_report
This will look for data in the data
subdirectory of the current working directory and place output in the results
subdirectory of the current working directory.
The library can be configured with YAML
by specifying the --config
flag.
An example configuration YAML looks like:
results_path: /home/my_user/results
scan_paths: [/home/my_user/dl_data/, /data/]
tar: False
Output¶
The command link util creates a tarball of the HTML output of the report generator. This is placed in the results path and is named by current time stamp.
If one untars the ball and opens the index.html
file, the summary page contains links to file reports, etc...
Note that in order for the page to appear correctly, internet connectivity is required.
Python API¶
Refer to the dl_data_validation_toolset page.
Development¶
This package is very much a work in progress.
Test Driven Development¶
Unit tests are kept in the tests directory. These are unit tests for developing code. This is not to be confused with data tests, though the code pattern for both resemble each other.
These can be run with the command:
make test
Bear in mind that flake8 is used to lint the code.
Documentation¶
Documentation can be built with:
make docs
Inline documentation follows Sphinx guidelines. An example can be found here.
Data Tests¶
The data validation tests can be found in dl_data_validation_toolset/data_tests.
To create a new data validation test or series of tests, simply copy one of the examples (such as consistency). Once your test is ready, please modify __test_names__ to match the filename of your new test, minus the file extension.