Welcome to ML-generalization’s documentation!

This framework was developed as part of the Machine Learning lecture 184.702 at TU Wien.

Installation

The software has been developed using Ubuntu 16.04.

Requirements

Make sure to install these requirements first:

sudo apt install python-pip python-dev build-essential
sudo pip install --upgrade pip
sudo pip install --upgrade virtualenv
pip install pipenv

Download the software

Clone this repository by executing:

git clone https://github.com/tempse/ML-generalization.git

Install dependencies

Change into the downloaded repository folder (~/ML-generalization/ if you cloned it into your home directory) and install all dependencies via pipenv:

cd ~/ML-generalization
pipenv install

This automatically creates a virtual environment and installs all dependencies into it.

Run commands in the created virtual environment

There are two ways to execute commands from within the newly created virtual environment:

  1. Activate the environment by running:

    pipenv shell

    (For this, you have to be in the same folder as the project’s Pipfile.)

  2. Invoke shell commands without explicitly activating the environment:

    pipenv run <command>

    (Example: pipenv run python generalization.py or pipenv run pytest -v)

Running the tests

To run the automated tests, execute:

pytest -v --cov=generalization

(If you did not activate the virtual environment with pipenv shell, prepend pipenv run to the above command.)

Usage

In order to execute the program, run:

python generalization.py

(If you did not activate the virtual environment with pipenv shell, prepend pipenv run to the above command.)

For the most part, the program is controlled by the three configuration files located in config/.

Configure run parameters in run_params.conf

The default section is called [run_params]. If not otherwise specified, the parameters contained in it are loaded upon program execution.

To use a different set of run parameters, define a custom section and pass its name to the program via the terminal argument -run_mode <my_parameter_set>. The parameters of that section are then loaded instead of the default ones.

Required arguments for run_params.conf:

  • frac_train_sample: fractional size of training sample (float)
  • num_test_samples: number of test samples (integer)
  • num_CV_folds: number of cross-validation folds (integer)
  • do_optimize_params: specify whether automatic parameter optimization should be performed instead of using the model parameters defined in model_params.conf
  • n_iter: number of parameter settings that are sampled during hyperparameter optimization via Bayesian Optimization
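
As a minimal sketch of how the default section and a custom section could sit side by side, the following example reads such a file with Python's standard configparser module. The section name quick_test, all values, the boolean type of do_optimize_params and the helper load_run_params are assumptions made for illustration; the program's own parsing code may differ.

```python
import configparser

# Hypothetical contents of config/run_params.conf; all values are examples.
EXAMPLE_CONF = """
[run_params]
frac_train_sample = 0.8
num_test_samples = 10
num_CV_folds = 5
do_optimize_params = False
n_iter = 20

[quick_test]
frac_train_sample = 0.5
num_test_samples = 2
num_CV_folds = 3
do_optimize_params = False
n_iter = 5
"""


def load_run_params(conf_text, run_mode="run_params"):
    """Return the parameters of one section as a typed dictionary."""
    parser = configparser.ConfigParser()
    parser.optionxform = str  # keep key case, e.g. "num_CV_folds"
    parser.read_string(conf_text)
    section = parser[run_mode]
    return {
        "frac_train_sample": section.getfloat("frac_train_sample"),
        "num_test_samples": section.getint("num_test_samples"),
        "num_CV_folds": section.getint("num_CV_folds"),
        # Assumption: this switch is stored as a boolean value.
        "do_optimize_params": section.getboolean("do_optimize_params"),
        "n_iter": section.getint("n_iter"),
    }


default_params = load_run_params(EXAMPLE_CONF)
custom_params = load_run_params(EXAMPLE_CONF, run_mode="quick_test")
print(default_params["num_CV_folds"], custom_params["num_CV_folds"])
```

In this sketch, selecting the quick_test section plays the role of passing -run_mode quick_test on the command line.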

Configure data parameters in data_params.conf

In this configuration file, any number of parameter sections can be defined. All datasets that are defined here are processed by the program.

Required arguments for data_params.conf

  • data_identifier: a string that unambiguously identifies the dataset (string)
  • data_path: relative or absolute path to the data file (string)
  • data_read_func: name of a Pandas method to read the file (string)
  • data_target_col: name of the column holding the true class label (string)
  • data_target_positive_label: value of the positive class label
  • data_target_negative_label: value of the negative class label
  • data_preprocessing: list of preprocessing methods that are applied to the data in the given order
    (currently implemented functions: standard_scale, parse_object_columns, fill_numerical, fill_categorical, rm_correlated, rm_low_variance)

Optional arguments for data_params.conf

Any argument accepted by the function named in data_read_func can be specified here.

(Example: For the Pandas method read_csv, valid options are “sep”, “header”, etc.)
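
The split between required and optional arguments can be illustrated with a short sketch. It assumes that every key that is not one of the required ones is forwarded as a keyword argument to the Pandas reader; the section name my_dataset, the file path, all values and the helper functions are hypothetical and not taken from the project's source code.

```python
import ast
import configparser

# Keys the documentation lists as required for every dataset section.
REQUIRED_KEYS = {
    "data_identifier", "data_path", "data_read_func", "data_target_col",
    "data_target_positive_label", "data_target_negative_label",
    "data_preprocessing",
}

# Hypothetical section; the dataset name, path and values are made up.
EXAMPLE_CONF = """
[my_dataset]
data_identifier = my_dataset
data_path = data/my_dataset.csv
data_read_func = read_csv
data_target_col = label
data_target_positive_label = 1
data_target_negative_label = 0
data_preprocessing = fill_numerical, standard_scale
sep = ;
header = 0
"""


def convert(value):
    """Turn '0' into 0 and so on; keep anything else as a plain string."""
    try:
        return ast.literal_eval(value)
    except (ValueError, SyntaxError):
        return value


def split_section(section):
    """Split a section into required settings and reader keyword arguments."""
    required = {k: v for k, v in section.items() if k in REQUIRED_KEYS}
    missing = REQUIRED_KEYS - set(required)
    if missing:
        raise KeyError(f"missing required keys: {sorted(missing)}")
    reader_kwargs = {
        k: convert(v) for k, v in section.items() if k not in REQUIRED_KEYS
    }
    return required, reader_kwargs


def read_dataset(required, reader_kwargs):
    """Look up the named pandas reader and call it (needs pandas installed)."""
    import pandas as pd  # imported lazily so the parsing part runs without it

    reader = getattr(pd, required["data_read_func"])
    return reader(required["data_path"], **reader_kwargs)


parser = configparser.ConfigParser()
parser.read_string(EXAMPLE_CONF)
required, reader_kwargs = split_section(parser["my_dataset"])
print(reader_kwargs)
```

Here read_dataset is defined but not called, because the example data file does not exist. With a real file it would be equivalent to calling pandas.read_csv with sep=";" and header=0.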

Configure model parameters in model_params.conf

In this configuration file, any number of model parameter sections can be defined. All models that are defined here are processed by the program. Parameters that are not explicitly specified here take on their scikit-learn default values.

The section header (the string between the square brackets) must be identical to the scikit-learn class name of the algorithm (e.g., “KNeighborsClassifier”, “RandomForestClassifier”, …).
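
The following sketch shows one way section headers could be mapped to scikit-learn classes. The parameter values, the convention of writing values as Python literals (for example quoted strings) and the helper functions are assumptions made for illustration, not the project's actual implementation.

```python
import ast
import configparser

# Hypothetical contents of config/model_params.conf; values are examples.
EXAMPLE_CONF = """
[KNeighborsClassifier]
n_neighbors = 7
weights = 'distance'

[RandomForestClassifier]
n_estimators = 200
max_depth = None
"""


def parse_model_sections(conf_text):
    """Map each section header to a dict of Python-typed parameters."""
    parser = configparser.ConfigParser()
    parser.optionxform = str  # keep parameter names exactly as written
    parser.read_string(conf_text)
    return {
        name: {key: ast.literal_eval(value) for key, value in parser[name].items()}
        for name in parser.sections()
    }


def build_model(name, params):
    """Instantiate the scikit-learn classifier whose class name is `name`."""
    from sklearn.utils import all_estimators  # lazy: needs scikit-learn

    classifiers = dict(all_estimators(type_filter="classifier"))
    return classifiers[name](**params)


models = parse_model_sections(EXAMPLE_CONF)
print(models)
```

Because only the listed parameters are passed to the constructor in build_model, all other parameters keep their scikit-learn defaults. A header that does not match a classifier name raises a KeyError in this sketch.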

Results output structure

For each program run, a unique session folder is created in the subfolder output/. Its name contains the date and time at which the script was started, and all output files of that run are stored in it. By default, only the 20 latest session folders are kept; older ones are removed when the program is started again. In this way, the results of each program run are clearly separated, and accidental overwriting of old results is largely avoided.

The automatic cleanup of old session folders can be deactivated altogether by setting keep_sessions=None at initialization of the OutputManager object in the source code.
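
The retention rule can be sketched as follows. This is not the OutputManager code; the function cleanup_sessions, the folder naming scheme and the assumption that folder names sort chronologically are all hypothetical, and only the default of 20 kept sessions and the meaning of keep_sessions=None are taken from the description above.

```python
import shutil
import tempfile
from datetime import datetime, timedelta
from pathlib import Path


def cleanup_sessions(output_dir, keep_sessions=20):
    """Delete all but the `keep_sessions` newest session folders.

    Folder names are assumed to sort chronologically. Passing
    keep_sessions=None disables the cleanup.
    """
    if keep_sessions is None:
        return []
    sessions = sorted(p for p in Path(output_dir).iterdir() if p.is_dir())
    stale = sessions[:-keep_sessions] if keep_sessions > 0 else sessions
    for folder in stale:
        shutil.rmtree(folder)
    return [folder.name for folder in stale]


# Demonstration in a temporary directory with 25 made-up session folders.
with tempfile.TemporaryDirectory() as tmp:
    start = datetime(2018, 5, 4, 12, 0, 0)
    for i in range(25):
        stamp = (start + timedelta(minutes=i)).strftime("%Y-%m-%d_%H-%M-%S")
        (Path(tmp) / f"session_{stamp}").mkdir()
    removed = cleanup_sessions(tmp, keep_sessions=20)
    remaining = sorted(p.name for p in Path(tmp).iterdir())
    untouched = cleanup_sessions(tmp, keep_sessions=None)

print(len(removed), len(remaining), untouched)
```

In the demonstration, the five oldest of the 25 folders are deleted, and the second call with keep_sessions=None removes nothing.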

Uninstallation

To uninstall all installed dependencies, simply run:

pipenv uninstall --all

In order to also remove the virtual environment that has been created by pipenv, remove the corresponding folder in /home/<user>/.local/share/virtualenvs/.

Version history

0.1.4 (2018-05-04)

  • switch from unittest to pytest
  • improve test coverage
  • project restructuring

0.1.3 (2018-05-02)

  • switch installation method to pipenv
  • update documentation for the use of pipenv
  • remove landscape integration
  • integrate Better Code Hub

0.1.2 (2018-04-01)

  • add documentation, automation via readthedocs
  • code quality improvements
  • switch to pytest for travis-CI
  • minor bugfixes

0.1.1 (2018-03-31)

  • automated tests
  • automated code coverage reports
  • automatic code quality reports
  • new project structure
  • code quality improvements
  • minor bugfixes
  • update requirements

0.1.0 (2018-01-31)

  • submitted version
