Welcome to ML-generalization’s documentation!¶
This framework has been developed in the course of the Machine Learning lecture 184.702 at TU Wien.
Installation¶
The software has been developed using Ubuntu 16.04.
Requirements¶
Make sure to install these requirements first:
sudo apt install python-pip python-dev build-essential
sudo pip install --upgrade pip
sudo pip install --upgrade virtualenv
pip install pipenv
Download the software¶
Clone this repository by executing:
git clone https://github.com/tempse/ML-generalization.git
Install dependencies¶
Change into the downloaded repository folder (probably located in ~/ML-generalization/) and install all dependencies via pipenv:
cd ~/ML-generalization
pipenv install
This automatically creates a virtual environment and installs all dependencies into it.
Run commands in the created virtual environment¶
There are two ways to execute commands from within the newly created virtual environment:
Activate the environment by:
pipenv shell
(For this, you have to be in the same folder as the project’s Pipfile.)
Invoke shell commands without explicitly activating the environment:
pipenv run <command>
(Example: pipenv run python generalization.py or pipenv run pytest -v)
Running the tests¶
To run the automated tests, execute:
pytest -v --cov=generalization
(If you did not activate the virtual environment with pipenv shell, you have to prepend pipenv run to the above command!)
Usage¶
In order to execute the program, run:
python generalization.py
(If you did not activate the virtual environment with pipenv shell, prepend pipenv run to the above command!)
For the most part, the program is controlled by the three configuration files located in config/.
Configure run parameters in run_params.conf¶
The default section is called [run_params]. Unless otherwise specified, the parameters contained in it are loaded upon program execution.
To use a different set of run parameters, define a custom section and pass its name to the program via the terminal argument -run_mode <my_parameter_set>. The parameters of that section are then loaded instead of the default ones.
Required arguments for run_params.conf¶
- frac_train_sample: fractional size of training sample (float)
- num_test_samples: number of test samples (integer)
- num_CV_folds: number of cross-validation folds (integer)
- do_optimize_params: specify whether automatic parameter optimization should be performed instead of using the model parameters defined in model_params.conf
- n_iter: number of parameter settings that are sampled during hyperparameter optimization via Bayesian Optimization
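As a rough illustration of how such a file could be read, the sketch below parses a run-parameter file with Python's standard configparser module and selects either the default [run_params] section or a custom one. The section name quick_test, all parameter values, and the helper load_run_params are hypothetical; this is not the project's actual loading code.

```python
import configparser

# Hypothetical file content; all values are made up for illustration.
EXAMPLE_RUN_PARAMS = """
[run_params]
frac_train_sample = 0.8
num_test_samples = 10
num_CV_folds = 5
do_optimize_params = False
n_iter = 50

[quick_test]
frac_train_sample = 0.5
num_test_samples = 2
num_CV_folds = 3
do_optimize_params = False
n_iter = 5
"""


def load_run_params(text, run_mode="run_params"):
    """Return the typed parameters of the section named by run_mode."""
    parser = configparser.ConfigParser()
    parser.optionxform = str  # keep option names case-sensitive
    parser.read_string(text)
    if not parser.has_section(run_mode):
        raise KeyError("unknown parameter set: " + run_mode)
    section = parser[run_mode]
    return {
        "frac_train_sample": section.getfloat("frac_train_sample"),
        "num_test_samples": section.getint("num_test_samples"),
        "num_CV_folds": section.getint("num_CV_folds"),
        "do_optimize_params": section.getboolean("do_optimize_params"),
        "n_iter": section.getint("n_iter"),
    }


# Default section versus a custom section selected by name.
default_params = load_run_params(EXAMPLE_RUN_PARAMS)
custom_params = load_run_params(EXAMPLE_RUN_PARAMS, run_mode="quick_test")
```

Reading the values with typed getters makes a malformed entry fail early instead of surfacing later as an unexpected string.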
Configure data parameters in data_params.conf¶
In this configuration file, a virtually arbitrary number of parameter sections can be defined. All datasets that are defined here are processed by the program.
Required arguments for data_params.conf¶
- data_identifier: a string that unambiguously identifies the dataset (string)
- data_path: relative or absolute path to the data file (string)
- data_read_func: name of a Pandas method to read the file (string)
- data_target_col: name of the column holding the true class label (string)
- data_target_positive_label: value of the positive class label
- data_target_negative_label: value of the negative class label
- data_preprocessing: list of preprocessing methods that are applied to the data in the given order (currently implemented functions: standard_scale, parse_object_columns, fill_numerical, fill_categorical, rm_correlated, rm_low_variance)
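To make the list above more concrete, the following sketch checks one dataset section for the required arguments and splits the preprocessing list into ordered steps. The section name, path, and label values are placeholders, the helper check_data_section is hypothetical, and the comma-separated format of data_preprocessing is an assumption; the program may expect a different list syntax.

```python
import configparser

# Hypothetical section; dataset name, path and labels are placeholders.
EXAMPLE_DATA_PARAMS = """
[example_dataset]
data_identifier = example
data_path = data/example.csv
data_read_func = read_csv
data_target_col = label
data_target_positive_label = 1
data_target_negative_label = 0
data_preprocessing = fill_numerical, fill_categorical, standard_scale
"""

REQUIRED_KEYS = (
    "data_identifier",
    "data_path",
    "data_read_func",
    "data_target_col",
    "data_target_positive_label",
    "data_target_negative_label",
    "data_preprocessing",
)


def check_data_section(section):
    """Verify the required keys and return the preprocessing steps in order."""
    missing = [key for key in REQUIRED_KEYS if key not in section]
    if missing:
        raise ValueError("missing required arguments: " + ", ".join(missing))
    # Assumption: the steps are written as a comma-separated list.
    raw_steps = section["data_preprocessing"].split(",")
    return [step.strip() for step in raw_steps if step.strip()]


parser = configparser.ConfigParser()
parser.read_string(EXAMPLE_DATA_PARAMS)
# One entry per dataset section defined in the file.
steps_per_dataset = {
    name: check_data_section(parser[name]) for name in parser.sections()
}
```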
Optional arguments for data_params.conf¶
Any argument accepted by the function given in data_read_func can be stated here.
(Example: For the Pandas method read_csv, valid options are “sep”, “header”, etc.)
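One way to picture this pass-through is to treat every option that is not a required argument as a keyword argument for the reader function. The sketch below does this with a plain dictionary standing in for a parsed section; the helper reader_kwargs and the value conversion via ast.literal_eval are assumptions for illustration, not the program's confirmed behavior.

```python
import ast

REQUIRED_KEYS = {
    "data_identifier",
    "data_path",
    "data_read_func",
    "data_target_col",
    "data_target_positive_label",
    "data_target_negative_label",
    "data_preprocessing",
}


def reader_kwargs(section):
    """Collect every non-required option as a keyword argument."""
    kwargs = {}
    for key, value in section.items():
        if key in REQUIRED_KEYS:
            continue
        try:
            # Config values are strings; turn "0" into 0, "None" into None.
            kwargs[key] = ast.literal_eval(value)
        except (ValueError, SyntaxError):
            kwargs[key] = value  # keep plain strings such as ";" unchanged
    return kwargs


# Stand-in for a parsed section (hypothetical values).
example_section = {
    "data_path": "data/example.csv",
    "data_read_func": "read_csv",
    "sep": ";",
    "header": "0",
}
kwargs = reader_kwargs(example_section)

# With pandas installed, the reader could then be looked up by name:
#   import pandas as pd
#   reader = getattr(pd, example_section["data_read_func"])
#   frame = reader(example_section["data_path"], **kwargs)
```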
Configure model parameters in model_params.conf¶
In this configuration file, a virtually arbitrary number of model parameter sections can be defined. All models that are defined here are processed by the program. Parameters that are not explicitly specified here take on their scikit-learn default values.
The section header (the string between the square brackets) must be identical to the scikit-learn model name of the algorithm (e.g., “KNeighborsClassifier”, “RandomForestClassifier”, …)!
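The sketch below shows what a model parameter file of this shape might look like and how its sections could be turned into one keyword dictionary per model. The parameter values are made up, and the helpers parse_value and load_model_params are hypothetical; how the program itself converts the values is not documented here.

```python
import ast
import configparser

# Hypothetical file content; section headers are scikit-learn class names.
EXAMPLE_MODEL_PARAMS = """
[KNeighborsClassifier]
n_neighbors = 7
weights = distance

[RandomForestClassifier]
n_estimators = 200
max_depth = None
"""


def parse_value(text):
    """Convert a config string to a Python literal where possible."""
    try:
        return ast.literal_eval(text)
    except (ValueError, SyntaxError):
        return text  # plain strings such as "distance" stay strings


def load_model_params(text):
    """Map each section header (model name) to its keyword arguments."""
    parser = configparser.ConfigParser()
    parser.optionxform = str  # keep parameter names exactly as written
    parser.read_string(text)
    return {
        name: {key: parse_value(value) for key, value in parser[name].items()}
        for name in parser.sections()
    }


model_params = load_model_params(EXAMPLE_MODEL_PARAMS)

# With scikit-learn installed, a section could then be used like this:
#   from sklearn.neighbors import KNeighborsClassifier
#   model = KNeighborsClassifier(**model_params["KNeighborsClassifier"])
```

Because only the listed parameters are passed on, everything else keeps its scikit-learn default, which matches the behavior described above.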
Results output structure¶
Each program run creates a unique session folder in the subfolder output/ whose name contains the date and time at which the script was started. All output files of that run are stored in this folder. By default, only the 20 latest session folders are kept; older ones are removed when the program is started again. In this way, the results of each program run are clearly separated and accidental overwriting of old results is largely avoided.
The automatic cleanup of old session folders can be deactivated altogether by setting keep_sessions=None at initialization of the OutputManager object in the source code.
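The retention scheme described above can be sketched with the standard library alone. The function create_session_folder and the folder name pattern are hypothetical and do not reproduce the OutputManager implementation; only the keep_sessions parameter name and its None behavior are taken from the text.

```python
import os
import shutil
import tempfile
from datetime import datetime, timedelta


def create_session_folder(output_dir, started, keep_sessions=20):
    """Create a timestamped session folder, then prune the oldest ones."""
    # Assumed name pattern; it sorts chronologically as plain text.
    name = started.strftime("session_%Y-%m-%d_%H-%M-%S")
    session = os.path.join(output_dir, name)
    os.makedirs(session)

    if keep_sessions is not None:  # None disables the cleanup entirely
        existing = sorted(
            entry for entry in os.listdir(output_dir)
            if os.path.isdir(os.path.join(output_dir, entry))
        )
        n_remove = max(len(existing) - keep_sessions, 0)
        for entry in existing[:n_remove]:
            shutil.rmtree(os.path.join(output_dir, entry))
    return session


# Demonstration in a temporary directory: five runs, three folders kept.
with tempfile.TemporaryDirectory() as tmp:
    out = os.path.join(tmp, "output")
    start = datetime(2018, 5, 4, 12, 0, 0)
    for i in range(5):
        create_session_folder(out, start + timedelta(minutes=i), keep_sessions=3)
    remaining = sorted(os.listdir(out))
```

Encoding the start time in the folder name means the cleanup only needs a sorted directory listing and no separate bookkeeping file.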
Uninstallation¶
To uninstall all installed dependencies, simply run:
pipenv uninstall --all
In order to also remove the virtual environment that has been created by pipenv, remove the corresponding folder in /home/<user>/.local/share/virtualenvs/.
Version history¶
0.1.4 (2018-05-04)¶
- switch from unittest to pytest
- improve test coverage
- project restructuring
0.1.3 (2018-05-02)¶
- switch installation method to pipenv
- update documentation for the use of pipenv
- remove landscape integration
- integrate Better Code Hub
0.1.2 (2018-04-01)¶
- add documentation, automation via readthedocs
- code quality improvements
- switch to pytest for travis-CI
- minor bugfixes
0.1.1 (2018-03-31)¶
- automated tests
- automated code coverage reports
- automatic code quality reports
- new project structure
- code quality improvements
- minor bugfixes
- update requirements
0.1.0 (2018-01-31)¶
- submitted version