activegit documentation

Overview

activegit uses git to create shareable, distributable repositories of data and classifiers for active learning. activegit defines a Python class that exposes some standard git features as methods that simplify the active learning process.

Active learning is a machine learning technique for iteratively training a classifier. Training a classifier (“supervised learning”) can be labor intensive, as it requires serving data to experts (people) for tagging. This work is often shared by several members of a group. The goal of activegit is to wrap git functionality (e.g., tags, push/pull) to simplify sharing and distribution of a repository that keeps track of both the data used for classification and the classifier itself.

An activegit repo has three basic files:

  1. training.pkl – A pickle file with a dictionary containing data to train the classifier. The dictionary length is equal to the number of feature sets that have been labeled. Each feature set contains one or more features, but all feature sets must be the same size and match the number expected by the classifier.
  2. testing.pkl – Similar to training.pkl, but this file contains a dictionary used to test the classifier.
  3. classifier.pkl – A pickle file containing the classifier.
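
The layout of training.pkl described above can be sketched with the standard pickle module. This is a minimal illustration, not activegit's own code: it assumes each feature set is stored as a tuple key with its label as the value, and the helper check_feature_sets and the sample values are hypothetical.

```python
import os
import pickle
import tempfile


def check_feature_sets(data):
    """Return the common feature-set size, or raise if the sizes differ."""
    sizes = {len(feature_set) for feature_set in data}
    if len(sizes) > 1:
        raise ValueError(f"feature sets differ in size: {sorted(sizes)}")
    return sizes.pop() if sizes else 0


# Hypothetical labeled data: each key is one feature set (a tuple of
# features) and each value is its target label (1 = real, 0 = bogus).
training = {
    (0.12, 5.3, 1.0): 1,
    (0.80, 2.1, 0.0): 0,
    (0.45, 3.7, 1.0): 1,
}

# Every feature set must have the same number of features.
n_features = check_feature_sets(training)

# Round-trip the dictionary through a pickle file, as training.pkl would.
with tempfile.TemporaryDirectory() as repo:
    path = os.path.join(repo, "training.pkl")
    with open(path, "wb") as f:
        pickle.dump(training, f)
    with open(path, "rb") as f:
        restored = pickle.load(f)

# The dictionary length is the number of labeled feature sets.
print(len(restored), n_features)
```

Tuples are used as keys here because dictionary keys must be hashable; a list of features would have to be converted before it could serve as a key.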

After initializing an activegit repo, the training, testing, and classifier pickle files are filled and committed with a version name. This version name is tied to a git tag, so it can be recalled at any time.

Installation

activegit is available on PyPI, so it can be installed with pip:

pip install activegit

Examples

activegit can be initialized by instantiating the base class with a path to a repo. If the repo does not exist, a new one is created and filled with empty files:

> ag = activegit.ActiveGit('agdir')
ActiveGit initializing from repo at agdir
Available versions: initial

After it has been initialized, features can be served to an expert for labeling (e.g., “real/bogus” or strings as tags). The features and corresponding target labels can then be saved:

> ag.write_training_data(features, targets)
> ag.write_classifier(clf)
> ag.commit_version('newversion')
> ag.versions
['initial', 'newversion']
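
The labeling step that produces features and targets is not part of activegit itself. One way it might look is sketched below, under stated assumptions: collect_labels and threshold_expert are hypothetical names, and the "expert" is a simple function standing in for a person.

```python
def collect_labels(candidates, ask_expert):
    """Serve each feature set to an expert and keep the labeled ones.

    Returns two parallel lists: the feature sets and their target labels.
    Feature sets for which the expert returns None are skipped.
    """
    features, targets = [], []
    for feature_set in candidates:
        label = ask_expert(feature_set)
        if label is None:
            continue
        features.append(feature_set)
        targets.append(label)
    return features, targets


def threshold_expert(feature_set):
    """Stand-in for a human expert (hypothetical labeling rule)."""
    if feature_set[0] is None:
        return None  # the expert cannot judge this candidate
    return "real" if feature_set[0] > 0.5 else "bogus"


candidates = [(0.9, 1.2), (0.1, 3.4), (None, 0.0), (0.7, 2.2)]
features, targets = collect_labels(candidates, threshold_expert)

# features and targets are now parallel lists of equal length, in the
# form passed to ag.write_training_data(features, targets) above.
print(features)
print(targets)
```

Keeping the expert behind a plain callable means the same loop could serve a command-line prompt, a web form, or an automated rule.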

Methods

activegit defines a single class with methods and properties that wrap git features, such as tags and push/pull. Wrapping these features allows them to be cast to an active learning context.

class activegit.ActiveGit(repopath, bare=False, shared='group')

Uses a git repo to keep track of active learning data and classifier.

The standard set of files is: ‘training.pkl’, ‘testing.pkl’, and ‘classifier.pkl’. The first two each contain a dictionary with features as keys and target labels (e.g., 0/1) as values. The third file contains the classifier (e.g., from sklearn).

Tags are central to tracking the classifier and data. A new repo starts with empty files and a tag “initial”. Branch ‘master’ keeps the latest version and branch ‘working’ is used for the active session. After a new version is committed, the working branch is merged into master and deleted, and a new working branch is checked out.
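
The tag and branch cycle described above can be approximated with plain git commands. The sketch below drives the git command line through subprocess; it is not activegit's implementation. The function names, the commit identity, and the exact order of the git calls are assumptions made for illustration, and it requires git to be installed.

```python
import os
import shutil
import subprocess

STANDARD_FILES = ("training.pkl", "testing.pkl", "classifier.pkl")


def git_available():
    """Return True if a git executable is on the PATH."""
    return shutil.which("git") is not None


def git(repo, *args):
    """Run one git command inside repo and return its stripped stdout."""
    cmd = ["git", "-c", "user.name=sketch",
           "-c", "user.email=sketch@example.com", *args]
    done = subprocess.run(cmd, cwd=repo, check=True,
                          capture_output=True, text=True)
    return done.stdout.strip()


def init_repo(repo):
    """Create empty standard files, commit and tag them as 'initial'."""
    git(repo, "init")
    # Name the first branch 'master' regardless of the local git default.
    git(repo, "symbolic-ref", "HEAD", "refs/heads/master")
    for name in STANDARD_FILES:
        open(os.path.join(repo, name), "wb").close()
    git(repo, "add", *STANDARD_FILES)
    git(repo, "commit", "-m", "initial")
    git(repo, "tag", "initial")
    # The active session happens on a separate 'working' branch.
    git(repo, "checkout", "-b", "working")


def commit_version(repo, version):
    """Commit and tag the working state, then refresh the branches."""
    git(repo, "add", "-A")
    git(repo, "commit", "-m", version)
    git(repo, "tag", version)
    # Merge working into master, delete it, and start a new one.
    git(repo, "checkout", "master")
    git(repo, "merge", "working")
    git(repo, "branch", "-d", "working")
    git(repo, "checkout", "-b", "working")
```

Calling init_repo on an empty directory, changing one of the pickle files, and then calling commit_version(repo, "newversion") leaves two tags, with master pointing at the newly tagged commit and a fresh working branch checked out.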

Setting bare=True creates a bare git repo that can be shared (cloned) by a group locally or via git daemon sharing.
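
In plain git terms, a shared bare repo can be created and cloned as sketched below. The source does not say how the bare and shared arguments are passed to git, so the mapping to git init's --bare and --shared options is an assumption, and init_shared_bare and clone are hypothetical helper names. The sketch requires git to be installed.

```python
import shutil
import subprocess


def git_available():
    """Return True if a git executable is on the PATH."""
    return shutil.which("git") is not None


def init_shared_bare(path, shared="group"):
    """Create a bare repository whose files are writable by the group."""
    subprocess.run(
        ["git", "init", "--bare", f"--shared={shared}", path],
        check=True, capture_output=True,
    )


def clone(source, dest):
    """Clone a local repository; each group member works in a clone."""
    subprocess.run(
        ["git", "clone", source, dest],
        check=True, capture_output=True,
    )
```

A bare repository has no working tree, which is why it is suited to serving as the shared copy that group members clone from and push to.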

classifier

Returns classifier from classifier.pkl

commit_version(version, msg=None)

Add tag, commit, and push changes

initializerepo()

Fill empty directory with products and make first commit

isvalid

Checks whether contents of repo are consistent with standard set.
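
A check of this kind might look like the sketch below. It assumes that "consistent with the standard set" means that the three standard files are present; the actual test performed by isvalid may differ, and both function names are hypothetical.

```python
import os

STANDARD_FILES = ("training.pkl", "testing.pkl", "classifier.pkl")


def missing_standard_files(repopath):
    """List the standard files that are absent from repopath."""
    return [
        name for name in STANDARD_FILES
        if not os.path.isfile(os.path.join(repopath, name))
    ]


def looks_valid(repopath):
    """Return True if every standard file is present in repopath."""
    return not missing_standard_files(repopath)
```

Reporting which files are missing, rather than only a boolean, makes it easier to tell an uninitialized directory from a partially damaged repo.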

set_version(version, force=True)

Sets the version name for the current state of repo

show_version_info(version)

Summarizes info of a particular version (a la “git show version”)

testing_data

Returns data dictionary from testing.pkl

training_data

Returns data dictionary from training.pkl

update()

Pull latest versions/tags, if linked to a remote (e.g., github).

version

Current version checked out.

versions

Sorted list of versions committed thus far.

write_classifier(clf)

Writes classifier object to pickle file

write_testing_data(features, targets)

Writes data dictionary to testing.pkl

write_training_data(features, targets)

Writes data dictionary to training.pkl
