activegit documentation
Overview
activegit uses git to create shareable, distributable repositories of data and classifiers for active learning. activegit defines a Python class that casts some standard git features as methods that simplify the active learning process.
Active learning is a machine learning technique for iteratively training a classifier. Training a classifier (“supervised learning”) can be labor-intensive, as it requires serving data to experts (people) for tagging. This work is often shared by several members of a group. The goal of activegit is to wrap git functionality (e.g., tags, push/pull) to simplify sharing and distribution of a repository that keeps track of both the data used for classification and the classifier itself.
An activegit repo has three basic files:
- training.pkl – A pickle file with a dictionary containing data to train the classifier. The dictionary length is equal to the number of feature sets that have been labeled. Each feature set contains one or more features, but all feature sets must be the same size and match the number expected by the classifier.
- testing.pkl – Similar to training.pkl, but this file contains a dictionary used to test the classifier.
- classifier.pkl – A pickle file containing the classifier.
After initializing an activegit repo, the training, testing, and classifier pickle files are filled and committed with a version name. This version name is tied to a git tag, so it can be recalled at any time.
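The data files described above can be sketched with the standard pickle module. This is a minimal illustration and not activegit's own code: the sample values are made up, and storing each feature set as a tuple is an assumption (the documentation only says that features are dictionary keys and target labels are values).

```python
import os
import pickle
import tempfile

# Hypothetical labeled data: each key is one feature set (a tuple of two
# features here), each value is its target label (1 = real, 0 = bogus).
training = {(0.12, 3.4): 1, (0.87, 1.1): 0, (0.45, 2.2): 1}

# All feature sets must be the same size.
sizes = {len(feature_set) for feature_set in training}
assert len(sizes) == 1

# Write the dictionary to a file named like the repo's training file,
# then read it back. A temporary directory stands in for the repo.
path = os.path.join(tempfile.mkdtemp(), "training.pkl")
with open(path, "wb") as f:
    pickle.dump(training, f)
with open(path, "rb") as f:
    restored = pickle.load(f)

# The dictionary length is the number of labeled feature sets.
print(len(restored))
```

Tuples are used because dictionary keys must be hashable; a list of features would not work as a key.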
Examples
activegit can be initialized by instantiating the base class with a path to a repo. If the repo does not exist, a new one is created and filled with empty files:
> import activegit
> ag = activegit.ActiveGit('agdir')
ActiveGit initializing from repo at agdir
Available versions: initial
After it has been initialized, features can be served to an expert for labeling (e.g., “real/bogus” or strings as tags). The features and corresponding target labels can then be saved:
> ag.write_training_data(features, targets)
> ag.write_classifier(clf)
> ag.commit_version('newversion')
> ag.versions
['initial', 'newversion']
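The steps above could be wrapped in a small helper, sketched below. The function name save_labeled_round and the duplicate-version guard are hypothetical additions for illustration; the helper calls only methods and properties named in this documentation and assumes ag behaves like an activegit.ActiveGit instance.

```python
def save_labeled_round(ag, features, targets, clf, version):
    """Store one round of expert labels and the retrained classifier.

    `ag` is assumed to behave like an activegit.ActiveGit instance.
    This helper is a hypothetical convenience, not part of activegit.
    """
    # Optional guard (an assumption of this sketch): refuse to reuse
    # an existing version name.
    if version in ag.versions:
        raise ValueError("version %r already exists" % version)
    ag.write_training_data(features, targets)
    ag.write_classifier(clf)
    ag.commit_version(version)
    return ag.versions
```

With a real repo this would be called as save_labeled_round(ag, features, targets, clf, 'newversion') after each labeling session.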
Methods
activegit defines a single class with methods and properties that wrap git features, such as tags and push/pull. Wrapping these features allows them to be cast to an active learning context.
class activegit.ActiveGit(repopath, bare=False, shared='group')
Uses a git repo to keep track of active learning data and classifier.
The standard set of files is ‘training.pkl’, ‘testing.pkl’, and ‘classifier.pkl’. The first two each contain a dictionary with features as keys and target labels (e.g., 0/1) as values. The third file contains the classifier (e.g., from sklearn).
Tags are central to tracking the classifier and data. A new repo starts with empty files and the tag “initial”. Branch ‘master’ holds the latest version and branch ‘working’ is used for the active session. After a new version is committed, the working branch is merged into master and deleted, and a new working branch is checked out.
Setting bare=True creates a bare git repo that can be shared (cloned) by a group locally or via git daemon sharing.
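The branch and tag workflow described above corresponds roughly to a sequence of plain git commands. The sketch below only builds and prints that sequence; it does not run git. The function name commit_version_commands, the default commit message, and the exact order of steps are assumptions for illustration, since the commands activegit issues internally are not listed here.

```python
def commit_version_commands(version, msg=None):
    """Return git commands (as argument lists) that mirror the described
    workflow. Hypothetical helper; the order of steps is an assumption."""
    msg = msg or "version " + version
    return [
        # Record the three standard files on the working branch.
        ["git", "add", "training.pkl", "testing.pkl", "classifier.pkl"],
        ["git", "commit", "-m", msg],
        # Tie the version name to a tag so it can be recalled later.
        ["git", "tag", version],
        # Merge working into master, delete it, and start a fresh one.
        ["git", "checkout", "master"],
        ["git", "merge", "working"],
        ["git", "branch", "-d", "working"],
        ["git", "checkout", "-b", "working"],
    ]


for cmd in commit_version_commands("newversion"):
    print(" ".join(cmd))
```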
- classifier – Returns classifier from classifier.pkl
- commit_version(version, msg=None) – Add tag, commit, and push changes
- initializerepo() – Fill empty directory with products and make first commit
- isvalid – Checks whether contents of repo are consistent with standard set.
- set_version(version, force=True) – Sets the version name for the current state of repo
- show_version_info(version) – Summarizes info of a particular version (a la “git show version”)
- testing_data – Returns data dictionary from testing.pkl
- training_data – Returns data dictionary from training.pkl
- update() – Pull latest versions/tags, if linked to a remote (e.g., github).
- version – Current version checked out.
- versions – Sorted list of versions committed thus far.
- write_classifier(clf) – Writes classifier object to pickle file
- write_testing_data(features, targets) – Writes data dictionary to filename
- write_training_data(features, targets) – Writes data dictionary to filename
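The data dictionaries returned by training_data and testing_data usually need to be split into parallel lists of features and targets before they are passed to a classifier. A minimal sketch follows, assuming the dictionary maps feature tuples to labels; the helper name split_features_targets and the sample values are hypothetical.

```python
def split_features_targets(data):
    """Split a {feature_set: label} dictionary into two parallel lists.

    Hypothetical helper, not part of activegit. Assumes each key is a
    tuple of features and each value is its target label.
    """
    features = [list(feature_set) for feature_set in data]
    targets = [data[feature_set] for feature_set in data]
    return features, targets


# Stand-in for a dictionary such as the one read from testing.pkl.
testing = {(0.12, 3.4): 1, (0.87, 1.1): 0}
X, y = split_features_targets(testing)
print(X, y)
```

The resulting lists have the shape expected by estimators that take a features matrix and a label vector, such as the fit and score methods of sklearn classifiers.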