rfgb

Getting Started

Installation and a few motivating examples.

Development Environment Setup

The first step is to get Python running on your machine (skip to the next step if you’ve already done this).

Linux (yum/dnf)

$ sudo yum update
$ sudo yum install python

Linux (apt-get)

$ sudo apt-get update
$ sudo apt-get upgrade
$ sudo apt-get install python

Windows

Download Python from python.org or anaconda.com.

A fairly in-depth guide is available as part of the Conda documentation.

Getting Started

Installation

rfgb can be installed via the following methods:

  1. Stable builds on PyPI

    pip install rfgb
    
  2. Development builds on GitHub

    pip install git+https://github.com/hayesall/rfgb.git
    
  3. Bleeding-edge development builds on the GitHub Development Branch

    pip install git+https://github.com/hayesall/rfgb.git@development
    

Background

The main function of rfgb is to learn relational dependency networks [1] via gradient tree boosting, based on Natarajan et al. “Boosting Relational Dependency Networks” [2].

This algorithm is implemented in the __main__ module of the rfgb package.

# rfgb.__main__

from .boosting import updateGradients
from .tree import node
from .utils import Utils

# ... class Arguments:

parameters = Arguments().args

for target in parameters.target:

    # Read the training data
    trainData = Utils.readTrainingData(target,
                                       path=parameters.train,
                                       regression=parameters.reg,
                                       advice=parameters.expAdvice)

    # Initialize an empty list for the trees.
    trees = []

    # Learn each tree and update the gradients.
    for i in range(parameters.trees):

        node.setMaxDepth(2)
        node.learnTree(trainData)
        trees.append(node.learnedDecisionTree)
        updateGradients(trainData, trees)
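
The gradient step in the loop above can be illustrated outside of rfgb. The sketch below assumes the standard functional-gradient form for classification, where each example's gradient is its label indicator minus the current predicted probability. The helper names sigmoid and functional_gradient are hypothetical and are not part of rfgb's API.

```python
import math


def sigmoid(x):
    """Logistic function mapping a summed regression value to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


def functional_gradient(is_positive, summed_tree_values):
    """Gradient for one example: I(y = 1) - P(y = 1 | x)."""
    indicator = 1.0 if is_positive else 0.0
    return indicator - sigmoid(summed_tree_values)


# Before any trees are learned the summed value is 0, so P = 0.5.
print(functional_gradient(True, 0.0))   # 0.5
print(functional_gradient(False, 0.0))  # -0.5
```

Each new tree is fit to these gradients, so examples the current model already predicts well contribute values close to zero.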

File Structure

File structure follows the structure used by BoostSRL.

Data is currently stored in separate training and testing directories. Flat files are read from these directories and converted to an internal relational representation, which can then be reasoned over.

References

Running rfgb

Reasoning about the World

Objects and their relationships are a natural way to think about the world. In this example, we have some facts about the world which we want to learn from. More specifically, we have a table of people and their relationships.

Name    Gender  Child        Sibling
James   Male    [Harry]
Lily    Female  [Harry]      Petunia
Harry   Male
Arthur  Male    [Ron, Fred]
Molly   Female  [Ron, Fred]
Ron     Male                 [Fred]
Fred    Male                 [Ron]

Assume that the goal is to learn father(Y,X). We want to learn logical rules representing that domain object X is the father of Y (both of which are people in this case), given that you know information about their gender, children, and siblings.

From Tables to First-order Predicate Logic

Once we have a high-level idea of what these relationships look like, the next step is to convert this into predicate logic format. This format is standard for most Prolog-based systems.

A few assumptions we will make about our data:

  1. ‘Name’ is an identifier.
  2. ‘Gender’ is male or female in this case, so we can make it a true/false value.
  3. ‘Child’ and ‘Sibling’ are binary relationships between two people (e.g. childof(lily, harry) denotes that ‘harry’ is the child of ‘lily’).
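
The assumptions above suggest a mechanical conversion from table rows to facts. The following is a minimal sketch and is not part of rfgb: the rows structure and the to_facts helper are hypothetical, and names are simply lowercased.

```python
# Each row: (name, gender, children, siblings), mirroring the table above.
rows = [
    ("James", "Male", ["Harry"], []),
    ("Lily", "Female", ["Harry"], ["Petunia"]),
    ("Harry", "Male", [], []),
]


def to_facts(rows):
    """Convert table rows into predicate-logic facts as strings."""
    facts = []
    for name, gender, children, siblings in rows:
        person = name.lower()
        # Gender becomes a true/false value: only male(...) is asserted.
        if gender == "Male":
            facts.append("male({0}).".format(person))
        for child in children:
            facts.append("childof({0},{1}).".format(person, child.lower()))
        for sibling in siblings:
            facts.append("siblingof({0},{1}).".format(person, sibling.lower()))
    return facts


for fact in to_facts(rows):
    print(fact)
```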

The target we want to learn is father(Y,X). To learn this rule, rfgb learns a decision tree that most effectively splits the positive and negative examples. This example is fairly small, so a small number of trees should suffice, but more complicated problems may need more trees to learn a robust model.

Positive Examples:

father(harrypotter,jamespotter).
father(ginnyweasley,arthurweasley).
father(ronweasley,arthurweasley).
father(fredweasley,arthurweasley).
...

Negative Examples:

father(harrypotter,mollyweasley).
father(georgeweasley,jamespotter).
father(harrypotter,arthurweasley).
father(harrypotter,lilypotter).
father(ginnyweasley,harrypotter).
father(mollyweasley,arthurweasley).
father(fredweasley,georgeweasley).
father(georgeweasley,fredweasley).
father(harrypotter,ronweasley).
father(georgeweasley,harrypotter).
father(mollyweasley,lilypotter).
...

Facts:

male(jamespotter).
male(harrypotter).
male(arthurweasley).
male(ronweasley).
male(fredweasley).
male(georgeweasley).
siblingof(ronweasley,fredweasley).
siblingof(ronweasley,georgeweasley).
siblingof(ronweasley,ginnyweasley).
siblingof(fredweasley,ronweasley).
siblingof(fredweasley,georgeweasley).
siblingof(fredweasley,ginnyweasley).
siblingof(georgeweasley,ronweasley).
siblingof(georgeweasley,fredweasley).
siblingof(georgeweasley,ginnyweasley).
siblingof(ginnyweasley,ronweasley).
siblingof(ginnyweasley,fredweasley).
siblingof(ginnyweasley,georgeweasley).
childof(jamespotter,harrypotter).
childof(lilypotter,harrypotter).
childof(arthurweasley,ronweasley).
childof(mollyweasley,ronweasley).
childof(arthurweasley,fredweasley).
childof(mollyweasley,fredweasley).
childof(arthurweasley,georgeweasley).
childof(mollyweasley,georgeweasley).
childof(arthurweasley,ginnyweasley).
childof(mollyweasley,ginnyweasley).
...

Training a Model

There is one more piece we still need: background knowledge about the world.

// Parameters
setParam: maxTreeDepth=3.
setParam: nodeSize=1.
setParam: numOfClauses=8.

// Modes
mode: male(+name).
mode: childof(+name,+name).
mode: siblingof(+name,-name).
mode: father(+name,+name).
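
To make the structure of the mode lines concrete, the sketch below splits each argument into its leading mode symbol and its type. It is an illustration only: the parse_mode helper and its regular expression are hypothetical and do not reflect how rfgb reads this file.

```python
import re

# Hypothetical parser for lines such as "mode: siblingof(+name,-name)."
MODE_PATTERN = re.compile(r"^mode:\s*(\w+)\((.*)\)\.$")


def parse_mode(line):
    """Return (predicate, [(symbol, type), ...]) or None for non-mode lines."""
    match = MODE_PATTERN.match(line.strip())
    if match is None:
        return None
    predicate, arguments = match.groups()
    parsed = []
    for argument in arguments.split(","):
        argument = argument.strip()
        # The leading character is the mode symbol (+, -, or #).
        parsed.append((argument[0], argument[1:]))
    return predicate, parsed


background = [
    "// Modes",
    "mode: male(+name).",
    "mode: siblingof(+name,-name).",
]

for line in background:
    print(parse_mode(line))
```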

Begin training:

python -m rfgb --help

Development

Comments on developing rfgb further.

Contributing

From the BoostSRL Contributing Guidelines:

“Our goal is to push the boundaries of machine learning and statistical relational learning through open development and explainable approaches to decision making in both learning and inference. We believe that these are some of the best ways to create trustworthy systems that people can learn from and interact with in their daily lives.”

The goal of this project is to match and eventually extend beyond BoostSRL (the Java version of the codebase). Contributions that further this goal are welcome.

Code of Conduct

We adopt the Contributor Covenant Code of Conduct.

Instances of abusive, harassing, or otherwise unacceptable behavior may be reported by contacting the project team at alexander.hayes@utdallas.edu. All complaints will be reviewed and investigated and will result in a response that is deemed necessary and appropriate for the circumstances. The project team is obligated to maintain confidentiality with regard to the reporter of the incident.

Development Cheat-Sheet

  1. Fork and clone the source from GitHub

    git clone https://github.com/hayesall/rfgb.git
    
  2. Build a local copy of the documentation

    We use Sphinx autodoc with a combination of inline docstrings and reStructuredText for documenting this project. Pull requests and further updates should include appropriate documentation.

    A local copy of the documentation may be built from the Makefile:

    cd docs
    make html
    xdg-open build/html/index.html
    
  3. Run the unit tests

    rfgb/tests/ contains a suite of unit tests, which can be run with the following:

    python rfgb/tests/tests.py
    

    Note

    As of 0.2.0, these should be run from the base of the repository due to their import structure.

Unit Tests

The main testing module for rfgb must be run from the base of the project repository.

For example:

python rfgb/tests/tests.py

Verbosity may be explicitly set by passing an integer with the -v flag. The value will be passed into unittest.TextTestRunner, so integers higher than 1 will lead to more verbose outputs.

python rfgb/tests/tests.py -v 2
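
The way a verbosity flag reaches unittest.TextTestRunner can be sketched as follows. This is not the contents of rfgb/tests/tests.py: the ExampleTest case, the run_tests helper, and the --verbose long option are all hypothetical stand-ins.

```python
import argparse
import io
import unittest


class ExampleTest(unittest.TestCase):
    """Stand-in test case; the real suite lives in rfgb/tests/."""

    def test_addition(self):
        self.assertEqual(1 + 1, 2)


def run_tests(argv):
    """Parse -v and forward its value to unittest.TextTestRunner."""
    parser = argparse.ArgumentParser()
    parser.add_argument("-v", "--verbose", type=int, default=1)
    args = parser.parse_args(argv)

    suite = unittest.defaultTestLoader.loadTestsFromTestCase(ExampleTest)
    # Capture the runner's report in memory so it can be inspected.
    stream = io.StringIO()
    runner = unittest.TextTestRunner(stream=stream, verbosity=args.verbose)
    result = runner.run(suite)
    return result, stream.getvalue()


result, output = run_tests(["-v", "2"])
print(output)
```

With a verbosity of 2 the report names each test as it runs, while the default of 1 prints a single dot per test.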

Testing Individual Modules

Individual modules may be tested with unittest via the command line.

python -m unittest rfgb/tests/rfgbtests/test_Utils.py
.......
--------------------------------------------------
Ran 7 tests in 0.005s

OK

Commandline

Workflow

rfgb provides a command-line interface for learning and inference with different types of statistical relational learning methods, and for managing the learned models for particular data sets.

Some of the ideas here are shamelessly inspired by Git, so the workflow should feel familiar to anyone who has used version control.

$ pip install rfgb
$ rfgb --help
usage: rfgb [-h] [-V] {init,learn,infer} ...

rfgb: Relational Functional Gradient Boosting is a gradient-boosting
approach to learning statistical relational models.

optional arguments:
  -h, --help          show this help message and exit
  -V, --version       show version number and exit

rfgb Subcommands:
  Commands and subcommands for rfgb.

  {init,learn,infer}  $ rfgb --help
    init              Initialize a .rfgb directory.
    learn             Learn various SRL models.
    infer             Infer with various SRL models.

Assuming you start with a training and test set (we’ll talk about those later), you can initialize a place where your models and meta-data will be stored.

$ rfgb init

This creates a .rfgb directory containing a models directory.
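
The resulting layout can be reproduced with a few lines of Python. This is a sketch of the directory structure only, assuming nothing else is written during initialization; init_workspace is a hypothetical helper, not the code behind rfgb init.

```python
import tempfile
from pathlib import Path


def init_workspace(base):
    """Create base/.rfgb/models and return the path to the models directory."""
    models = Path(base) / ".rfgb" / "models"
    # exist_ok makes repeated initialization harmless in this sketch.
    models.mkdir(parents=True, exist_ok=True)
    return models


# Use a temporary directory so the example leaves nothing behind.
with tempfile.TemporaryDirectory() as workdir:
    models_dir = init_workspace(workdir)
    created = models_dir.is_dir()
    print(created)
```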

Data

Data scientists often spend a great deal of time getting data into a particular format. It would be overly ambitious to claim that we have solved this problem, but we try to reduce the time spent cleaning data.

The format we use is similar to Prolog, but with a clear distinction between data and programs.

Machine learning is often described as learning a function over a feature vector \(x\) that predicts a target value \(y\).

\[f(x) = P(y \mid x)\]

Defining Terms

The terms we invoke to describe these functions are Positives, Negatives, Facts, and Background Knowledge.

  • Positive examples are true (or correct) examples that we want to learn from.
  • Negative examples are false (or incorrect) examples, which the model should learn to reject.
  • Facts are features we use to learn. We make the assumption that some combination of the facts can be used to distinguish between positives and negatives.
  • Background Knowledge comes in many forms, but is a way to introduce more information to learn more effectively. If a classifier is learning to distinguish handwritten digits, extra negative examples might be created by rotating digits. Background Knowledge about this domain might involve not rotating “6” and “9”, since each looks like the other when rotated by 180 degrees.

Background Knowledge is often described as the “black magic” or “expert knowledge” in machine learning. Many of our methods are designed to effectively incorporate this kind of knowledge, and solicit it in a variety of ways.

Format

Positives, negatives, and facts are contained in pos.txt, neg.txt, and facts.txt. Some examples are contained in the testDomains directory at the base of this repository.

For example, testDomains/HeartAttack/train/ contains:

pos.txt:

ha(p1)
ha(p6)

neg.txt:

ha(p2)
ha(p3)
ha(p4)
ha(p5)
ha(p7)
ha(p8)
ha(p9)
ha(p10)

facts.txt:

chol(p1,high)
race(p1,r1)
chol(p2,medium)
race(p2,r1)
chol(p3,medium)
race(p3,r1)
chol(p4,medium)
race(p4,r1)
chol(p5,low)
race(p5,r1)
chol(p6,high)
race(p6,r2)
chol(p7,medium)
race(p7,r2)
chol(p8,medium)
race(p8,r2)
chol(p9,medium)
race(p9,r2)
chol(p10,low)
race(p10,r2)

Background knowledge:

ha(person)
chol(+person,[low;medium;high])
race(+person,[r1;r2])
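
Because each file holds one literal per line, a directory in this format can be loaded with very little code. The sketch below is independent of rfgb's own reader (Utils.readTrainingData): the read_lines and read_directory helpers are hypothetical, and the file contents written here are illustrative.

```python
import tempfile
from pathlib import Path


def read_lines(path):
    """Return the non-empty, stripped lines of a text file."""
    with open(path) as handle:
        return [line.strip() for line in handle if line.strip()]


def read_directory(directory):
    """Load pos.txt, neg.txt, and facts.txt from a directory into a dict."""
    directory = Path(directory)
    return {name: read_lines(directory / (name + ".txt"))
            for name in ("pos", "neg", "facts")}


# Write a tiny data set to a temporary directory, then read it back.
with tempfile.TemporaryDirectory() as train:
    Path(train, "pos.txt").write_text("ha(p1)\nha(p6)\n")
    Path(train, "neg.txt").write_text("ha(p2)\nha(p3)\n")
    Path(train, "facts.txt").write_text("chol(p1,high)\nrace(p1,r1)\n")
    data = read_directory(train)

print(data["pos"])
```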

Positive

The latter are inspired by the FOIL method and paper.

rfgb Submodules

Submodules found in rfgb.

rfgb.boosting module

Core methods for performing learning and inference, such as computing gradients, updating gradients, and performing inference.

Documentation

rfgb.boosting.computeAdviceGradient(example)[source]

Proves each clause (Prover.prove()) and computes the advice gradient as NumberTrue - NumberFalse.

Parameters: example

rfgb.boosting.computeSumOfGradients(example, trees, data)[source]

Computes new gradients for an example.

Parameters:
  • example
  • trees
  • data

rfgb.boosting.inferTreeValue(clauses, query, data)[source]

Returns the probability of query given data and the clauses learned.

Parameters:
  • clauses
  • query
  • data

rfgb.boosting.performInference(testData, trees)[source]

Computes the probabilities for test examples.

Parameters:
  • testData (utils.Data object.) – Data for testing.
  • trees (list.) – List of strings representing learned decision trees.

Example:

from rfgb.boosting import performInference

rfgb.boosting.updateGradients(data, trees, loss='LS', delta=None)[source]

Update gradients of the data.

Parameters:
  • data (utils.Data object.) – Training or testing data (with parameters).
  • trees (list.) – List of strings representing trees.
  • loss (str.) – Loss function for regression (currently implemented: ‘LS’, ‘LAD’, ‘Huber’).
  • delta (float) – Delta value for Huber loss.

Example:

from rfgb.boosting import updateGradients

rfgb.logic module

(docstring)

class rfgb.logic.Goal(rule, parent=None, env={})[source]

Bases: object

Class for each goal in a rule during Prolog search.

class rfgb.logic.Logic[source]

Bases: object

Class for logic operations.

static constantsPresentInLiteral(literalTypeSpecification)[source]

Returns True if constants are present in the type specification.

static generateTests(literalName, literalTypeSpecification, clause)[source]

Generates tests for literal according to modes and types.

static getVariables(literal)[source]

Returns variables in the literal.

class rfgb.logic.Prover[source]

Bases: object

Class for Prolog-style proof of a query.

goalId = 100
static prove(data, example, clause)[source]

Proves whether example satisfies clause given the data. Returns True if it does, otherwise False.

The Prover class attributes:

  • Prover.rules: contains all of the rules.
  • Prover.trace: if this is 1, the proof tree is displayed.
  • Prover.goalId: stores the goal ID.

rules = []
static search(term)[source]

Method to perform Prolog-style query search.

trace = 0
static unify(srcTerm, srcEnv, destTerm, destEnv)[source]

Unification method.
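
For readers unfamiliar with unification, the idea can be sketched on flat literals. This is a simplified illustration and not rfgb's implementation: it ignores the separate source and destination environments in the signature above, treats a variable name shared by both literals as the same variable, and does not handle nested terms. All helper names are hypothetical.

```python
def is_variable(token):
    """Prolog convention: variables start with an uppercase letter."""
    return token[:1].isupper()


def parse(literal):
    """Split 'father(X,jamespotter)' into ('father', ['X', 'jamespotter'])."""
    name, _, rest = literal.partition("(")
    return name, [arg.strip() for arg in rest.rstrip(")").split(",")]


def resolve(token, bindings):
    """Follow variable bindings until reaching a constant or unbound variable."""
    while is_variable(token) and token in bindings:
        token = bindings[token]
    return token


def unify_literals(left, right):
    """Return a dict of variable bindings, or None if unification fails."""
    left_name, left_args = parse(left)
    right_name, right_args = parse(right)
    if left_name != right_name or len(left_args) != len(right_args):
        return None
    bindings = {}
    for a, b in zip(left_args, right_args):
        a, b = resolve(a, bindings), resolve(b, bindings)
        if a == b:
            continue
        if is_variable(a):
            bindings[a] = b
        elif is_variable(b):
            bindings[b] = a
        else:
            return None  # two different constants never unify
    return bindings


print(unify_literals("father(X,jamespotter)", "father(harrypotter,Y)"))
print(unify_literals("father(X,X)", "father(harrypotter,jamespotter)"))
```

The second call fails because X cannot be bound to two different constants at once.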

class rfgb.logic.Rule(s)[source]

Bases: object

Class for logic rules in a Prolog proof.

class rfgb.logic.Term(s)[source]

Bases: object

Class for a term in a Prolog proof.

rfgb.tree module

Data structures and methods for learning decision trees.

class rfgb.tree.node(test=None, examples=None, information=None, level=None, parent=None, pos=None)[source]

Bases: object

A node in a tree.

Parameters:
  • expandQueue – Breadth first search node expansion strategy
  • depth – initial depth is 0 because no node present
  • maxDepth – max depth set to 1 because we want to at least learn a tree of depth 1
  • learnedDecisionTree – this will hold all the clauses learned
  • data – stores all the facts, positive and negative examples

data = None
depth = 0
expandOnBestTest(data=None)[source]

Expand the node based on the best test.

expandQueue = []
getTrueExamples(clause, test, data)[source]

Returns all examples that satisfy the clause with conjoined test literal.

static initTree(trainingData)[source]

Create the root node of the tree.

static learnTree(data)[source]

Method to create and learn the decision tree.

learnedDecisionTree = []
maxDepth = 1
static setMaxDepth(depth)[source]

Set the maximum depth of the tree.

rfgb.utils module

(docstring for utils)

class rfgb.utils.Data(regression=False, advice=False, softm=False, alpha=0.0, beta=0.0)[source]

Bases: object

Object containing the relational data.

getExampleTrueValue(example)[source]

Returns true regression value of an example for regression learning.

getFacts()[source]

Returns the facts in the data.

getLiterals()[source]

Gets all the literals in the facts.

getTarget()[source]

Returns the target.

getValue(example)[source]

Returns the regression value for an example.

Example:

from rfgb.utils import Utils
from rfgb.utils import Data

trainingData = Utils.readTrainingData('cancer',
                    path='testDomains/ToyCancer/train/')

x = trainingData.getValue('cancer(earl)')
# x == -0.5, since earl doesn't have cancer.

y = trainingData.getValue('cancer(alice)')
# y == 0.5, since alice does have cancer.

setBackground(bk)[source]

Obtains the literals and their type specifications. Types can be either variable or a list of constants.

setExamples(examples, target)[source]

Set examples for regression.

setFacts(facts)[source]

Mutate the facts in the data object.

Parameters: facts (list.) – List of strings representing the facts.
Returns: None

setNeg(neg, target)[source]

Set negative examples based on the contents of a list.

setPos(pos, target)[source]

Set positive examples based on the contents of a list.

setTarget(bk, target)[source]

Sets self.target as a target string and sets self.variableType.

Parameters:
  • bk (list.) – List of strings representing modes.
  • target (str.) – Target relation or attribute.
Returns: None

Example:

from rfgb.utils import Data

data = Data(regression=False)
background = ['friends(+person,-person)',
              'friends(-person,+person)',
              'smokes(+person)',
              'cancer(-person)']
target = 'cancer'

data.setTarget(background, target)

print(data.target)
# 'cancer(C)'

variance(examples)[source]

Calculates the variance of the regression values from a subset of the data.

class rfgb.utils.Utils[source]

Bases: object

Class of utilities used by rfgb, such as reading files, removing mode symbols, calculating Cartesian Products, etc.

UniqueVariableCollection = {'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z'}
static addVariableTypes(literal)[source]

As literals are encountered, update Utils.data.variableType with the type of the variables encountered.

Parameters: literal (str.) – A literal of the form smokes(W) or friends(A,B)

static cartesianProduct(itemSets)[source]

Returns the Cartesian Product of all sets contained in the item sets.
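
As a reminder of what a Cartesian product over several sets looks like, here is a minimal sketch built on the standard library. The cartesian_product helper is hypothetical, and the exact return type of rfgb's method is not documented here, so the tuples below are an assumption.

```python
from itertools import product


def cartesian_product(item_sets):
    """Return every combination that takes one item from each set, as tuples."""
    return list(product(*item_sets))


print(cartesian_product([["a", "b"], ["x", "y"]]))
```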

data = None
static getleafValue(examples)[source]

Returns the average of the regression values for examples.

static load(location)[source]

Loads JSON version of learnedDecisionTree from location.

Parameters: location (str.) – Name of the file to load.
Returns: None.

static readTestData(target, path='test/', regression=False)[source]

Reads the testing data from files.

Parameters:
  • target (str.) – The target predicate.
  • path (str.) – Path to the testing data.
  • regression (bool) – Read from examples.txt instead of pos.txt and neg.txt.
Default path: ‘test/’
Default regression: False
Returns: A Data object representing the testing data.
Return type: utils.Data

static readTrainingData(target, path='train/', regression=False, advice=False, softm=False, alpha=0.0, beta=0.0)[source]

Reads the training data from files.

Parameters:
  • target (str.) – The target predicate.
  • path (str.) – Path to the training data.
  • regression (bool) – Read from examples.txt instead of pos.txt and neg.txt.
  • advice (bool) – Read advice from an advice file, which should be contained in the same directory as the examples.
Default path: ‘train/’
Default regression: False
Default advice: False
Returns: A Data object representing the training data.
Return type: utils.Data

static removeModeSymbols(inputString)[source]

Returns a string with the mode symbols (+,-,#) removed.

Example:

from rfgb.utils import Utils

# removeModeSymbols is a static method, so call it through the Utils class.
Utils.removeModeSymbols('#city')
# == 'city'

i = ['+drinks', '-drink', '-city']
o = list(map(Utils.removeModeSymbols, i))
# o == ['drinks', 'drink', 'city']

static save(location, saveItem)[source]

Dumps JSON version of learnedDecisionTree to location.

Parameters: location (str.) – Name of the file to write.
Returns: None.
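
A JSON round trip of this kind can be sketched with the standard library. The save_json and load_json helpers are hypothetical; unlike the documented load, which returns None, this sketch returns the decoded object, and the tree strings are placeholders rather than real rfgb output.

```python
import json
import os
import tempfile


def save_json(location, save_item):
    """Write save_item to location as JSON."""
    with open(location, "w") as handle:
        json.dump(save_item, handle)


def load_json(location):
    """Read JSON from location and return the decoded object."""
    with open(location) as handle:
        return json.load(handle)


# Placeholder strings standing in for learned trees.
trees = ["tree-0", "tree-1"]

with tempfile.TemporaryDirectory() as directory:
    location = os.path.join(directory, "trees.json")
    save_json(location, trees)
    restored = load_json(location)

print(restored == trees)
```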
static sigmoid(x)[source]

Parameters: x (int or float) – Number to apply sigmoid to.
Returns: exp(x)/float(1+exp(x))
Return type: float
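
The formula exp(x)/(1+exp(x)) overflows in Python for large positive x, because math.exp raises OverflowError. The sketch below shows one common way to evaluate the same function safely. It is an independent illustration, not a description of how rfgb.utils.Utils.sigmoid is written.

```python
from math import exp


def sigmoid(x):
    """Logistic function, equal to exp(x) / (1 + exp(x))."""
    if x >= 0:
        # exp(-x) cannot overflow for non-negative x.
        return 1.0 / (1.0 + exp(-x))
    # For negative x, exp(x) is below 1, so this form is safe.
    z = exp(x)
    return z / (1.0 + z)


print(sigmoid(0))      # 0.5
print(sigmoid(1000))   # 1.0
print(sigmoid(-1000))  # 0.0
```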

rfgb.rdn package

New in version 0.3.0.

Learn and infer with relational dependency networks.

# Example script for performing learning and inference.

from rfgb import rdn

# rdn.learn requires a list of targets as strings.
trees = rdn.learn(['cancer'], path='testDomains/ToyCancer/train/')

# rdn.learn returns a dictionary mapping targets to trees.
cancer_trees = trees['cancer']

# rdn.infer classification returns a tuple of pos and neg.
results = rdn.infer('cancer', cancer_trees, path='testDomains/ToyCancer/test/')

# ({'cancer(xena)': 0.34460796550872186,
#   'cancer(yoda)': 0.34460796550872186,
#   'cancer(zod)': 0.34460796550872186},
#  {'cancer(watson)': 0.34460796550872186,
#   'cancer(voldemort)': 0.34460796550872186})
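
One way to summarize a result of this shape is to threshold each value and compare it with the example's label. The sketch below assumes each value is the predicted probability that the target is true. The accuracy helper and the 0.5 threshold are hypothetical choices and are not part of rfgb.

```python
# Results in the shape shown above: (positive examples, negative examples).
results = (
    {"cancer(xena)": 0.34460796550872186,
     "cancer(yoda)": 0.34460796550872186,
     "cancer(zod)": 0.34460796550872186},
    {"cancer(watson)": 0.34460796550872186,
     "cancer(voldemort)": 0.34460796550872186},
)


def accuracy(results, threshold=0.5):
    """Fraction of examples whose thresholded probability matches its label."""
    positives, negatives = results
    correct = sum(1 for p in positives.values() if p >= threshold)
    correct += sum(1 for p in negatives.values() if p < threshold)
    return correct / (len(positives) + len(negatives))


print(accuracy(results))  # 0.4
```

Here every example falls below the threshold, so only the two negative examples are counted as correct.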

rfgb.rdn.learn module

rfgb.rdn.learn.learn(targets, numTrees=10, path='', regression=False, advice=False, softm=False, alpha=0.0, beta=0.0, saveJson=True)[source]

New in version 0.3.0.

Learn a relational dependency network from facts and positive/negative examples via relational regression trees.

Note

This currently requires that training data is stored as files on disk.

Parameters:
  • targets (list of str.) – List of target predicates to learn models for.
  • numTrees (int.) – Number of trees to learn.
  • path (str.) – Path to the location where training data is stored.
  • regression (bool.) – Learn a regression model instead of classification.
  • advice (bool.) – Read an advice file from the same directory as path.
Default regression: False
Default advice: False
Returns: Dictionary where the key is the target and the value is the set of trees returned for that target.
Return type: dict.

rfgb.rdn.infer module

rfgb.rdn.infer.infer(target, trees, path='', regression=False)[source]

New in version 0.3.0.

Perform inference on data with a set of trees.

Note

This currently requires that test data is stored as files on disk.

Parameters:
  • target (str.) – The target predicate.
  • trees (list of str.) – Trees to perform inference with.
  • path (str.) – Path to the location where test data is stored.
  • regression (bool.) – Infer with a regression model instead of classification.
Default regression: False
Returns: Tuple of results. In classification, this is a tuple of the positive and negative examples. In regression, this is the examples.
Return type: tuple
