dit: discrete information theory

dit is a Python package for discrete information theory.

For a quick tour, see the Quickstart. Otherwise, work your way through the various sections. Note that all code snippets in this documentation assume that the following lines of code have already been run:

In [1]: from __future__ import division # true division for Python 2.7

In [2]: import dit

In [3]: import numpy as np

Contents:

General Information

Documentation:
http://docs.dit.io
Downloads:
Coming soon.
Dependencies:
  • Python 2.7, 3.2, 3.3, or 3.4
  • numpy
  • iterutils
  • six
  • contextlib2
  • prettytable
  • networkx
Optional Dependencies:
  • cython
Install:

Until dit is available on PyPI, the easiest way to install is:

pip install git+https://github.com/dit/dit/#egg=dit

Alternatively, you can clone this repository, move into the newly created dit directory, and then install the package:

git clone https://github.com/dit/dit.git
cd dit
pip install .
Mailing list:
None
Code and bug tracker:
https://github.com/dit/dit
License:
BSD 2-Clause, see LICENSE.txt for details.

Quickstart

The basic usage of dit corresponds to creating distributions, modifying them if need be, and then computing properties of those distributions. First, we import:

In [1]: import dit

Suppose we have a really thick coin, one so thick that there is a reasonable chance of it landing on its edge. Here is how we might represent the coin in dit.

In [2]: d = dit.Distribution(['H', 'T', 'E'], [.4, .4, .2])

In [3]: print(d)
Class:          Distribution
Alphabet:       ('E', 'H', 'T') for all rvs
Base:           linear
Outcome Class:  str
Outcome Length: 1
RV Names:       None

x   p(x)
E   0.2
H   0.4
T   0.4

Calculate the probability of \(H\) and also of the combination: \(H~\mathbf{or}~T\).

In [4]: d['H']
Out[4]: 0.40000000000000002

In [5]: d.event_probability(['H','T'])
Out[5]: 0.80000000000000004

Calculate the Shannon entropy and extropy of the joint distribution.

In [6]: dit.shannon.entropy(d)
Out[6]: 1.5219280948873621

In [7]: dit.other.extropy(d)
Out[7]: 1.1419011889093373

Create a distribution representing the \(\mathbf{xor}\) logic function. Here, we have two inputs, \(X\) and \(Y\), and then an output \(Z = \mathbf{xor}(X,Y)\).

In [8]: import dit.example_dists

In [9]: d = dit.example_dists.Xor()

In [10]: d.set_rv_names(['X', 'Y', 'Z'])

In [11]: print(d)
Class:          Distribution
Alphabet:       ('0', '1') for all rvs
Base:           linear
Outcome Class:  str
Outcome Length: 3
RV Names:       ('X', 'Y', 'Z')

x     p(x)
000   0.25
011   0.25
101   0.25
110   0.25

Calculate the Shannon mutual informations \(\I[X:Z]\), \(\I[Y:Z]\), and \(\I[X,Y:Z]\).

In [12]: dit.shannon.mutual_information(d, ['X'], ['Z'])
Out[12]: 0.0

In [13]: dit.shannon.mutual_information(d, ['Y'], ['Z'])
Out[13]: 0.0

In [14]: dit.shannon.mutual_information(d, ['X', 'Y'], ['Z'])
Out[14]: 1.0

Calculate the marginal distribution \(P(X,Z)\). Then print its probabilities as fractions, showing the mask.

In [15]: d2 = d.marginal(['X', 'Z'])

In [16]: print(d2.to_string(show_mask=True, exact=True))
Class:          Distribution
Alphabet:       ('0', '1') for all rvs
Base:           linear
Outcome Class:  str
Outcome Length: 2 (mask: 3)
RV Names:       ('X', 'Z')

x     p(x)
0*0   1/4
0*1   1/4
1*0   1/4
1*1   1/4

Convert the distribution probabilities to log (base 3.5) probabilities, and access its probability mass function.

In [17]: d2.set_base(3.5)

In [18]: d2.pmf
Out[18]: array([-1.10658951, -1.10658951, -1.10658951, -1.10658951])

Draw 5 random samples from this distribution.

In [19]: d2.rand(5)
Out[19]: ['01', '10', '00', '01', '00']

Enjoy!

Notation

dit is a scientific tool, and so much of this documentation contains mathematical expressions. Here we describe the notation used throughout.

Basic Notation

A random variable \(X\) consists of outcomes \(x\) from an alphabet \(\mathcal{X}\). As such, we write the entropy of a distribution as \(\H[X] = -\sum_{x \in \mathcal{X}} p(x) \log_2 p(x)\), where \(p(x)\) denotes the probability of the outcome \(x\) occurring.

Many distributions are joint distributions. In the absence of variable names, we index each random variable with a subscript. For example, a distribution over three variables is written \(X_0X_1X_2\). As a shorthand, we also denote those random variables as \(X_{0:3}\), meaning start with \(X_0\) and go through, but not including, \(X_3\) — just like Python slice notation.

If we ever need to describe an infinitely long chain of variables we drop the index from the side that is infinite. So \(X_{:0} = \ldots X_{-3}X_{-2}X_{-1}\) and \(X_{0:} = X_0X_1X_2\ldots\). For an arbitrary set of indices \(A\), the corresponding collection of random variables is denoted \(X_A\). For example, if \(A = \{0,2,4\}\), then \(X_A = X_0 X_2 X_4\). The complement of \(A\) (with respect to some universal set) is denoted \(\bar{A}\).

Furthermore, we define \(0 \log_2 0 = 0\).

Advanced Notation

When there exists a function \(Y = f(X)\) we write \(X \imore Y\), meaning that \(X\) is informationally richer than \(Y\). Similarly, if \(f(Y) = X\) then we write \(X \iless Y\) and say that \(X\) is informationally poorer than \(Y\). If \(X \iless Y\) and \(X \imore Y\) then we write \(X \ieq Y\) and say that \(X\) is informationally equivalent to \(Y\). Of all the variables that are poorer than both \(X\) and \(Y\), there is a richest one. This variable is known as the meet of \(X\) and \(Y\) and is denoted \(X \meet Y\). By definition, for every \(Z\) such that \(Z \iless X\) and \(Z \iless Y\), we have \(Z \iless X \meet Y\). Similarly, of all variables richer than both \(X\) and \(Y\), there is a poorest. This variable is known as the join of \(X\) and \(Y\) and is denoted \(X \join Y\). The joint random variable \((X,Y)\) and the join are informationally equivalent: \((X,Y) \ieq X \join Y\).

Lastly, we use \(X \mss Y\) to denote the minimal sufficient statistic of \(X\) about the random variable \(Y\).

Distributions

Here we describe how to create, modify, and manipulate distribution objects.

Numpy-based ScalarDistribution

This section covers functionality specific to ScalarDistribution, the distribution class for scalar (non-joint) outcomes.

ScalarDistribution.__init__(outcomes, pmf=None, sample_space=None, base=None, prng=None, sort=True, sparse=True, trim=True, validate=True)

Initialize the distribution.

Parameters:
  • outcomes (sequence, dict) – The outcomes of the distribution. If outcomes is a dictionary, then the keys are used as outcomes, and the values of the dictionary are used as pmf instead. Note: an outcome is any hashable object (except None) which is equality comparable. If sort is True, then outcomes must also be orderable.
  • pmf (sequence) – The outcome probabilities or log probabilities. If None, then outcomes is treated as the probability mass function and the outcomes are consecutive integers beginning from zero.
  • sample_space (sequence) – A sequence representing the sample space, and corresponding to the complete set of possible outcomes. The order of the sample space is important. If None, then the outcomes are used to determine the sample space instead.
  • base (float, None) – If pmf specifies log probabilities, then base should specify the base of the logarithm. If ‘linear’, then pmf is assumed to represent linear probabilities. If None, then the value for base is taken from ditParams[‘base’].
  • prng (RandomState) – A pseudo-random number generator with a rand method which can generate random numbers. For now, this is assumed to be something with an API compatible to NumPy’s RandomState class. This attribute is initialized to equal dit.math.prng.
  • sort (bool) – If True, then the sample space is sorted before finalizing it. Usually, this is desirable, as it normalizes the behavior of distributions which have the same sample space (when considered as a set). Note that addition and multiplication of distributions is defined only if the sample spaces (as tuples) are equal.
  • sparse (bool) – Specifies the form of the pmf. If True, then outcomes and pmf will only contain entries for non-null outcomes and probabilities, after initialization. The order of these entries will always obey the order of sample_space, even if their number is not equal to the size of the sample space. If False, then the pmf will be dense and every outcome in the sample space will be represented.
  • trim (bool) – Specifies if null-outcomes should be removed from pmf when make_sparse() is called (assuming sparse is True) during initialization.
  • validate (bool) – If True, then validate the distribution. If False, then assume the distribution is valid, and perform no checks.
Raises:
  • InvalidDistribution – If the length of values and outcomes are unequal.
  • See validate() for a list of other potential exceptions.
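
As a quick illustration of the constructor documented above, here is a minimal sketch that builds a ScalarDistribution from parallel sequences and from a dictionary. The entropy call comes from the Quickstart; the .outcomes attribute is used only for inspection.

>>> # a fair six-sided die, outcomes and pmf given as parallel sequences
>>> d = dit.ScalarDistribution([1, 2, 3, 4, 5, 6], [1/6]*6)
>>> round(dit.shannon.entropy(d), 3)   # log2(6) bits
2.585
>>> # the Quickstart's thick coin, this time built from a dict of outcome: probability pairs
>>> d2 = dit.ScalarDistribution({'H': 0.4, 'T': 0.4, 'E': 0.2})
>>> list(d2.outcomes)                  # sorted, since sort=True by default
['E', 'H', 'T']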

Numpy-based Distribution

The primary method of constructing a distribution is by supplying both the outcomes and the probability mass function:

In [1]: from dit import Distribution

In [2]: outcomes = ['000', '011', '101', '110']

In [3]: pmf = [1/4]*4

In [4]: xor = Distribution(outcomes, pmf)

In [5]: print(xor)
Class:          Distribution
Alphabet:       ('0', '1') for all rvs
Base:           linear
Outcome Class:  str
Outcome Length: 3
RV Names:       None

x     p(x)
000   0.25
011   0.25
101   0.25
110   0.25

Another way to construct a distribution is by supplying a dictionary mapping outcomes to probabilities:

In [6]: outcomes_probs = {'000': 1/4, '011': 1/4, '101': 1/4, '110': 1/4}

In [7]: xor2 = Distribution(outcomes_probs)

In [8]: print(xor2)
Class:          Distribution
Alphabet:       ('0', '1') for all rvs
Base:           linear
Outcome Class:  str
Outcome Length: 3
RV Names:       None

x     p(x)
000   0.25
011   0.25
101   0.25
110   0.25
Distribution.__init__(outcomes, pmf=None, sample_space=None, base=None, prng=None, sort=True, sparse=True, trim=True, validate=True)

Initialize the distribution.

Parameters:
  • outcomes (sequence, dict) – The outcomes of the distribution. If outcomes is a dictionary, then the keys are used as outcomes, and the values of the dictionary are used as pmf instead. The values will not be used if probabilities are passed in via pmf. Outcomes must be hashable, orderable, sized, iterable containers. The length of an outcome must be the same for all outcomes, and every outcome must be of the same type.
  • pmf (sequence, None) – The outcome probabilities or log probabilities. pmf can be None only if outcomes is a dict.
  • sample_space (sequence, CartesianProduct) – A sequence representing the sample space, and corresponding to the complete set of possible outcomes. The order of the sample space is important. If None, then the outcomes are used to determine a Cartesian product sample space instead.
  • base (float, None) – If pmf specifies log probabilities, then base should specify the base of the logarithm. If ‘linear’, then pmf is assumed to represent linear probabilities. If None, then the value for base is taken from ditParams[‘base’].
  • prng (RandomState) – A pseudo-random number generator with a rand method which can generate random numbers. For now, this is assumed to be something with an API compatible to NumPy’s RandomState class. This attribute is initialized to equal dit.math.prng.
  • sort (bool) – If True, then each random variable’s alphabets are sorted before they are finalized. Usually, this is desirable, as it normalizes the behavior of distributions which have the same sample spaces (when considered as a set). Note that addition and multiplication of distributions is defined only if the sample spaces are compatible.
  • sparse (bool) – Specifies the form of the pmf. If True, then outcomes and pmf will only contain entries for non-null outcomes and probabilities, after initialization. The order of these entries will always obey the order of sample_space, even if their number is not equal to the size of the sample space. If False, then the pmf will be dense and every outcome in the sample space will be represented.
  • trim (bool) – Specifies if null-outcomes should be removed from pmf when make_sparse() is called (assuming sparse is True) during initialization.
  • validate (bool) – If True, then validate the distribution. If False, then assume the distribution is valid, and perform no checks.
Raises:
  • InvalidDistribution – If the length of values and outcomes are unequal. If no outcomes can be obtained from pmf and outcomes is None.
  • See validate() for a list of other potential exceptions.

To verify that these two distributions are the same, we can use the is_approx_equal method:

In [9]: xor.is_approx_equal(xor2)
Out[9]: True
Distribution.is_approx_equal(other, rtol=None, atol=None)

Returns True if other is approximately equal to this distribution.

For two distributions to be equal, they must have the same sample space and must also agree on the probabilities of each outcome.

Parameters:
  • other (distribution) – The distribution to compare against.
  • rtol (float) – The relative tolerance to use when comparing probabilities. See dit.math.close() for more information.
  • atol (float) – The absolute tolerance to use when comparing probabilities. See dit.math.close() for more information.

Notes

The distributions need not have the same length, but they must have the same base.

Operations

There are several operations possible on joint random variables.

Marginal

Distribution.marginal(rvs, rv_mode=None)

Returns a marginal distribution.

Parameters:
  • rvs (list) – The random variables to keep. All others are marginalized.
  • rv_mode (str, None) – Specifies how to interpret the elements of rvs. Valid options are: {‘indices’, ‘names’}. If equal to ‘indices’, then the elements of rvs are interpreted as random variable indices. If equal to ‘names’, then the elements are interpreted as random variable names. If None, then the value of self._rv_mode is consulted.
Returns:

d – A new joint distribution with the random variables in rvs kept and all others marginalized.

Return type:

joint distribution
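
As a brief sketch using the xor distribution from the Quickstart: marginalizing out the output Z leaves just the two inputs, which are independent and uniform.

>>> d = dit.example_dists.Xor()
>>> d.set_rv_names(['X', 'Y', 'Z'])
>>> pXY = d.marginal(['X', 'Y'])    # keep X and Y; Z is marginalized out
>>> dit.shannon.mutual_information(pXY, ['X'], ['Y'])
0.0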

Conditional

Distribution.condition_on(crvs, rvs=None, rv_mode=None, extract=False)

Returns distributions conditioned on random variables crvs.

Optionally, rvs specifies which random variables should remain.

NOTE: Eventually this will return a conditional distribution.

Parameters:
  • crvs (list) – The random variables to condition on.
  • rvs (list, None) – The random variables for the resulting conditional distributions. Any random variable not represented in the union of crvs and rvs will be marginalized. If None, then every random variable not appearing in crvs is used.
  • rv_mode (str, None) – Specifies how to interpret crvs and rvs. Valid options are: {‘indices’, ‘names’}. If equal to ‘indices’, then the elements of crvs and rvs are interpreted as random variable indices. If equal to ‘names’, then the elements are interpreted as random variable names. If None, then the value of self._rv_mode is consulted, which defaults to ‘indices’.
  • extract (bool) – If the length of either crvs or rvs is 1 and extract is True, then instead of the new outcomes being 1-tuples, we extract the sole element to create scalar distributions.
Returns:

  • cdist (dist) – The distribution of the conditioned random variables.
  • dists (list of distributions) – The conditional distributions for each outcome in cdist.

Examples

First we build a distribution P(X,Y,Z) representing the XOR logic gate.

>>> pXYZ = dit.example_dists.Xor()
>>> pXYZ.set_rv_names('XYZ')

We can obtain the conditional distributions P(X,Z|Y) and the marginal of the conditioned variable P(Y) as follows:

>>> pY, pXZgY = pXYZ.condition_on('Y')

If we specify rvs='Z', then only ‘Z’ is kept and thus, ‘X’ is marginalized out:

>>> pY, pZgY = pXYZ.condition_on('Y', rvs='Z')

We can condition on two random variables:

>>> pXY, pZgXY = pXYZ.condition_on('XY')

The equivalent call using indexes is:

>>> pXY, pZgXY = pXYZ.condition_on([0, 1], rv_mode='indexes')

Join

We can construct the join of two random variables:

\[X \join Y = \min \{ V | V \imore X \land V \imore Y \}\]

Where \(\min\) is understood to be minimizing with respect to the entropy.

join(dist, rvs, rv_mode=None, int_outcomes=True)

Returns the distribution of the join of random variables defined by rvs.

Parameters:
  • dist (Distribution) – The distribution which defines the base sigma-algebra.
  • rvs (list) – A list of lists. Each list specifies a random variable to be joined with the other lists. Each random variable can be defined as a series of unique indexes. Multiple random variables can use the same index. For example, [[0,1],[1,2]].
  • rv_mode (str, None) – Specifies how to interpret the elements of rvs. Valid options are: {‘indices’, ‘names’}. If equal to ‘indices’, then the elements of rvs are interpreted as random variable indices. If equal to ‘names’, then the elements are interpreted as random variable names. If None, then the value of dist._rv_mode is consulted.
  • int_outcomes (bool) – If True, then the outcomes of the join are relabeled as integers instead of as the atoms of the induced sigma-algebra.
Returns:

d – The distribution of the join.

Return type:

ScalarDistribution
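
A small sketch of the join in action, assuming join can be imported from dit.algorithms just as insert_meet is later in these docs. For the xor distribution the two inputs are independent, so their join carries the full two bits of the input pair.

>>> from dit.algorithms import join   # assumed import location
>>> d = dit.example_dists.Xor()
>>> j = join(d, [[0], [1]])           # join of the two inputs
>>> dit.shannon.entropy(j)            # H[X0 join X1] = H[X0, X1]
2.0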

insert_join(dist, idx, rvs, rv_mode=None)

Returns a new distribution with the join inserted at index idx.

The join of the random variables in rvs is constructed and then inserted into the distribution at index idx.

Parameters:
  • dist (Distribution) – The distribution which defines the base sigma-algebra.
  • idx (int) – The index at which to insert the join. To append the join, set idx to be equal to -1 or dist.outcome_length().
  • rvs (list) – A list of lists. Each list specifies a random variable to be joined with the other lists. Each random variable can be defined as a series of unique indexes. Multiple random variables can use the same index. For example, [[0,1],[1,2]].
  • rv_mode (str, None) – Specifies how to interpret the elements of rvs. Valid options are: {‘indices’, ‘names’}. If equal to ‘indices’, then the elements of rvs are interpreted as random variable indices. If equal to ‘names’, then the elements are interpreted as random variable names. If None, then the value of dist._rv_mode is consulted.
Returns:

d – The new distribution with the join at index idx.

Return type:

Distribution

Meet

We can construct the meet of two random variables:

\[X \meet Y = \max \{ V | V \iless X \land V \iless Y \}\]

Where \(\max\) is understood to be maximizing with respect to the entropy.

meet(dist, rvs, rv_mode=None, int_outcomes=True)

Returns the distribution of the meet of random variables defined by rvs.

Parameters:
  • dist (Distribution) – The distribution which defines the base sigma-algebra.
  • rvs (list) – A list of lists. Each list specifies a random variable to be met with the other lists. Each random variable can be defined as a series of unique indexes. Multiple random variables can use the same index. For example, [[0,1],[1,2]].
  • rv_mode (str, None) – Specifies how to interpret the elements of rvs. Valid options are: {‘indices’, ‘names’}. If equal to ‘indices’, then the elements of rvs are interpreted as random variable indices. If equal to ‘names’, then the elements are interpreted as random variable names. If None, then the value of dist._rv_mode is consulted.
  • int_outcomes (bool) – If True, then the outcomes of the meet are relabeled as integers instead of as the atoms of the induced sigma-algebra.
Returns:

d – The distribution of the meet.

Return type:

ScalarDistribution
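
Mirroring the join sketch above, and again assuming meet is importable from dit.algorithms: the meet of the two independent xor inputs is a degenerate (constant) variable.

>>> from dit.algorithms import meet   # assumed import location
>>> d = dit.example_dists.Xor()
>>> m = meet(d, [[0], [1]])           # meet of the two independent inputs
>>> dit.shannon.entropy(m) == 0       # the meet carries no information
True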

insert_meet(dist, idx, rvs, rv_mode=None)

Returns a new distribution with the meet inserted at index idx.

The meet of the random variables in rvs is constructed and then inserted into the distribution at index idx.

Parameters:
  • dist (Distribution) – The distribution which defines the base sigma-algebra.
  • idx (int) – The index at which to insert the meet. To append the meet, set idx to be equal to -1 or dist.outcome_length().
  • rvs (list) – A list of lists. Each list specifies a random variable to be met with the other lists. Each random variable can be defined as a series of unique indexes. Multiple random variables can use the same index. For example, [[0,1],[1,2]].
  • rv_mode (str, None) – Specifies how to interpret the elements of rvs. Valid options are: {‘indices’, ‘names’}. If equal to ‘indices’, then the elements of rvs are interpreted as random variable indices. If equal to ‘names’, then the elements are interpreted as random variable names. If None, then the value of dist._rv_mode is consulted.
Returns:

d – The new distribution with the meet at index idx.

Return type:

Distribution

Minimal Sufficient Statistic

This function constructs the minimal sufficient statistic of \(X\) about \(Y\): \(X \mss Y\):

\[X \mss Y = \min \{ V | V \iless X \land \I[X:Y] = \I[V:Y] \}\]

Again, \(\min\) is understood to be over entropies.

mss(dist, rvs, about=None, rv_mode=None, int_outcomes=True)
Parameters:
  • dist (Distribution) – The distribution which defines the base sigma-algebra.
  • rvs (list) – A list of random variables to be compressed into a minimal sufficient statistic.
  • about (list) – A list of random variables about which the minimal sufficient statistic will retain all information.
  • rv_mode (str, None) – Specifies how to interpret the elements of rvs. Valid options are: {‘indices’, ‘names’}. If equal to ‘indices’, then the elements of rvs are interpreted as random variable indices. If equal to ‘names’, then the elements are interpreted as random variable names. If None, then the value of dist._rv_mode is consulted.
  • int_outcomes (bool) – If True, then the outcomes of the minimal sufficient statistic are relabeled as integers instead of as the atoms of the induced sigma-algebra.
Returns:

d – The distribution of the minimal sufficient statistic.

Return type:

ScalarDistribution

Examples

>>> d = Xor()
>>> print(mss(d, [0], [1, 2]))
Class:    ScalarDistribution
Alphabet: (0, 1)
Base:     linear
x   p(x)
0   0.5
1   0.5
insert_mss(dist, idx, rvs, about=None, rv_mode=None)

Inserts the minimal sufficient statistic of rvs about about into dist at index idx.

Parameters:
  • dist (Distribution) – The distribution which defines the base sigma-algebra.
  • idx (int) – The location in the distribution to insert the minimal sufficient statistic.
  • rvs (list) – A list of random variables to be compressed into a minimal sufficient statistic.
  • about (list) – A list of random variables about which the minimal sufficient statistic will retain all information.
  • rv_mode (str, None) – Specifies how to interpret the elements of rvs. Valid options are: {‘indices’, ‘names’}. If equal to ‘indices’, then the elements of rvs are interpreted as random variable indices. If equal to ‘names’, then the elements are interpreted as random variable names. If None, then the value of dist._rv_mode is consulted.
Returns:

d – The distribution dist modified to contain the minimal sufficient statistic.

Return type:

Distribution

Examples

>>> d = Xor()
>>> print(insert_mss(d, -1, [0], [1, 2]))
Class:          Distribution
Alphabet:       ('0', '1') for all rvs
Base:           linear
Outcome Class:  str
Outcome Length: 4
RV Names:       None
x      p(x)
0000   0.25
0110   0.25
1011   0.25
1101   0.25

Information Measures

dit supports many information measures, ranging from as standard as the Shannon entropy to as exotic as the Gács-Körner common information (with even more esoteric measures coming soon!). We organize these quantities into the following groups.

We first have the Shannon-like measures. These quantities are based on sums and differences of entropies, conditional entropies, or mutual informations of random variables:

Basic Shannon measures

The information on this page is drawn from the fantastic textbook Elements of Information Theory by Cover and Thomas [CT06]. Other good choices are Information Theory, Inference and Learning Algorithms by MacKay [Mac03] and Information Theory and Network Coding by Yeung [Yeu08].

Entropy

The entropy measures how much information is in a random variable \(X\).

\[\H[X] = -\sum_{x \in \mathcal{X}} p(x) \log_2 p(x)\]

What do we mean by “how much information”? Basically, we mean the average number of yes-no questions one would have to ask to determine an outcome from the distribution. In the simplest case, consider a sure thing:

In [1]: d = dit.Distribution(['H'], [1])

In [2]: dit.shannon.entropy(d)
Out[2]: 0.0

So since we know that the outcome from our distribution will always be H, we have to ask zero questions to figure that out. If however we have a fair coin:

In [3]: d = dit.Distribution(['H', 'T'], [1/2, 1/2])

In [4]: dit.shannon.entropy(d)
Out[4]: 1.0

The entropy tells us that we must ask one question to determine whether an H or T was the outcome of the coin flip. Now what if there are three outcomes? Let’s consider the following situation:

In [5]: d = dit.Distribution(['A', 'B', 'C'], [1/2, 1/4, 1/4])

In [6]: dit.shannon.entropy(d)
Out[6]: 1.5

Here we find that the entropy is 1.5 bits. How do we ask one and a half questions on average? Well, if our first question is “was it A?” and it is true, then we are done, and that occurs half the time. The other half of the time we need to ask a follow up question: “was it B?”. So half the time we need to ask one question, and the other half of the time we need to ask two questions. In other words, we need to ask 1.5 questions on average.

Joint Entropy

The entropy of multiple variables is computed in a similar manner:

\[\H[X_{0:n}] = -\sum_{x_{0:n} \in X_{0:n}} p(x_{0:n}) \log_2 p(x_{0:n})\]

Its intuition is also the same: the average number of binary questions required to identify a joint event from the distribution.
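
For instance, a quick sketch: two independent fair coins require two yes-no questions on average, while a second coin that merely copies the first adds no questions at all.

>>> two_coins = dit.Distribution(['HH', 'HT', 'TH', 'TT'], [1/4]*4)
>>> dit.shannon.entropy(two_coins)   # two independent questions
2.0
>>> copycat = dit.Distribution(['HH', 'TT'], [1/2, 1/2])
>>> dit.shannon.entropy(copycat)     # the second coin adds nothing new
1.0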

API
entropy(dist, rvs=None, rv_mode=None)

Returns the entropy H[X] over the random variables in rvs.

If the distribution represents linear probabilities, then the entropy is calculated with units of ‘bits’ (base-2). Otherwise, the entropy is calculated in whatever base matches the distribution’s pmf.

Parameters:
  • dist (Distribution or float) – The distribution from which the entropy is calculated. If a float, then we calculate the binary entropy.
  • rvs (list, None) – The indexes of the random variable used to calculate the entropy. If None, then the entropy is calculated over all random variables. This should remain None for ScalarDistributions.
  • rv_mode (str, None) – Specifies how to interpret the elements of rvs. Valid options are: {‘indices’, ‘names’}. If equal to ‘indices’, then the elements of rvs are interpreted as random variable indices. If equal to ‘names’, then the elements are interpreted as random variable names. If None, then the value of dist._rv_mode is consulted.
Returns:

H – The entropy of the distribution.

Return type:

float
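
As noted in the dist parameter above, passing a plain float computes the binary entropy of a coin with that bias; a short sketch:

>>> dit.shannon.entropy(0.5)              # a fair coin
1.0
>>> round(dit.shannon.entropy(0.25), 3)   # a biased coin
0.811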

Conditional Entropy

The conditional entropy is the amount of information in variable \(X\) beyond that which is in variable \(Y\):

\[\H[X|Y] = -\sum_{x \in X, y \in Y} p(x, y) \log_2 p(x|y)\]

As a simple example, consider two identical variables:

In [7]: d = dit.Distribution(['HH', 'TT'], [1/2, 1/2])

In [8]: dit.shannon.conditional_entropy(d, [0], [1])
Out[8]: 0.0

We see that knowing the second variable tells us everything about the first, leaving zero entropy. On the other end of the spectrum, two independent variables:

In [9]: d = dit.Distribution(['HH', 'HT', 'TH', 'TT'], [1/4]*4)

In [10]: dit.shannon.conditional_entropy(d, [0], [1])
Out[10]: 1.0

Here, the second variable tells us nothing about the first so we are left with the one bit of information a coin flip has.

API
conditional_entropy(dist, rvs_X, rvs_Y, rv_mode=None)

Returns the conditional entropy H[X|Y].

If the distribution represents linear probabilities, then the entropy is calculated with units of ‘bits’ (base-2).

Parameters:
  • dist (Distribution) – The distribution from which the conditional entropy is calculated.
  • rvs_X (list, None) – The indexes of the random variables defining X.
  • rvs_Y (list, None) – The indexes of the random variables defining Y.
  • rv_mode (str, None) – Specifies how to interpret the elements of rvs_X and rvs_Y. Valid options are: {‘indices’, ‘names’}. If equal to ‘indices’, then the elements of rvs_X and rvs_Y are interpreted as random variable indices. If equal to ‘names’, then the elements are interpreted as random variable names. If None, then the value of dist._rv_mode is consulted.
Returns:

H_XgY – The conditional entropy H[X|Y].

Return type:

float

Mutual Information

The mutual information is the amount of information shared by \(X\) and \(Y\):

\[\begin{split}\I[X:Y] &= \H[X,Y] - \H[X|Y] - \H[Y|X] \\ &= \H[X] + \H[Y] - \H[X,Y] \\ &= \sum_{x \in X, y \in Y} p(x, y) \log_2 \frac{p(x, y)}{p(x)p(y)}\end{split}\]

The mutual information is symmetric:

\[\I[X:Y] = \I[Y:X]\]

This means that the information \(X\) carries about \(Y\) is equal to the information \(Y\) carries about \(X\). The entropy of \(X\) can be decomposed into the information it shares with \(Y\) and the information it doesn’t:

\[\H[X] = \I[X:Y] + \H[X|Y]\]
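
This decomposition is easy to check numerically. Here is a small sketch using a pair of interdependent coins (the same construction appears in the multivariate entropy section below):

>>> d = dit.Distribution(['HH', 'HT', 'TT'], [1/4, 1/4, 1/2])
>>> lhs = dit.shannon.entropy(d, [0])
>>> rhs = (dit.shannon.mutual_information(d, [0], [1])
...        + dit.shannon.conditional_entropy(d, [0], [1]))
>>> abs(lhs - rhs) < 1e-10   # H[X] = I[X:Y] + H[X|Y]
True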

See also

The mutual information is generalized to the multivariate case in three different ways:

Co-Information
Generalized as the information which all variables contribute to.
Total Correlation
Generalized as the sum of the information in the individual variables minus the information in the whole.
Binding Information
Generalized as the joint entropy minus the entropy of each variable conditioned on the others.
API
mutual_information(dist, rvs_X, rvs_Y, rv_mode=None)

Returns the mutual information I[X:Y].

If the distribution represents linear probabilities, then the entropy is calculated with units of ‘bits’ (base-2).

Parameters:
  • dist (Distribution) – The distribution from which the mutual information is calculated.
  • rvs_X (list, None) – The indexes of the random variables defining X.
  • rvs_Y (list, None) – The indexes of the random variables defining Y.
  • rv_mode (str, None) – Specifies how to interpret the elements of rvs. Valid options are: {‘indices’, ‘names’}. If equal to ‘indices’, then the elements of rvs are interpreted as random variable indices. If equal to ‘names’, then the elements are interpreted as random variable names. If None, then the value of dist._rv_mode is consulted.
Returns:

I – The mutual information I[X:Y].

Return type:

float

Visualization of Information

It has been shown that there is a correspondence between set-theoretic measures and information-theoretic measures. The entropy is equivalent to set cardinality, mutual information to set intersection, and conditional entropy to set difference. Because of this we can use Venn-like diagrams to represent the information in and shared between random variables. These diagrams are called information diagrams or i-diagrams for short.

This first image pictographically shades the area of the i-diagram which contains the information corresponding to \(\H[X_0]\).

The entropy :math:`\H[X_0]`

Similarly, this one shades the information corresponding to \(\H[X_1]\).

The entropy :math:`\H[X_1]`

This image shades the information corresponding to \(\H[X_0, X_1]\). Notice that it is the union of the prior two, and not their sum (i.e., the overlap region is not double-counted).

The joint entropy :math:`\H[X_0,X_1]`

Next, the conditional entropy of \(X_0\) conditioned on \(X_1\), \(\H[X_0|X_1]\), is displayed. It consists of the area contained in the \(X_0\) circle but not contained in the \(X_1\) circle.

The conditional entropy :math:`\H[X_0|X_1]`

In the same vein, here the conditional entropy \(\H[X_1|X_0]\) is shaded.

The conditional entropy :math:`\H[X_1|X_0]`

Finally, the mutual information between \(X_0\) and \(X_1\), \(I[X_0:X_1]\) is drawn. It is the region where the two circles overlap.

The mutual information :math:`\I[X_0:X_1]`

Multivariate

Multivariate measures of information generally attempt to capture some global property of a joint distribution. For example, they might attempt to quantify how much information is shared among the random variables, or quantify how “non-independent” the joint distribution is.

Shannon Measures

The first batch are all linear combinations of atoms from an I-diagram, and are therefore computable using only the entropy of subsets of the distribution.

Entropy

The entropy measures the total amount of information contained in a set of random variables, \(X_{0:n}\), potentially excluding the information contained in others, \(Y_{0:m}\).

\[\begin{split}\H[X_{0:n} | Y_{0:m}] = -\sum_{\substack{x_{0:n} \in \mathcal{X}_{0:n} \\ y_{0:m} \in \mathcal{Y}_{0:m}}} p(x_{0:n}, y_{0:m}) \log_2 p(x_{0:n}|y_{0:m})\end{split}\]

Let’s consider two coins that are interdependent: the first coin flips fairly; if it comes up heads, the second coin is also fair, but if it comes up tails, the second coin is certainly tails:

In [1]: d = dit.Distribution(['HH', 'HT', 'TT'], [1/4, 1/4, 1/2])

We would expect the entropy of the second coin, conditioned on the first, to be \(0.5\) bits, and sure enough that is what we find:

In [2]: from dit.multivariate import entropy

In [3]: entropy(d, [1], [0])
Out[3]: 0.49999999999999989

And since the first coin is fair, we would expect it to have an entropy of \(1\) bit:

In [4]: entropy(d, [0])
Out[4]: 1.0

Taken together, we would then expect the joint entropy to be \(1.5\) bits:

In [5]: entropy(d)
Out[5]: 1.5
Visualization

Below we have a pictorial representation of the joint entropy for both 2-variable and 3-variable joint distributions.

The entropy :math:`\H[X,Y]`
The entropy :math:`\H[X,Y,Z]`
API
entropy(dist, rvs=None, crvs=None, rv_mode=None)

Compute the conditional joint entropy.

Parameters:
  • dist (Distribution) – The distribution from which the entropy is calculated.
  • rvs (list, None) – The indexes of the random variable used to calculate the entropy. If None, then the entropy is calculated over all random variables.
  • crvs (list, None) – The indexes of the random variables to condition on. If None, then no variables are conditioned on.
  • rv_mode (str, None) – Specifies how to interpret rvs and crvs. Valid options are: {‘indices’, ‘names’}. If equal to ‘indices’, then the elements of crvs and rvs are interpreted as random variable indices. If equal to ‘names’, then the elements are interpreted as random variable names. If None, then the value of dist._rv_mode is consulted, which defaults to ‘indices’.
Returns:

H – The entropy.

Return type:

float

Raises:

ditException – Raised if rvs or crvs contain non-existent random variables.

Examples

Let’s construct a 3-variable distribution for the XOR logic gate and name the random variables X, Y, and Z.

>>> d = dit.example_dists.Xor()
>>> d.set_rv_names(['X', 'Y', 'Z'])

The joint entropy H[X,Y,Z] is:

>>> dit.multivariate.entropy(d, 'XYZ')
2.0

We can do this using random variable indexes too.

>>> dit.multivariate.entropy(d, [0,1,2], rv_mode='indexes')
2.0

The joint entropy H[X,Z] is given by:

>>> dit.multivariate.entropy(d, 'XZ')
1.0

Conditional entropy can be calculated by passing in the conditional random variables. The conditional entropy H[Y|X] is:

>>> dit.multivariate.entropy(d, 'Y', 'X')
1.0
Co-Information

The co-information [Bel03] is one generalization of the mutual information to multiple variables. The co-information quantifies the amount of information that all variables participate in. It is defined via an inclusion/exclusion sum:

\[\begin{split}\I[X_{0:n}] &= -\sum_{y \in \mathcal{P}(\{0..n\})} (-1)^{|y|} \H[X_y] \\ &= \sum_{x_{0:n} \in X_{0:n}} p(x_{0:n}) \log_2 \prod_{y \in \mathcal{P}(\{0..n\})} p(x_y)^{(-1)^{|y|}}\end{split}\]

It is clear that the co-information measures the “center-most” atom of the diagram only, which is the only atom to which every variable contributes. To exemplify this, consider “giant bit” distributions:

In [1]: from dit import Distribution as D

In [2]: from dit.multivariate import coinformation as I

In [3]: [ I(D(['0'*n, '1'*n], [1/2, 1/2])) for n in range(2, 6) ]
Out[3]: [1.0, 1.0, 1.0, 1.0]

This verifies the intuition that the entire one bit of the distribution’s entropy is condensed in a single atom. One notable property of the co-information is that for \(n \geq 3\) it can be negative. For example:

In [4]: from dit.example_dists import Xor

In [5]: d = Xor()

In [6]: I(d)
Out[6]: -1.0

Based on these two examples one might get the impression that the co-information is positive for “redundant” distributions and negative for “synergistic” distributions. This however is not true — consider the four-variable parity distribution:

In [7]: from dit.example_dists import n_mod_m

In [8]: d = n_mod_m(4, 2)

In [9]: I(d)
Out[9]: 1.0

This means that the co-information is positive for both the most redundant distribution, the giant bit, and the most synergistic, the parity. Therefore the co-information cannot be used to measure redundancy or synergy.

Note

Correctly measuring redundancy and synergy is an ongoing problem. See [GCJ+13] and references therein for the current status of the problem.

Visualization

The co-information can be visualized on an i-diagram as below, where only the centermost atom is shaded:

The co-information :math:`\I[X:Y:Z]`
API
coinformation(dist, rvs=None, crvs=None, rv_mode=None)

Calculates the coinformation.

Parameters:
  • dist (Distribution) – The distribution from which the coinformation is calculated.
  • rvs (list, None) – The indexes of the random variables between which the coinformation is calculated. If None, then the coinformation is calculated over all random variables.
  • crvs (list, None) – The indexes of the random variables to condition on. If None, then no variables are conditioned on.
  • rv_mode (str, None) – Specifies how to interpret rvs and crvs. Valid options are: {‘indices’, ‘names’}. If equal to ‘indices’, then the elements of crvs and rvs are interpreted as random variable indices. If equal to ‘names’, then the elements are interpreted as random variable names. If None, then the value of dist._rv_mode is consulted, which defaults to ‘indices’.
Returns:

I – The coinformation.

Return type:

float

Raises:

ditException – Raised if dist is not a joint distribution or if rvs or crvs contain non-existent random variables.

Examples

Let’s construct a 3-variable distribution for the XOR logic gate and name the random variables X, Y, and Z.

>>> d = dit.example_dists.Xor()
>>> d.set_rv_names(['X', 'Y', 'Z'])

To calculate coinformations, recall that rvs specifies which groups of random variables are involved. For example, the 3-way mutual information I[X:Y:Z] is calculated as:

>>> dit.multivariate.coinformation(d, ['X', 'Y', 'Z'])
-1.0

It is a quirk of strings that each element of a string is also an iterable. So an equivalent way to calculate the 3-way mutual information I[X:Y:Z] is:

>>> dit.multivariate.coinformation(d, 'XYZ')
-1.0

The reason this works is that list(‘XYZ’) == [‘X’, ‘Y’, ‘Z’]. If we want to use random variable indexes, we need to have explicit groupings:

>>> dit.multivariate.coinformation(d, [[0], [1], [2]], rv_mode='indexes')
-1.0

To calculate the mutual information I[X, Y : Z], we use explicit groups:

>>> dit.multivariate.coinformation(d, ['XY', 'Z'])

Using indexes, this looks like:

>>> dit.multivariate.coinformation(d, [[0, 1], [2]], rv_mode='indexes')

The mutual information I[X:Z] is given by:

>>> dit.multivariate.coinformation(d, 'XZ')
0.0

Equivalently,

>>> dit.multivariate.coinformation(d, ['X', 'Z'])
0.0

Using indexes, this becomes:

>>> dit.multivariate.coinformation(d, [[0], [2]])
0.0

Conditional mutual informations can be calculated by passing in the conditional random variables. The conditional mutual information I[X:Y|Z] is:

>>> dit.multivariate.coinformation(d, 'XY', 'Z')
1.0

Using indexes, this becomes:

>>> rvs = [[0], [1]]
>>> crvs = [[2]] # broken
>>> dit.multivariate.coinformation(d, rvs, crvs, rv_mode='indexes')
1.0

For the conditional random variables, groupings have no effect, so you can also obtain this as:

>>> rvs = [[0], [1]]
>>> crvs = [2]
>>> dit.multivariate.coinformation(d, rvs, crvs, rv_mode='indexes')
1.0

Finally, note that conditional entropies can also be calculated. The conditional entropy H[Z|XY] is obtained as:

>>> rvs = [[2]]
>>> crvs = [[0], [1]] # broken
>>> dit.multivariate.coinformation(d, rvs, crvs, rv_mode='indexes')
0.0
>>> crvs = [[0, 1]] # broken
>>> dit.multivariate.coinformation(d, rvs, crvs, rv_mode='indexes')
0.0
>>> crvs = [0, 1]
>>> dit.multivariate.coinformation(d, rvs, crvs, rv_mode='indexes')
0.0
>>> rvs = 'Z'
>>> crvs = 'XY'
>>> dit.multivariate.coinformation(d, rvs, crvs, rv_mode='indexes')
0.0

Note that [[0], [1]] says to condition on two groups. But conditioning is a flat operation and doesn’t respect the groups, so it is equal to a single group of 2 random variables: [[0, 1]]. With random variable names, ‘XY’ is acceptable because list(‘XY’) == [‘X’, ‘Y’], which specifies two singleton groups. By the previous argument, this will be treated the same as [‘XY’].

Interaction Information

The interaction information is equal in magnitude to the co-information, but has the opposite sign when taken over an odd number of variables:

\[\begin{split}\II(X_{0:n}) &= (-1)^{n} \cdot \I(X_{0:n})\end{split}\]

Interaction information was first studied in the 3-variable case which, for \(X_{0:3} = X_0X_1X_2\), takes the following form:

\[\II(X_0:X_1:X_2) = \I(X_0:X_1|X_2) - \I(X_0:X_1)\]

The extension to \(n > 3\) proceeds recursively. For example,

\[\begin{split}\II(X_0:X_1:X_2:X_3) &= \II(X_0:X_1:X_2|X_3) - \II(X_0:X_1:X_2) \\ &= \I(X_0:X_1|X_2,X_3) - \I(X_0:X_1|X_3) \\ &\qquad - \I(X_0:X_1|X_2) + \I(X_0:X_1)\end{split}\]

See also

For more information, see Co-Information.

API
interaction_information(dist, rvs=None, crvs=None, rv_mode=None)

Calculates the interaction information.

Parameters:
  • dist (Distribution) – The distribution from which the interaction information is calculated.
  • rvs (list, None) – The indexes of the random variables between which the interaction information is calculated. If None, then the interaction information is calculated over all random variables.
  • crvs (list, None) – The indexes of the random variables to condition on. If None, then no variables are conditioned on.
  • rv_mode (str, None) – Specifies how to interpret rvs and crvs. Valid options are: {‘indices’, ‘names’}. If equal to ‘indices’, then the elements of crvs and rvs are interpreted as random variable indices. If equal to ‘names’, then the elements are interpreted as random variable names. If None, then the value of dist._rv_mode is consulted, which defaults to ‘indices’.
Returns:

II – The interaction information.

Return type:

float

Raises:

ditException – Raised if dist is not a joint distribution or if rvs or crvs contain non-existent random variables.
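
The docstring above carries no example, so here is a brief sketch with the xor distribution: over three variables the interaction information is simply the negative of the co-information, which was shown above to be -1.

>>> from dit.multivariate import interaction_information
>>> d = dit.example_dists.Xor()
>>> interaction_information(d)   # (-1)**3 * I[X:Y:Z]
1.0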

Total Correlation

The total correlation [Wat60], denoted \(\T\), also known as the multi-information or integration, is one generalization of the mutual information. It is defined as the amount of information each individual variable carries above and beyond the joint entropy, i.e. the difference between the sum of the parts and the whole:

\[\begin{split}\T[X_{0:n}] &= \sum \H[X_i] - \H[X_{0:n}] \\ &= \sum_{x_{0:n} \in X_{0:n}} p(x_{0:n}) \log_2 \frac{p(x_{0:n})}{\prod p(x_i)}\end{split}\]

Two nice features of the total correlation are that it is non-negative and that it is zero if and only if the random variables \(X_{0:n}\) are all independent. Some baseline behavior is good to note also. First its behavior when applied to “giant bit” distributions:

In [1]: from dit import Distribution as D

In [2]: from dit.multivariate import total_correlation as T

In [3]: [ T(D(['0'*n, '1'*n], [0.5, 0.5])) for n in range(2, 6) ]
Out[3]: [1.0, 2.0, 3.0, 4.0]

So we see that for giant bit distributions, the total correlation is equal to one less than the number of variables. The second type of distribution to consider is general parity distributions:

In [4]: from dit.example_dists import n_mod_m

In [5]: [ T(n_mod_m(n, 2)) for n in range(3, 6) ]
Out[5]: [1.0, 1.0, 1.0]

In [6]: [ T(n_mod_m(3, m)) for m in range(2, 5) ]
Out[6]: [1.0, 1.5849625007211565, 2.0]

Here we see that the total correlation is equal to \(\log_2{m}\) regardless of \(n\).

The total correlation follows a nice decomposition rule. Given two sets of (not necessarily independent) random variables, \(A\) and \(B\), the total correlation of \(A \cup B\) is:

\[\T[A \cup B] = \T[A] + \T[B] + \I[A : B]\]
In [7]: from dit.multivariate import coinformation as I

In [8]: d = n_mod_m(4, 3)

In [9]: T(d) == T(d, [[0], [1]]) + T(d, [[2], [3]]) + I(d, [[0, 1], [2, 3]])
Out[9]: True
Visualization

The total correlation consists of all information that is shared among the variables, and weights each piece according to how many variables it is shared among.

The total correlation :math:`\T[X:Y:Z]`
API
total_correlation(dist, rvs=None, crvs=None, rv_mode=None)
Parameters:
  • dist (Distribution) – The distribution from which the total correlation is calculated.
  • rvs (list, None) – A list of lists. Each inner list specifies the indexes of the random variables used to calculate the total correlation. If None, then the total correlation is calculated over all random variables, which is equivalent to passing rvs=dist.rvs.
  • crvs (list, None) – A single list of indexes specifying the random variables to condition on. If None, then no variables are conditioned on.
  • rv_mode (str, None) – Specifies how to interpret rvs and crvs. Valid options are: {‘indices’, ‘names’}. If equal to ‘indices’, then the elements of crvs and rvs are interpreted as random variable indices. If equal to ‘names’, then the elements are interpreted as random variable names. If None, then the value of dist._rv_mode is consulted, which defaults to ‘indices’.
Returns:

T – The total correlation.

Return type:

float

Examples

>>> d = dit.example_dists.Xor()
>>> dit.multivariate.total_correlation(d)
1.0
>>> dit.multivariate.total_correlation(d, rvs=[[0], [1]])
0.0
Raises:

ditException – Raised if dist is not a joint distribution or if rvs or crvs contain non-existent random variables.
Binding Information

The binding information [AP12], or dual total correlation, is yet another generalization of the mutual information. It is the amount of information that is shared among the variables. It is defined as:

\[\begin{split}\B[X_{0:n}] &= \H[X_{0:n}] - \sum \H[X_i | X_{\{0..n\}/i}] \\ &= - \sum_{x_{0:n} \in X_{0:n}} p(x_{0:n}) \log_2 \frac{p(x_{0:n})}{\prod p(x_i|x_{\{0:n\}/i})}\end{split}\]

In a sense the binding information captures the same information that the Total Correlation does, in that both measures are zero or non-zero together. However, the two measures take on very different quantitative values for different distributions. By way of example, the type of distribution that maximizes the total correlation is a “giant bit”:

In [1]: from dit.multivariate import binding_information, total_correlation

In [2]: d = dit.Distribution(['000', '111'], [1/2, 1/2])

In [3]: total_correlation(d)
Out[3]: 2.0

In [4]: binding_information(d)
Out[4]: 1.0

For the same distribution, the binding information takes on a relatively low value. On the other hand, the type of distribution that maximizes the binding information is a “parity” distribution:

In [5]: from dit.example_dists import n_mod_m

In [6]: d = n_mod_m(3, 2)

In [7]: total_correlation(d)
Out[7]: 1.0

In [8]: binding_information(d)
Out[8]: 2.0
Relationship to Other Measures

The binding information obeys particular bounds related to both the Entropy and the Total Correlation:

\[\begin{split}0 \leq & \B[X_{0:n}] \leq \H[X_{0:n}] \\ \frac{\T[X_{0:n}]}{n-1} \leq & \B[X_{0:n}] \leq (n-1)\T[X_{0:n}]\end{split}\]
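
These bounds can be checked directly; a small sketch using the three-variable parity distribution:

>>> from dit.multivariate import binding_information, total_correlation, entropy
>>> from dit.example_dists import n_mod_m
>>> d = n_mod_m(3, 2)
>>> B, T, H, n = binding_information(d), total_correlation(d), entropy(d), 3
>>> 0 <= B <= H and T / (n - 1) <= B <= (n - 1) * T
True
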
Visualization

The binding information, as seen below, consists of all the information shared among the variables, with each piece weighted equally no matter how many variables share it.

The binding information :math:`\B[X:Y:Z]`
API
binding_information(dist, rvs=None, crvs=None, rv_mode=None)
Parameters:
  • dist (Distribution) – The distribution from which the binding information is calculated.
  • rvs (list, None) – The indexes of the random variable used to calculate the binding information. If None, then the binding information is calculated over all random variables.
  • crvs (list, None) – The indexes of the random variables to condition on. If None, then no variables are conditioned on.
  • rv_mode (str, None) – Specifies how to interpret rvs and crvs. Valid options are: {‘indices’, ‘names’}. If equal to ‘indices’, then the elements of crvs and rvs are interpreted as random variable indices. If equal to ‘names’, then the elements are interpreted as random variable names. If None, then the value of dist._rv_mode is consulted, which defaults to ‘indices’.
Returns:

B – The binding information.

Return type:

float

Raises:

ditException – Raised if dist is not a joint distribution or if rvs or crvs contain non-existent random variables.

Residual Entropy

The residual entropy, or erasure entropy, is a dual to the Binding Information. It is dual in the sense that together they form the entropy of the distribution.

\[\begin{split}\R[X_{0:n}] &= \sum \H[X_i | X_{\{0..n\}/i}] \\ &= -\sum_{x_{0:n} \in X_{0:n}} p(x_{0:n}) \log_2 \prod p(x_i|x_{\{0:n\}/i})\end{split}\]

The residual entropy was originally proposed in [VW08] to quantify the information lost by sporadic erasures in a channel. The idea here is that only the information uncorrelated with the other random variables is lost when a variable is erased.
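
The duality with the binding information mentioned above (the two sum to the joint entropy) is easy to verify; a quick sketch with the xor distribution:

>>> from dit.multivariate import binding_information, residual_entropy, entropy
>>> d = dit.example_dists.Xor()
>>> B, R, H = binding_information(d), residual_entropy(d), entropy(d)
>>> abs((B + R) - H) < 1e-10
True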

If a joint distribution consists of independent random variables, the residual entropy is equal to the Entropy:

In [1]: from dit.multivariate import entropy, residual_entropy

In [2]: d = dit.uniform_distribution(3, 2)

In [3]: entropy(d) == residual_entropy(d)
Out[3]: True

Another simple example is a distribution where one random variable is independent of the others:

In [4]: d = dit.uniform(['000', '001', '110', '111'])

In [5]: residual_entropy(d)
Out[5]: 1.0

If we ask for the residual entropy of only the latter two random variables, the middle one is now independent of the others and so the residual entropy grows:

In [6]: residual_entropy(d, [[1], [2]])
Out[6]: 2.0
Visualization

The residual entropy consists of all the unshared information in the distribution. That is, it is the information in each variable not overlapping with any other.

The residual entropy :math:`\R[X:Y]`
The residual entropy :math:`\R[X:Y:Z]`
API
residual_entropy(dist, rvs=None, crvs=None, rv_mode=None)
Parameters:
  • dist (Distribution) – The distribution from which the residual entropy is calculated.
  • rvs (list, None) – The indexes of the random variables used to calculate the residual entropy. If None, then the residual entropy is calculated over all random variables.
  • crvs (list, None) – The indexes of the random variables to condition on. If None, then no variables are conditioned on.
  • rv_mode (str, None) – Specifies how to interpret rvs and crvs. Valid options are: {‘indices’, ‘names’}. If equal to ‘indices’, then the elements of crvs and rvs are interpreted as random variable indices. If equal to ‘names’, then the elements are interpreted as random variable names. If None, then the value of dist._rv_mode is consulted, which defaults to ‘indices’.
Returns:

R – The residual entropy.

Return type:

float

Raises:

ditException – Raised if dist is not a joint distribution or if rvs or crvs contain non-existent random variables.

TSE Complexity

The Tononi-Sporns-Edelman (TSE) complexity [TSE94] is a complexity measure for distributions. It is designed so that it is maximized by distributions where small subsets of random variables are loosely coupled but the overall distribution is tightly coupled.

\[\begin{split}\TSE[X|Z] = \sum_{k=1}^{|X|} \left( {|X| \choose k}^{-1} \sum_{\substack{y \subseteq X \\ |y|=k}} \left( \H[y|Z] \right) - \frac{k}{|X|}\H[X|Z] \right)\end{split}\]

Two distributions which might be considered tightly coupled are the “giant bit” and the “parity” distributions:

In [1]: from dit.multivariate import tse_complexity

In [2]: from dit.example_dists import Xor

In [3]: d1 = Xor()

In [4]: tse_complexity(d1)
Out[4]: 1.0

In [5]: d2 = dit.Distribution(['000', '111'], [1/2, 1/2])

In [6]: tse_complexity(d2)
Out[6]: 1.0

The TSE Complexity assigns them both a value of \(1.0\) bits, which is the maximal value the TSE takes over trivariate, binary alphabet distributions.

API
tse_complexity(dist, rvs=None, crvs=None, rv_mode=None)

Calculates the TSE complexity.

Parameters:
  • dist (Distribution) – The distribution from which the TSE complexity is calculated.
  • rvs (list, None) – The indexes of the random variables used to calculate the TSE complexity. If None, then the TSE complexity is calculated over all random variables.
  • crvs (list, None) – The indexes of the random variables to condition on. If None, then no variables are conditioned on.
  • rv_mode (str, None) – Specifies how to interpret rvs and crvs. Valid options are: {‘indices’, ‘names’}. If equal to ‘indices’, then the elements of crvs and rvs are interpreted as random variable indices. If equal to ‘names’, then the elements are interpreted as random variable names. If None, then the value of dist._rv_mode is consulted, which defaults to ‘indices’.
Returns:

TSE – The TSE complexity.

Return type:

float

Raises:

ditException – Raised if dist is not a joint distribution or if rvs or crvs contain non-existent random variables.

Non-Shannon Measures

The next set of measures are based on the definition of an auxiliary random variable, generally one which in some way captures the information common to the random variables.

Gács-Körner Common Information

The Gács-Körner common information [GacsKorner73] takes a very direct approach to the idea of common information. It extracts a random variable that is contained within each of the random variables under consideration.

The Common Information Game

Let’s play a game. We have an n-variable joint distribution, and one player for each variable. Each player is given the probability mass function of the joint distribution and is then isolated from the other players. Each round of the game, a joint outcome is generated from the distribution and each player is told the symbol that their particular variable took. The goal of the game is for the players to simultaneously write the same symbol on a piece of paper, and for the entropy of the players’ symbols to be maximized. They must do this using only their knowledge of the joint random variable and the particular outcome of their marginal variable. The matching symbols produced by the players are called the common random variable and the entropy of that variable is the Gács-Körner common information, \(\K\).

Two Variables

Consider a joint distribution over \(X_0\) and \(X_1\). Given any particular outcome from that joint, we want a function \(f(X_0)\) and a function \(g(X_1)\) such that \(\forall x_0x_1 = X_0X_1, f(x_0) = g(x_1) = v\). Of all possible pairs of functions \(f(X_0) = g(X_1) = V\), there exists a “largest” one, and it is known as the common random variable. The entropy of that common random variable is the Gács-Körner common information:

\[\begin{split}\K[X_0 : X_1] &= \max_{f(X_0) = g(X_1) = V} \H[V] \\ &= \H[X_0 \meet X_1]\end{split}\]

As a canonical example, consider the following:

In [1]: from dit import Distribution as D

In [2]: from dit.multivariate import gk_common_information as K

In [3]: outcomes = ['00', '01', '10', '11', '22', '33']

In [4]: pmf = [1/8, 1/8, 1/8, 1/8, 1/4, 1/4]

In [5]: d = D(outcomes, pmf, sample_space=outcomes)

In [6]: K(d)
Out[6]: 1.5

Note

It is important that we set the sample_space argument. If it is None, then the sample space is the Cartesian product of the individual alphabets, and in such a case the meet will trivially be degenerate.

So, the Gács-Körner common information is 1.5 bits. But what is the common random variable?

In [7]: from dit.algorithms import insert_meet

In [8]: crv = insert_meet(d, -1, [[0],[1]])

In [9]: print(crv)
Class:          Distribution
Alphabet:       (('0', '1', '2', '3'), ('0', '1', '2', '3'), ('0', '1', '2'))
Base:           linear
Outcome Class:  str
Outcome Length: 3
RV Names:       None

x     p(x)
002   0.125
012   0.125
102   0.125
112   0.125
220   0.25
331   0.25

Looking at the third index of the outcomes, we see that the common random variable maps 2 to 0 and 3 to 1, maintaining the information from those values. When \(X_0\) or \(X_1\) are either 0 or 1, however, it maps them to 2. This is because \(f\) and \(g\) must act independently: if \(x_0\) is a 0 or a 1, there is no way to know if \(x_1\) is a 0 or a 1 and vice versa. Therefore we aggregate 0s and 1s into 2.

Visualization

The Gács-Körner common information is the largest “circle” that entirely fits within the mutual information’s “football”:

The Gács-Körner common information \(\K[X:Y]\)
Properties & Uses

The Gács-Körner common information satisfies an important inequality:

\[0 \leq \K[X_0:X_1] \leq \I[X_0:X_1]\]
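
This bound is easy to check numerically. The following is a minimal sketch reusing the canonical distribution defined above; for that distribution the meet captures all of the shared information, so the two sides coincide.

# Sketch: check 0 <= K <= I on the canonical example from above.
import dit
from dit.multivariate import gk_common_information as K
from dit.shannon import mutual_information as I

outcomes = ['00', '01', '10', '11', '22', '33']
pmf = [1/8, 1/8, 1/8, 1/8, 1/4, 1/4]
d = dit.Distribution(outcomes, pmf, sample_space=outcomes)

k = K(d)
i = I(d, [0], [1])
assert 0 <= k <= i + 1e-12   # both are 1.5 bits here, so the bound is tight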

One usage of the common information is as a measure of redundancy [GCJ+13]. Consider a function that takes two inputs, \(X_0\) and \(X_1\), and produces a single output \(Y\). The output can be influenced redundantly by both inputs, uniquely from either one, or together they can synergistically influence the output. Determining how to compute the amount of redundancy is an open problem, but one proposal is:

\[\I[X_0 \meet X_1 : Y]\]

Which can be visualized as this:

The zero-error redundancy \(\I[X \meet Y : Z]\)

This quantity can be computed easily using dit:

In [10]: from dit.example_dists import RdnXor

In [11]: from dit.shannon import mutual_information as I

In [12]: d = RdnXor()

In [13]: d = dit.pruned_samplespace(d)

In [14]: d = insert_meet(d, -1, [[0],[1]])

In [15]: I(d, [3], [2])
Out[15]: 1.0
\(n\)-Variables

With an arbitrary number of variables, the Gács-Körner common information is defined similarly:

\[\begin{split}\K[X_0 : \ldots : X_n] &= \max_{\substack{V = f_0(X_0) \\ \vdots \\ V = f_n(X_n)}} \H[V] \\ &= \H[X_0 \meet \ldots \meet X_n]\end{split}\]

The common information is a monotonically decreasing function in the number of variables:

\[\K[X_0 : \ldots : X_{n-1}] \ge \K[X_0 : \ldots : X_n]\]

The multivariate common information follows a similar inequality as the two variable version:

\[0 \leq \K[X_0 : \dots : X_n] \leq \min_{i, j \in \{0..n\}} \I[X_i : X_j]\]
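
As a brief illustration (a sketch; the two small distributions below are constructed here, not taken from the text above), a single bit copied to three variables is entirely common, while the three-variable even-parity distribution has no common random variable at all:

# Sketch: Gacs-Korner common information for more than two variables.
import dit
from dit.multivariate import gk_common_information as K

# A single bit copied to three variables: the bit itself is common to all.
giant_bit = ['000', '111']
d1 = dit.Distribution(giant_bit, [1/2, 1/2], sample_space=giant_bit)

# Even parity: any two variables determine the third, yet nothing is common.
parity = ['000', '011', '101', '110']
d2 = dit.Distribution(parity, [1/4]*4, sample_space=parity)

print(K(d1))   # expected: 1.0
print(K(d2))   # expected: 0.0
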
Visualization

Here, as above, the Gács-Körner common information among three variables is the largest “circle”, this time fitting inside the vaguely triangular Co-Information region.

The Gács-Körner common information \(\K[X:Y:Z]\)
API
gk_common_information(dist, rvs=None, crvs=None, rv_mode=None)

Returns the Gacs-Korner common information K[X1:X2...] over the random variables in rvs.

Parameters:
  • dist (Distribution) – The distribution from which the common information is calculated.
  • rvs (list, None) – The indexes of the random variables for which the Gacs-Korner common information is to be computed. If None, then the common information is calculated over all random variables.
  • crvs (list, None) – The indexes of the random variables to condition the common information by. If None, then there is no conditioning.
  • rv_mode (str, None) – Specifies how to interpret rvs and crvs. Valid options are: {‘indices’, ‘names’}. If equal to ‘indices’, then the elements of crvs and rvs are interpreted as random variable indices. If equal to ‘names’, the elements are interpreted as random variable names. If None, then the value of dist._rv_mode is consulted, which defaults to ‘indices’.
Returns:

K – The Gacs-Korner common information of the distribution.

Return type:

float

Raises:

ditException – Raised if rvs or crvs contain non-existent random variables.

This next group of measures cannot be represented on information diagrams, and cannot readily be compared directly to the measures above:

Other Measures

Other measures of information. These are generally based around alternatives to the Shannon entropy proposed for a variety of reasons.

Cumulative Residual Entropy

The cumulative residual entropy [RCVW04] is an alternative to the differential Shannon entropy. The differential entropy has several shortcomings: it can be negative even for simple distributions such as the uniform distribution, and discrete estimates that converge to a continuous distribution do not have entropies that converge to the differential (continuous) entropy. The cumulative residual entropy also aims to meaningfully distinguish numerically different random variables, such as a die labeled [1, 2, 3, 4, 5, 6] and one labeled [1, 2, 3, 4, 5, 100].

Note

The Cumulative Residual Entropy is unrelated to Residual Entropy.

\[\begin{split}\CRE[X] = -\int_{0}^{\infty} p(|X| > x) \log_{2} p(|X| > x) dx\end{split}\]
In [1]: from dit.other import cumulative_residual_entropy

In [2]: d1 = dit.ScalarDistribution([1, 2, 3, 4, 5, 6], [1/6]*6)

In [3]: d2 = dit.ScalarDistribution([1, 2, 3, 4, 5, 100], [1/6]*6)

In [4]: cumulative_residual_entropy(d1)
Out[4]: 2.0683182557028439

In [5]: cumulative_residual_entropy(d2)
Out[5]: 22.672680046016705
Generalized Cumulative Residual Entropy

The generalized form of the cumulative residual entropy integrates over the entire set of reals rather than just the positive ones:

\[\begin{split}\GCRE[X] = -\int_{-\infty}^{\infty} p(X > x) \log_{2} p(X > x) dx\end{split}\]
In [6]: from dit.other import generalized_cumulative_residual_entropy

In [7]: generalized_cumulative_residual_entropy(d1)
Out[7]: 2.0683182557028439

In [8]: d3 = dit.ScalarDistribution([-2, -1, 0, 1, 2], [1/5]*5)

In [9]: cumulative_residual_entropy(d3)
Out[9]: 0.90656497547719606

In [10]: generalized_cumulative_residual_entropy(d3)
Out[10]: 1.6928786893420307
Conditional Cumulative Residual Entropy

The conditional cumulative residual entropy \(\CRE[X|Y]\) is a distribution with the same probability mass function as \(Y\), and the outcome associated with \(p(y)\) is equal to the cumulative residual entropy over probabilities conditioned on \(Y = y\). In this sense the conditional cumulative residual entropy is more akin to a distribution over \(\H[X|Y=y]\) than the single scalar quantity \(\H[X|Y]\).

\[\begin{split}\CRE[X|Y] = - \int_{0}^{\infty} p(|X| > x | Y) \log_{2} p(|X| > x | Y) dx\end{split}\]
Conditional Generalized Cumulative Residual Entropy

Conceptually the conditional generalized cumulative residual entropy is the same as the non-generalized form, but integrated over the entire real line rather than just the positive:

\[\begin{split}\GCRE[X|Y] = - \int_{-\infty}^{\infty} p(X > x | Y) \log_{2} p(X > x | Y) dx\end{split}\]
API
cumulative_residual_entropy(dist, extract=False)

The cumulative residual entropy is an alternative to the Shannon differential entropy with several desirable properties including non-negativity.

Parameters:
  • dist (Distribution) – The distribution to compute the cumulative residual entropy of each index for.
  • extract (bool) – If True and dist.outcome_length() is 1, return the single CRE value rather than a length-1 array.
Returns:

CREs – The cumulative residual entropy for each index.

Return type:

ndarray

Examples

>>> d1 = ScalarDistribution([1, 2, 3, 4, 5, 6], [1/6]*6)
>>> d2 = ScalarDistribution([1, 2, 3, 4, 5, 100], [1/6]*6)
>>> cumulative_residual_entropy(d1)
2.0683182557028439
>>> cumulative_residual_entropy(d2)
22.672680046016705
generalized_cumulative_residual_entropy(dist, extract=False)

The generalized cumulative residual entropy is a generalized form of the cumulative residual entropy. Rather than integrating the survival function of the absolute value from 0 to infinity, it integrates the survival function over the entire real line.

Parameters:
  • dist (Distribution) – The distribution to compute the generalized cumulative residual entropy of each index for.
  • extract (bool) – If True and dist.outcome_length() is 1, return the single GCRE value rather than a length-1 array.
Returns:

GCREs – The generalized cumulative residual entropy for each index.

Return type:

ndarray

Examples

>>> generalized_cumulative_residual_entropy(uniform(-2, 3))
1.6928786893420307
>>> generalized_cumulative_residual_entropy(uniform(0, 5))
1.6928786893420307
Conditional Forms
conditional_cumulative_residual_entropy(dist, rv, crvs=None, rv_mode=None)

Returns the conditional cumulative residual entropy.

Parameters:
  • dist (Distribution) – The distribution to compute the conditional cumulative residual entropy of.
  • rv (list, None) – The possibly joint random variable to compute the conditional cumulative residual entropy of. If None, then all variables not in crvs are used.
  • crvs (list, None) – The random variables to condition on. If None, nothing is conditioned on.
  • rv_mode (str, None) – Specifies how to interpret rvs and crvs. Valid options are: {‘indices’, ‘names’}. If equal to ‘indices’, then the elements of crvs and rvs are interpreted as random variable indices. If equal to ‘names’, the elements are interpreted as random variable names. If None, then the value of dist._rv_mode is consulted, which defaults to ‘indices’.
Returns:

CCRE – The conditional cumulative residual entropy.

Return type:

ScalarDistribution

Examples

>>> from itertools import product
>>> events = [ (a, b) for a, b in product(range(5), range(5)) if a <= b ]
>>> probs = [ 1/(5-a)/5 for a, b in events ]
>>> d = Distribution(events, probs)
>>> print(conditional_cumulative_residual_entropy(d, 1, [0]))
Class:    ScalarDistribution
Alphabet: (-0.0, 0.5, 0.91829583405448956, 1.3112781244591329, 1.6928786893420307)
Base:     linear

x               p(x)
-0.0            0.2
0.5             0.2
0.918295834054  0.2
1.31127812446   0.2
1.69287868934   0.2

conditional_generalized_cumulative_residual_entropy(dist, rv, crvs=None, rv_mode=None)

Returns the conditional generalized cumulative residual entropy.

Parameters:
  • dist (Distribution) – The distribution to compute the conditional generalized cumulative residual entropy of.
  • rv (list, None) – The possibly joint random variable to compute the conditional generalized cumulative residual entropy of. If None, then all variables not in crvs are used.
  • crvs (list, None) – The random variables to condition on. If None, nothing is conditioned on.
  • rv_mode (str, None) – Specifies how to interpret rvs and crvs. Valid options are: {‘indices’, ‘names’}. If equal to ‘indices’, then the elements of crvs and rvs are interpreted as random variable indices. If equal to ‘names’, the elements are interpreted as random variable names. If None, then the value of dist._rv_mode is consulted, which defaults to ‘indices’.
Returns:

CCRE – The conditional generalized cumulative residual entropy.

Return type:

ScalarDistribution

Examples

>>> from itertools import product
>>> events = [ (a-2, b-2) for a, b in product(range(5), range(5)) if a <= b ]
>>> probs = [ 1/(3-a)/5 for a, b in events ]
>>> d = Distribution(events, probs)
>>> print(conditional_generalized_cumulative_residual_entropy(d, 1, [0]))
Class:    ScalarDistribution
Alphabet: (-0.0, 0.5, 0.91829583405448956, 1.3112781244591329, 1.6928786893420307)
Base:     linear

x               p(x)
-0.0            0.2
0.5             0.2
0.918295834054  0.2
1.31127812446   0.2
1.69287868934   0.2

Rényi Entropy

The Rényi entropy is a spectrum of generalizations to the Shannon Entropy:

\[\RE[X] = \frac{1}{1-\alpha} \log_2 \left( \sum_{x \in \mathcal{X}} p(x)^\alpha \right)\]
In [1]: from dit.other import renyi_entropy

In [2]: from dit.example_dists import binomial

In [3]: d = binomial(15, 0.4)

In [4]: renyi_entropy(d, 3)
Out[4]: 2.6611840717104625
Special Cases

For several values of \(\alpha\), the Rényi entropy takes on particular values.

\(\alpha = 0\)

When \(\alpha = 0\) the Rényi entropy becomes what is known as the Hartley entropy:

\[\H_{0}[X] = \log_2 |X|\]
In [5]: renyi_entropy(d, 0)
Out[5]: 4.0
\(\alpha = 1\)

When \(\alpha = 1\) the Rényi entropy becomes the standard Shannon entropy:

\[\H_{1}[X] = \H[X]\]
In [6]: renyi_entropy(d, 1)
Out[6]: 2.9688513169509618
\(\alpha = 2\)

When \(\alpha = 2\), the Rényi entropy becomes what is known as the collision entropy:

\[\H_{2}[X] = - \log_2 p(X = Y)\]

where \(Y\) is an IID copy of \(X\). This is basically the surprisal of “rolling doubles”.

In [7]: renyi_entropy(d, 2)
Out[7]: 2.7607270851693615
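
The connection to the collision probability can be checked directly from the pmf. The following is a quick sketch reusing the binomial distribution from above:

# Sketch: the collision entropy is -log2 of the probability that two IID
# draws from the distribution agree.
import numpy as np
import dit
from dit.other import renyi_entropy
from dit.example_dists import binomial

d = binomial(15, 0.4)
p_collision = np.sum(d.pmf ** 2)    # p(X == Y) for IID copies X, Y
assert np.isclose(-np.log2(p_collision), renyi_entropy(d, 2))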
\(\alpha = \infty\)

Finally, when \(\alpha = \infty\) the Rényi entropy picks out the probability of the most-probable event:

\[\H_{\infty}[X] = - \log_2 \max_{x \in \mathcal{X}} p(x)\]
In [8]: renyi_entropy(d, np.inf)
Out[8]: 2.275104563096674
General Properties

In general, the Rényi entropy is a monotonically decreasing function in \(\alpha\):

\[\begin{split}\H_{\alpha}[X] \ge \H_{\beta}[X], \quad \beta > \alpha\end{split}\]

Further, the following inequality holds in the other direction:

\[\H_{2}[X] \le 2 \H_{\infty}[X]\]
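
The monotonicity in \(\alpha\) is easy to check numerically. Here is a quick sketch using the binomial distribution from above:

# Sketch: the Renyi entropy should be non-increasing in the order alpha.
import numpy as np
import dit
from dit.other import renyi_entropy
from dit.example_dists import binomial

d = binomial(15, 0.4)
orders = [0, 0.5, 1, 2, 3, np.inf]
values = [renyi_entropy(d, a) for a in orders]
assert all(h1 >= h2 for h1, h2 in zip(values, values[1:]))
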
API
renyi_entropy(dist, order, rvs=None, rv_mode=None)

Compute the Renyi entropy of order order.

Parameters:
  • dist (Distribution) – The distribution to take the Renyi entropy of.
  • order (float >= 0) – The order of the Renyi entropy.
  • rvs (list, None) – The indexes of the random variables used to calculate the Renyi entropy. If None, then the Renyi entropy is calculated over all random variables.
  • rv_mode (str, None) – Specifies how to interpret rvs and crvs. Valid options are: {‘indices’, ‘names’}. If equal to ‘indices’, then the elements of crvs and rvs are interpreted as random variable indices. If equal to ‘names’, the elements are interpreted as random variable names. If None, then the value of dist._rv_mode is consulted, which defaults to ‘indices’.
Returns:

H_a – The Renyi entropy.

Return type:

float

Raises:
  • ditException – Raised if rvs or crvs contain non-existent random variables.
  • ValueError – Raised if order is not a non-negative float.

Tsallis Entropy

The Tsallis entropy is a generalization of the Shannon (or Boltzmann-Gibbs) entropy to the case where entropy is nonextensive. It is given by:

\[\TE[X] = \frac{1}{q - 1} \left( 1 - \sum_{x \in \mathcal{X}} p(x)^q \right)\]
In [1]: from dit.other import tsallis_entropy

In [2]: from dit.example_dists import n_mod_m

In [3]: d = n_mod_m(4, 3)

In [4]: tsallis_entropy(d, 4)
Out[4]: 0.33331639824552489
Non-additivity

One interesting property of the Tsallis entropy is the relationship between the joint Tsallis entropy of two independent systems and the Tsallis entropy of those subsystems:

\[\TE[X, Y] = \TE[X] + \TE[Y] + (1-q)\TE[X]\TE[Y]\]
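
This pseudo-additivity can be checked numerically. The following sketch builds a small product distribution by hand (the specific joint below is assumed here purely for illustration):

# Sketch: verify T[X,Y] = T[X] + T[Y] + (1-q) T[X] T[Y] for an independent joint.
import dit
from dit.other import tsallis_entropy

q = 4
# Joint of two independent binary variables: p(x, y) = p(x) * p(y).
d = dit.Distribution(['00', '01', '10', '11'], [3/8, 1/8, 3/8, 1/8])

te_xy = tsallis_entropy(d, q)
te_x = tsallis_entropy(d.marginal([0]), q)
te_y = tsallis_entropy(d.marginal([1]), q)

assert abs(te_xy - (te_x + te_y + (1 - q) * te_x * te_y)) < 1e-10
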
API
tsallis_entropy(dist, order, rvs=None, rv_mode=None)

Compute the Tsallis entropy of order order.

Parameters:
  • dist (Distribution) – The distribution to take the Tsallis entropy of.
  • order (float >= 0) – The order of the Tsallis entropy.
  • rvs (list, None) – The indexes of the random variables used to calculate the Tsallis entropy. If None, then the Tsallis entropy is calculated over all random variables.
  • rv_mode (str, None) – Specifies how to interpret rvs and crvs. Valid options are: {‘indices’, ‘names’}. If equal to ‘indices’, then the elements of crvs and rvs are interpreted as random variable indices. If equal to ‘names’, the elements are interpreted as random variable names. If None, then the value of dist._rv_mode is consulted, which defaults to ‘indices’.
Returns:

S_q – The Tsallis entropy.

Return type:

float

Raises:
  • ditException – Raised if rvs or crvs contain non-existent random variables.
  • ValueError – Raised if order is not a non-negative float.

Perplexity

The perplexity is a simple transformation of the Entropy, intended to make it more intuitive:

\[\P[X] = 2^{\H[X]}\]

The perplexity of a random variable is the size of a uniform distribution that would have the same entropy. For example, a distribution with 2 bits of entropy has a perplexity of 4, and so could be said to be “as random” as a four-sided die.
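
For instance, a fair six-sided die should have a perplexity of six. A minimal sketch (the distribution below is constructed here for illustration):

# Sketch: the perplexity of a uniform distribution over n outcomes is n.
import dit
from dit.other import perplexity

die = dit.Distribution(['1', '2', '3', '4', '5', '6'], [1/6]*6)
print(perplexity(die))   # expected: 6.0 (up to floating-point rounding)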

The conditional perplexity is defined in the natural way:

\[\P[X|Y] = 2^{\H[X|Y]}\]

We can see that the xor distribution is “4-way” perplexed:

In [1]: from dit.other import perplexity

In [2]: from dit.example_dists import Xor

In [3]: perplexity(Xor())
Out[3]: 4.0
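
Conditioned on either input, the xor output is still a fair coin, so its conditional perplexity is two. A short sketch, using indices to select the output and one input:

# Sketch: conditional perplexity of the xor output given one input.
import dit
from dit.other import perplexity
from dit.example_dists import Xor

d = Xor()
print(perplexity(d, [2], [0]))   # expected: 2.0 (rv 2 is the output, rv 0 an input)
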
API
perplexity(dist, rvs=None, crvs=None, rv_mode=None)
Parameters:
  • dist (Distribution) – The distribution from which the perplexity is calculated.
  • rvs (list, None) – The indexes of the random variables used to calculate the perplexity. If None, then the perplexity is calculated over all random variables.
  • crvs (list, None) – The indexes of the random variables to condition on. If None, then no variables are conditioned on.
  • rv_mode (str, None) – Specifies how to interpret the elements of rvs. Valid options are: {‘indices’, ‘names’}. If equal to ‘indices’, then the elements of rvs are interpreted as random variable indices. If equal to ‘names’, the elements are interpreted as random variable names. If None, then the value of dist._rv_mode is consulted.
Returns:

P – The perplexity.

Return type:

float

Extropy

The extropy [LSAgro11] is a dual to the Entropy. It is defined by:

\[\J[X] = -\sum_{x \in X} (1-p(x)) \log_2 (1-p(x))\]

The entropy and the extropy satisfy the following relationship:

\[\H[X] + \J[X] = \sum_{x \in \mathcal{X}} \H[p(x), 1-p(x)] = \sum_{x \in \mathcal{X}} \J[p(x), 1-p(x)]\]

Unfortunately, the extropy does not yet have any intuitive interpretation.

In [1]: from dit.other import extropy

In [2]: from dit.example_dists import Xor

In [3]: extropy(Xor())
Out[3]: 1.2451124978365313

In [4]: extropy(Xor(), [0])
Out[4]: 1.0
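
The identity relating entropy and extropy can be verified numerically for the xor distribution. This is a sketch; the binary_entropy helper below is written out by hand rather than taken from dit:

# Sketch: H[X] + J[X] equals the sum of the binary entropies H[p(x), 1-p(x)].
import numpy as np
import dit
from dit.other import extropy
from dit.example_dists import Xor

def binary_entropy(p):
    """Shannon entropy of a two-outcome distribution (p, 1-p), in bits."""
    return -(p * np.log2(p) + (1 - p) * np.log2(1 - p))

d = Xor()
lhs = dit.shannon.entropy(d) + extropy(d)
rhs = sum(binary_entropy(p) for p in d.pmf)
assert abs(lhs - rhs) < 1e-10
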
API
extropy(dist, rvs=None, rv_mode=None)

Returns the extropy J[X] over the random variables in rvs.

If the distribution represents linear probabilities, then the extropy is calculated with units of ‘bits’ (base-2).

Parameters:
  • dist (Distribution or float) – The distribution from which the extropy is calculated. If a float, then we calculate the binary extropy.
  • rvs (list, None) – The indexes of the random variables used to calculate the extropy. If None, then the extropy is calculated over all random variables. This should remain None for ScalarDistributions.
  • rv_mode (str, None) – Specifies how to interpret the elements of rvs. Valid options are: {‘indices’, ‘names’}. If equal to ‘indices’, then the elements of rvs are interpreted as random variable indices. If equal to ‘names’, the elements are interpreted as random variable names. If None, then the value of dist._rv_mode is consulted.
Returns:

J – The extropy of the distribution.

Return type:

float

There are also measures of “distance” or divergence between two (and in some cases, more) distributions:

Divergences

Divergences are measures of comparison between distributions:

Cross Entropy

The cross entropy between two distributions \(p(x)\) and \(q(x)\) is given by:

\[\xH[p || q] = -\sum_{x \in \mathcal{X}} p(x) \log_2 q(x)\]

This quantifies the average cost of representing a distribution defined by the probabilities \(p(x)\) using the probabilities \(q(x)\). For example, the cross entropy of a distribution with itself is its entropy, because the entropy quantifies the average cost of representing a distribution:

In [1]: from dit.divergences import cross_entropy

In [2]: p = dit.Distribution(['0', '1'], [1/2, 1/2])

In [3]: cross_entropy(p, p)
Out[3]: 1.0

If, however, we attempted to model a fair coin with a biased one, we could compute this mismatch with the cross entropy:

In [4]: q = dit.Distribution(['0', '1'], [3/4, 1/4])

In [5]: cross_entropy(p, q)
Out[5]: 1.207518749639422

That is, when using the code optimized for the biased coin, we will on average use about \(1.2\) bits to represent each flip of the fair coin. Turning things around, what if we had a biased coin that we attempted to represent with a fair coin:

In [6]: cross_entropy(q, p)
Out[6]: 1.0

So although the entropy of \(q\) is less than \(1\), we will use a full bit to represent its outcomes. Both of these results can easily be seen by considering the following identity:

\[\xH[p || q] = \H[p] + \DKL[p || q]\]

So in representing \(p\) using \(q\), we must of course use at least \(\H[p]\) bits – the minimum required to represent \(p\) – plus the Kullback-Leibler divergence of \(p\) from \(q\).
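
This decomposition is easy to check numerically. A quick sketch using the two coins from above:

# Sketch: verify xH[p || q] = H[p] + DKL[p || q].
import dit
from dit.divergences import cross_entropy, kullback_leibler_divergence

p = dit.Distribution(['0', '1'], [1/2, 1/2])
q = dit.Distribution(['0', '1'], [3/4, 1/4])

lhs = cross_entropy(p, q)
rhs = dit.shannon.entropy(p) + kullback_leibler_divergence(p, q)
assert abs(lhs - rhs) < 1e-10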

API
cross_entropy(dist1, dist2, rvs=None, crvs=None, rv_mode=None)

The cross entropy between dist1 and dist2.

Parameters:
  • dist1 (Distribution) – The first distribution in the cross entropy.
  • dist2 (Distribution) – The second distribution in the cross entropy.
  • rvs (list, None) – The indexes of the random variables used to calculate the cross entropy. If None, then the cross entropy is calculated over all random variables.
  • rv_mode (str, None) – Specifies how to interpret rvs and crvs. Valid options are: {‘indices’, ‘names’}. If equal to ‘indices’, then the elements of crvs and rvs are interpreted as random variable indices. If equal to ‘names’, the elements are interpreted as random variable names. If None, then the value of dist._rv_mode is consulted, which defaults to ‘indices’.
Returns:

xh – The cross entropy between dist1 and dist2.

Return type:

float

Raises:

ditException – Raised if either dist1 or dist2 does not have the specified rvs or, if rvs is None, if dist2 has an outcome length different from dist1.

Kullback-Leibler Divergence

The Kullback-Leibler divergence, sometimes also called the relative entropy, of a distribution \(p\) from a distribution \(q\) is defined as:

\[\DKL[p || q] = \sum_{x \in \mathcal{X}} p(x) \log_2 \frac{p(x)}{q(x)}\]

The Kullback-Leibler divergence quantifies the average number of extra bits required to represent a distribution \(p\) when using an arbitrary distribution \(q\). This can be seen through the following identity:

\[\DKL[p || q] = \xH[p || q] - \H[p]\]

Where the Cross Entropy quantifies the total cost of encoding \(p\) using \(q\), and the Entropy quantifies the true, minimum cost of encoding \(p\). For example, let’s consider the cost of representing a biased coin by a fair one:

In [1]: from dit.divergences import kullback_leibler_divergence

In [2]: p = dit.Distribution(['0', '1'], [3/4, 1/4])

In [3]: q = dit.Distribution(['0', '1'], [1/2, 1/2])

In [4]: kullback_leibler_divergence(p, q)
Out[4]: 0.18872187554086717

That is, it costs us \(0.1887\) bits of wasted overhead by using a mismatched distribution.

Not a Metric

Although the Kullback-Leibler divergence is often used to see how “different” two distributions are, it is not a metric. Importantly, it is neither symmetric nor does it obey the triangle inequality. It does, however, have the following property:

\[\DKL[p || q] \ge 0\]

with equality if and only if \(p = q\). This makes it a premetric.
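
The asymmetry is easy to see with the coins from above. A short sketch:

# Sketch: the Kullback-Leibler divergence is not symmetric.
import dit
from dit.divergences import kullback_leibler_divergence as DKL

p = dit.Distribution(['0', '1'], [3/4, 1/4])
q = dit.Distribution(['0', '1'], [1/2, 1/2])

assert DKL(p, q) != DKL(q, p)   # not symmetric, hence not a metric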

API
kullback_leibler_divergence(dist1, dist2, rvs=None, crvs=None, rv_mode=None)

The Kullback-Leibler divergence between dist1 and dist2.

Parameters:
  • dist1 (Distribution) – The first distribution in the Kullback-Leibler divergence.
  • dist2 (Distribution) – The second distribution in the Kullback-Leibler divergence.
  • rvs (list, None) – The indexes of the random variables used to calculate the Kullback-Leibler divergence. If None, then the Kullback-Leibler divergence is calculated over all random variables.
  • rv_mode (str, None) – Specifies how to interpret rvs and crvs. Valid options are: {‘indices’, ‘names’}. If equal to ‘indices’, then the elements of crvs and rvs are interpreted as random variable indices. If equal to ‘names’, the elements are interpreted as random variable names. If None, then the value of dist._rv_mode is consulted, which defaults to ‘indices’.
Returns:

dkl – The Kullback-Leibler divergence between dist1 and dist2.

Return type:

float

Raises:

ditException – Raised if either dist1 or dist2 does not have the specified rvs or, if rvs is None, if dist2 has an outcome length different from dist1.

Jensen-Shannon Divergence

The Jensen-Shannon divergence is a principled divergence measure which is always finite for finite random variables. It quantifies how “distinguishable” two or more distributions are from each other. In its basic form it is:

\[\JSD[X || Y] = \H\left[\frac{X + Y}{2}\right] - \frac{\H[X] + \H[Y]}{2}\]

That is, it is the entropy of the mixture minus the mixture of the entropies. This can be generalized to an arbitrary number of random variables with arbitrary weights:

\[\JSD[X_{0:n}] = \H\left[ \sum w_i X_i \right] - \sum \left( w_i \H[X_i] \right)\]
In [1]: from dit.divergences import jensen_shannon_divergence

In [2]: X = dit.ScalarDistribution(['red', 'blue'], [1/2, 1/2])

In [3]: Y = dit.ScalarDistribution(['blue', 'green'], [1/2, 1/2])

In [4]: jensen_shannon_divergence([X, Y])
Out[4]: 0.5

In [5]: jensen_shannon_divergence([X, Y], [3/4, 1/4])
Out[5]: 0.40563906222956647

In [6]: Z = dit.ScalarDistribution(['blue', 'yellow'], [1/2, 1/2])

In [7]: jensen_shannon_divergence([X, Y, Z])
Out[7]: 0.79248125036057782

In [8]: jensen_shannon_divergence([X, Y, Z], [1/2, 1/4, 1/4])
Out[8]: 0.75
Derivation

Where does this equation come from? Consider Jensen’s inequality:

\[\Psi \left( \mathbb{E}(x) \right) \geq \mathbb{E} \left( \Psi(x) \right)\]

where \(\Psi\) is a concave function. If we consider the difference between the left and right sides, we find:

\[\Psi \left( \mathbb{E}(x) \right) - \mathbb{E} \left( \Psi(x) \right) \geq 0\]

If we make that concave function \(\Psi\) the Shannon entropy \(\H\), we get the Jensen-Shannon divergence: Jensen from Jensen’s inequality, and Shannon from the use of the Shannon entropy.

Note

Some people look at the Jensen-Rényi divergence (where \(\Psi\) is the Rényi Entropy) and the Jensen-Tsallis divergence (where \(\Psi\) is the Tsallis Entropy).

Metric

The square root of the Jensen-Shannon divergence, \(\sqrt{\JSD}\), is a true metric between distributions.

Relationship to the Other Measures

The Jensen-Shannon divergence can be derived from other, more well known information measures; notably the Kullback-Leibler divergence and the mutual information.

Kullback-Leibler Divergence

The Jensen-Shannon divergence is the average Kullback-Leibler divergence of \(X\) and \(Y\) from their mixture distribution, \(M\):

\[\begin{split}\JSD[X || Y] &= \frac{1}{2} \left( \DKL[X || M] + \DKL[Y || M] \right) \\ M &= \frac{X + Y}{2}\end{split}\]
Mutual Information
\[\JSD[X || Y] = \I[Z:M]\]

where \(M\) is the mixture distribution as before, and \(Z\) is an indicator variable over \(X\) and \(Y\). In essence, if \(X\) and \(Y\) are each an urn containing colored balls, and I randomly select one of the urns and draw a ball from it, then the Jensen-Shannon divergence is the mutual information between which urn I drew the ball from and the color of the ball drawn.
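
To illustrate this view, we can build the indicator-plus-color joint for the two urns used above and compare it to the Jensen-Shannon divergence. This is a sketch; the joint distribution below is constructed by hand for the urn picture:

# Sketch: JSD[X || Y] equals I[Z : color] for the indicator/color joint.
import dit
from dit.divergences import jensen_shannon_divergence

X = dit.ScalarDistribution(['red', 'blue'], [1/2, 1/2])
Y = dit.ScalarDistribution(['blue', 'green'], [1/2, 1/2])

# Joint over (urn, color) with the urn chosen uniformly at random.
joint = dit.Distribution([('X', 'red'), ('X', 'blue'), ('Y', 'blue'), ('Y', 'green')],
                         [1/4, 1/4, 1/4, 1/4])

i_zm = dit.shannon.mutual_information(joint, [0], [1])
jsd = jensen_shannon_divergence([X, Y])
assert abs(i_zm - jsd) < 1e-10   # both equal 0.5 bits here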

API
jensen_shannon_divergence(dists, weights=None)

The Jensen-Shannon Divergence: H(sum(w_i*P_i)) - sum(w_i*H(P_i)).

The square root of the Jensen-Shannon divergence is a distance metric.

Parameters:
  • dists ([Distribution]) – The distributions, P_i, to take the Jensen-Shannon Divergence of.
  • weights ([float], None) – The weights, w_i, to give the distributions. If None, the weights are assumed to be uniform.
Returns:

jsd – The Jensen-Shannon Divergence

Return type:

float

Raises:
  • ditException – Raised if dists and weights have unequal lengths.
  • InvalidNormalization – Raised if the weights do not sum to unity.
  • InvalidProbability – Raised if the weights are not valid probabilities.

While the cross entropy and the Kullback-Leibler divergence are not true metrics, the square root of the Jensen-Shannon divergence is.

Information Profiles

There are several ways to decompose the information contained in a joint distribution. Here, we will demonstrate their behavior using four examples drawn from [ASBY14]:

In [1]: from dit.profiles import *

In [2]: ex1 = dit.Distribution(['000', '001', '010', '011', '100', '101', '110', '111'], [1/8]*8)

In [3]: ex2 = dit.Distribution(['000', '111'], [1/2]*2)

In [4]: ex3 = dit.Distribution(['000', '001', '110', '111'], [1/4]*4)

In [5]: ex4 = dit.Distribution(['000', '011', '101', '110'], [1/4]*4)

The I-diagrams for these four examples can be computed as follows:

In [6]: from dit.algorithms import ShannonPartition

In [7]: print(ShannonPartition(ex1))
+----------+--------+
| measure  |  bits  |
+----------+--------+
| H[2|0,1] |  1.000 |
| H[1|0,2] |  1.000 |
| H[0|1,2] |  1.000 |
| I[1:2|0] |  0.000 |
| I[0:1|2] |  0.000 |
| I[0:2|1] |  0.000 |
| I[0:1:2] |  0.000 |
+----------+--------+

In [8]: print(ShannonPartition(ex2))
+----------+--------+
| measure  |  bits  |
+----------+--------+
| H[2|0,1] |  0.000 |
| H[1|0,2] |  0.000 |
| H[0|1,2] |  0.000 |
| I[1:2|0] |  0.000 |
| I[0:1|2] |  0.000 |
| I[0:2|1] |  0.000 |
| I[0:1:2] |  1.000 |
+----------+--------+

In [9]: print(ShannonPartition(ex3))
+----------+--------+
| measure  |  bits  |
+----------+--------+
| H[2|0,1] |  1.000 |
| H[1|0,2] |  0.000 |
| H[0|1,2] |  0.000 |
| I[1:2|0] |  0.000 |
| I[0:1|2] |  1.000 |
| I[0:2|1] |  0.000 |
| I[0:1:2] |  0.000 |
+----------+--------+

In [10]: print(ShannonPartition(ex4))
+----------+--------+
| measure  |  bits  |
+----------+--------+
| H[2|0,1] |  0.000 |
| H[1|0,2] |  0.000 |
| H[0|1,2] |  0.000 |
| I[1:2|0] |  1.000 |
| I[0:1|2] |  1.000 |
| I[0:2|1] |  1.000 |
| I[0:1:2] | -1.000 |
+----------+--------+

Complexity Profile

The complexity profile is simply the amount of information at scale \(\geq k\) of each “layer” of the I-diagram [BY04].

Consider example 1, which contains three independent bits. Each of these bits is in the outermost “layer” of the I-diagram, and so all of the information in the complexity profile is at layer 1:

In [11]: ComplexityProfile(ex1).draw()
The complexity profile for example 1

Whereas in example 2, all the information is in the center, and so each scale of the complexity profile picks up that one bit:

In [12]: ComplexityProfile(ex2).draw()
The complexity profile for example 2

Both bits in example 3 are at a scale of at least 1, but only the shared bit persists to scale 2:

In [13]: ComplexityProfile(ex3).draw()
The complexity profile for example 3

Finally, example 4 (where each variable is the exclusive or of the other two):

In [14]: ComplexityProfile(ex4).draw()
The complexity profile for example 4

Marginal Utility of Information

The marginal utility of information (MUI) [ASBY14] takes a different approach. It asks, given an amount of information \(\I[d : \{X\}] = y\), what is the maximum amount of information one can extract using an auxiliary variable \(d\), as measured by the sum of the pairwise mutual informations, \(\sum \I[d : X_i]\). The MUI is then the rate of this maximum as a function of \(y\).

For the first example, each bit is independent and so basically must be extracted independently. Thus, as one increases \(y\) the maximum amount extracted grows equally:

In [15]: MUIProfile(ex1).draw()
The MUI profile for example 1

In the second example, there is only one bit total to be extracted, but it is shared by each pairwise mutual information. Therefore, for each increase in \(y\) we get a threefold increase in the amount extracted:

In [16]: MUIProfile(ex2).draw()
The MUI profile for example 2

For the third example, the first bit of \(y\) can be pulled from the shared bit, but after that one must pull from the independent bit, so we see a step in the MUI profile:

In [17]: MUIProfile(ex3).draw()
The MUI profile for example 3

Lastly, the xor example:

In [18]: MUIProfile(ex4).draw()
The MUI profile for example 4

Schneidman Profile

Also known as the connected information or network information, the Schneidman profile exposes how much information is learned about the distribution when considering \(k\)-way dependencies [SSB+03]. In all the following examples, each individual marginal is already uniformly distributed, and so the connected information at scale 1 is 0.

In the first example, all the random variables are independent already, so fixing marginals above \(k=1\) does not result in any change to the inferred distribution:

In [19]: SchneidmanProfile(ex1).draw()
The Schneidman profile for example 1

In the second example, by learning the pairwise marginals, we reduce the entropy of the distribution by two bits (from three independent bits, to one giant bit):

In [20]: SchneidmanProfile(ex2).draw()
The Schneidman profile for example 2

For the third example, learning pairwise marginals only reduces the entropy by one bit:

In [21]: SchneidmanProfile(ex3).draw()
The Schneidman profile for example 3

And for the xor, all bits appear independent until fixing the three-way marginals at which point one bit about the distribution is learned:

In [22]: SchneidmanProfile(ex4).draw()
The Schneidman profile for example 4

References

[AP12]Samer A Abdallah and Mark D Plumbley. A measure of statistical complexity based on predictive information with application to finite spin systems. Physics Letters A, 376(4):275–281, January 2012. doi:10.1016/j.physleta.2011.10.066.
[ASBY14]B. Allen, B. C. Stacey, and Y. Bar-Yam. An information-theoretic formalism for multiscale structure in complex systems. arXiv preprint arXiv:1409.4708, 2014.
[BY04]Y. Bar-Yam. Multiscale complexity/entropy. Advances in Complex Systems, 7(01):47–63, 2004.
[Bel03]Anthony J Bell. The Co-Information Lattice. Springer, New York, ica 2003 edition, 2003.
[Cal02]Cristian Calude. Information and Randomness: An Algorithmic Perspective. Springer-Verlag New York, Inc., Secaucus, NJ, USA, 2nd edition, 2002. ISBN 3540434666.
[CT06]Thomas M. Cover and Joy A. Thomas. Elements of Information Theory. Wiley-Interscience, New York, second edition, 2006. ISBN 0471241954.
[GCJ+13]Virgil Griffith, Edwin KP Chong, Ryan G James, Christopher J Ellison, and James P Crutchfield. Intersection information based on common randomness. arXiv preprint arXiv:1310.1538, 2013.
[GacsKorner73]Peter Gács and János Körner. Common information is far less than mutual information. Problems of Control and Information Theory, 2(2):149–162, 1973.
[Han80]Te Sun Han. Multiple mutual informations and multiple interactions in frequency data. Information and Control, 46(1):26–45, July 1980. doi:10.1016/S0019-9958(80)90478-7.
[LSAgro11]Frank Lad, Giuseppe Sanfilippo, and Gianna Agrò. Extropy: a complementary dual of entropy. arXiv preprint arXiv:1109.6440, 2011.
[Mac03]David JC MacKay. Information theory, inference and learning algorithms. Cambridge university press, 2003.
[RCVW04]Murali Rao, Yunmei Chen, Baba C Vemuri, and Fei Wang. Cumulative residual entropy: a new measure of information. Information Theory, IEEE Transactions on, 50(6):1220–1228, 2004.
[SSB+03]E. Schneidman, S. Still, M. J. Berry, W. Bialek, and others. Network information and connected correlations. Phys. Rev. Lett., 91(23):238701, 2003.
[TSE94]Giulio Tononi, Olaf Sporns, and Gerald M Edelman. A measure for brain complexity: relating functional segregation and integration in the nervous system. Proceedings of the National Academy of Sciences, 91(11):5033–5037, 1994.
[VW08]Sergio Verdu and Tsachy Weissman. The information lost in erasures. Information Theory, IEEE Transactions on, 54(11):5030–5058, 2008.
[Wat60]Satosi Watanabe. Information Theoretical Analysis of Multivariate Correlation. IBM Journal of Research and Development, 4(1):66–82, January 1960. doi:10.1147/rd.41.0066.
[Yeu08]Raymond W Yeung. Information theory and network coding. Springer, 2008.

Changelog

1.0.0 future

  • [Feature]: Basic functionality.
  • [Feature] #26: Add the Jensen-Shannon Divergence, a measure of distribution distance.
  • [Feature] #5: Add the oft-used Total Correlation.
  • [Feature] #10: Add the Co-Information.
  • [Feature] #30: Add the Gács-Körner Common Information.
  • [Feature] #7: Add the Residual Entropy.
  • [Feature] #6: Add the Binding Information.
  • [Feature] #2: Add the Extropy.
  • [Feature] #33: Add the Perplexity.
  • [Feature] #45: Add the Interaction Information.
  • [Feature] #47: Add the TSE Complexity.
  • [Feature] #16: Add the Channel Capacity.
  • [Feature] #1: Add the Cumulative Residual Entropy.
  • [Feature] #14: Add the creation of Minimal Sufficient Statistics.
  • [Feature] #35: Add the cross entropy.
  • [Feature] #34: Add the Kullback-Leibler Divergence.
  • [Feature] #3: Add the Renyi entropy.
  • [Feature] #4: Add the Tsallis entropy.
  • [Feature] #87: Add Renyi, Tsallis, Hellinger, and alpha divergences.
  • [Feature] #13: Add the Joint Minimal Sufficient Statistic.
  • [Feature] #95: Add the complexity profile.
  • [Feature] #96: Add the marginal utility of information.
  • [Support] #32: Use six for python 2/3 compatibility.
  • [Support] #40: Use an Enum for rv_mode.
  • [Support] #75: Enable coveralls to display detailed coverage information.
  • [Support] #77: Enable landscape.io to do static code analysis.
  • [Support] #58: Alias Dual Total Correlation as Binding Information.