dit: discrete information theory
dit is a Python package for discrete information theory.
For a quick tour, see the Quickstart. Otherwise, work your way through the various sections. Note that all code snippets in this documentation assume that the following lines of code have already been run:
In [1]: from __future__ import division # true division for Python 2.7
In [2]: import dit
In [3]: import numpy as np
Contents:
Until dit is available on PyPI, the easiest way to install is:
pip install git+https://github.com/dit/dit/#egg=dit
Alternatively, you can clone this repository, move into the newly created dit directory, and then install the package:
git clone https://github.com/dit/dit.git
cd dit
pip install .
The basic usage of dit consists of creating distributions, modifying them if need be, and then computing properties of those distributions.
First, we import:
In [1]: import dit
Suppose we have a really thick coin, one so thick that there is a reasonable chance of it landing on its edge. Here is how we might represent the coin in dit.
In [2]: d = dit.Distribution(['H', 'T', 'E'], [.4, .4, .2])
In [3]: print(d)
Class: Distribution
Alphabet: ('E', 'H', 'T') for all rvs
Base: linear
Outcome Class: str
Outcome Length: 1
RV Names: None
x p(x)
E 0.2
H 0.4
T 0.4
Calculate the probability of \(H\) and also of the combination: \(H~\mathbf{or}~T\).
In [4]: d['H']
Out[4]: 0.40000000000000002
In [5]: d.event_probability(['H','T'])
Out[5]: 0.80000000000000004
Calculate the Shannon entropy and extropy of the joint distribution.
In [6]: dit.shannon.entropy(d)
Out[6]: 1.5219280948873621
In [7]: dit.other.extropy(d)
Out[7]: 1.1419011889093373
Create a distribution representing the \(\mathbf{xor}\) logic function. Here, we have two inputs, \(X\) and \(Y\), and then an output \(Z = \mathbf{xor}(X,Y)\).
In [8]: import dit.example_dists
In [9]: d = dit.example_dists.Xor()
In [10]: d.set_rv_names(['X', 'Y', 'Z'])
In [11]: print(d)
Class: Distribution
Alphabet: ('0', '1') for all rvs
Base: linear
Outcome Class: str
Outcome Length: 3
RV Names: ('X', 'Y', 'Z')
x p(x)
000 0.25
011 0.25
101 0.25
110 0.25
Calculate the Shannon mutual informations \(\I[X:Z]\), \(\I[Y:Z]\), and \(\I[X,Y:Z]\).
In [12]: dit.shannon.mutual_information(d, ['X'], ['Z'])
Out[12]: 0.0
In [13]: dit.shannon.mutual_information(d, ['Y'], ['Z'])
Out[13]: 0.0
In [14]: dit.shannon.mutual_information(d, ['X', 'Y'], ['Z'])
Out[14]: 1.0
Calculate the marginal distribution \(P(X,Z)\). Then print its probabilities as fractions, showing the mask.
In [15]: d2 = d.marginal(['X', 'Z'])
In [16]: print(d2.to_string(show_mask=True, exact=True))
Class: Distribution
Alphabet: ('0', '1') for all rvs
Base: linear
Outcome Class: str
Outcome Length: 2 (mask: 3)
RV Names: ('X', 'Z')
x p(x)
0*0 1/4
0*1 1/4
1*0 1/4
1*1 1/4
Convert the distribution probabilities to log (base 3.5) probabilities, and access its probability mass function.
In [17]: d2.set_base(3.5)
In [18]: d2.pmf
Out[18]: array([-1.10658951, -1.10658951, -1.10658951, -1.10658951])
Draw 5 random samples from this distribution.
In [19]: d2.rand(5)
Out[19]: ['01', '10', '00', '01', '00']
Enjoy!
dit is a scientific tool, and so much of this documentation contains mathematical expressions. Here we describe the notation used throughout.
A random variable \(X\) consists of outcomes \(x\) from an alphabet \(\mathcal{X}\). As such, we write the entropy of a distribution as \(\H[X] = -\sum_{x \in \mathcal{X}} p(x) \log_2 p(x)\), where \(p(x)\) denotes the probability of the outcome \(x\) occurring.
Many distributions are joint distributions. In the absence of variable names, we index each random variable with a subscript. For example, a distribution over three variables is written \(X_0X_1X_2\). As a shorthand, we also denote those random variables as \(X_{0:3}\), meaning start with \(X_0\) and go through, but not including \(X_3\) — just like Python slice notation.
If we ever need to describe an infinitely long chain of variables we drop the index from the side that is infinite. So \(X_{:0} = \ldots X_{-3}X_{-2}X_{-1}\) and \(X_{0:} = X_0X_1X_2\ldots\). For an arbitrary set of indices \(A\), the corresponding collection of random variables is denoted \(X_A\). For example, if \(A = \{0,2,4\}\), then \(X_A = X_0 X_2 X_4\). The complement of \(A\) (with respect to some universal set) is denoted \(\bar{A}\).
Furthermore, we define \(0 \log_2 0 = 0\).
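As a small sketch of this definition, re-using the thick-coin distribution from the quickstart, the entropy can be computed directly from the pmf and compared against dit.shannon.entropy:
>>> import numpy as np
>>> d = dit.Distribution(['H', 'T', 'E'], [0.4, 0.4, 0.2])
>>> h = -np.sum(d.pmf * np.log2(d.pmf))   # this pmf has no zeros, so the 0 log 0 convention is not needed here
>>> np.isclose(h, dit.shannon.entropy(d))
True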
When there exists a function \(Y = f(X)\) we write \(X \imore Y\), meaning that \(X\) is informationally richer than \(Y\). Similarly, if \(f(Y) = X\) then we write \(X \iless Y\) and say that \(X\) is informationally poorer than \(Y\). If \(X \iless Y\) and \(X \imore Y\) then we write \(X \ieq Y\) and say that \(X\) is informationally equivalent to \(Y\). Of all the variables that are poorer than both \(X\) and \(Y\), there is a richest one. This variable is known as the meet of \(X\) and \(Y\) and is denoted \(X \meet Y\). By definition, for all \(Z\) such that \(Z \iless X\) and \(Z \iless Y\), we have \(Z \iless X \meet Y\). Similarly, of all variables richer than both \(X\) and \(Y\), there is a poorest. This variable is known as the join of \(X\) and \(Y\) and is denoted \(X \join Y\). The joint random variable \((X,Y)\) and the join are informationally equivalent: \((X,Y) \ieq X \join Y\).
Lastly, we use \(X \mss Y\) to denote the minimal sufficient statistic of \(X\) about the random variable \(Y\).
Here we describe how to create, modify, and manipulate distribution objects.
ScalarDistribution-specific functionality.
ScalarDistribution.__init__(outcomes, pmf=None, sample_space=None, base=None, prng=None, sort=True, sparse=True, trim=True, validate=True)
Initialize the distribution.
The primary method of constructing a distribution is by supplying both the outcomes and the probability mass function:
In [1]: from dit import Distribution
In [2]: outcomes = ['000', '011', '101', '110']
In [3]: pmf = [1/4]*4
In [4]: xor = Distribution(outcomes, pmf)
In [5]: print(xor)
Class: Distribution
Alphabet: ('0', '1') for all rvs
Base: linear
Outcome Class: str
Outcome Length: 3
RV Names: None
x p(x)
000 0.25
011 0.25
101 0.25
110 0.25
Another way to construct a distribution is by supplying a dictionary mapping outcomes to probabilities:
In [6]: outcomes_probs = {'000': 1/4, '011': 1/4, '101': 1/4, '110': 1/4}
In [7]: xor2 = Distribution(outcomes_probs)
In [8]: print(xor2)
Class: Distribution
Alphabet: ('0', '1') for all rvs
Base: linear
Outcome Class: str
Outcome Length: 3
RV Names: None
x p(x)
000 0.25
011 0.25
101 0.25
110 0.25
Distribution.__init__(outcomes, pmf=None, sample_space=None, base=None, prng=None, sort=True, sparse=True, trim=True, validate=True)
Initialize the distribution.
To verify that these two distributions are the same, we can use the is_approx_equal method:
In [9]: xor.is_approx_equal(xor2)
Out[9]: True
Distribution.is_approx_equal(other, rtol=None, atol=None)
Returns True if other is approximately equal to this distribution.
For two distributions to be equal, they must have the same sample space and must also agree on the probabilities of each outcome.
Notes
The distributions need not have the same length, but they must have the same base.
There are several operations possible on joint random variables.
Distribution.marginal(rvs, rv_mode=None)
Returns a marginal distribution.
Returns: d – A new joint distribution with the random variables in rvs kept and all others marginalized.
Return type: joint distribution
Distribution.condition_on(crvs, rvs=None, rv_mode=None, extract=False)
Returns distributions conditioned on random variables crvs. Optionally, rvs specifies which random variables should remain.
NOTE: Eventually this will return a conditional distribution.
Examples
First we build a distribution P(X,Y,Z) representing the XOR logic gate.
>>> pXYZ = dit.example_dists.Xor()
>>> pXYZ.set_rv_names('XYZ')
We can obtain the conditional distributions P(X,Z|Y) and the marginal of the conditioned variable P(Y) as follows:
>>> pY, pXZgY = pXYZ.condition_on('Y')
If we specify rvs='Z', then only ‘Z’ is kept and thus ‘X’ is marginalized out:
>>> pY, pZgY = pXYZ.condition_on('Y', rvs='Z')
We can condition on two random variables:
>>> pXY, pZgXY = pXYZ.condition_on('XY')
The equivalent call using indexes is:
>>> pXY, pZgXY = pXYZ.condition_on([0, 1], rv_mode='indexes')
We can construct the join of two random variables:
\[X \join Y = \min \{ V : X \iless V, Y \iless V \}\]
where \(\min\) is understood to be minimizing with respect to the entropy.
join(dist, rvs, rv_mode=None, int_outcomes=True)
Returns the distribution of the join of random variables defined by rvs.
Returns: d – The distribution of the join.
Return type: ScalarDistribution
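For instance, the join of the two inputs of the xor distribution is informationally equivalent to the pair \((X_0, X_1)\) and so carries two bits. This is a sketch only; it assumes join is importable from dit.algorithms.lattice, alongside the insert_meet helper imported from dit.algorithms later in these docs:
>>> from dit.algorithms.lattice import join   # import path assumed
>>> d = dit.example_dists.Xor()
>>> j = join(d, [[0], [1]])                   # join of the two inputs
>>> dit.shannon.entropy(j)                    # equals H[X0, X1]
2.0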
insert_join(dist, idx, rvs, rv_mode=None)
Returns a new distribution with the join inserted at index idx.
The join of the random variables in rvs is constructed and then inserted at index idx.
Returns: d – The new distribution with the join at index idx.
Return type: Distribution
We can construct the meet of two random variables:
\[X \meet Y = \max \{ V : V \iless X, V \iless Y \}\]
where \(\max\) is understood to be maximizing with respect to the entropy.
meet(dist, rvs, rv_mode=None, int_outcomes=True)
Returns the distribution of the meet of random variables defined by rvs.
Returns: d – The distribution of the meet.
Return type: ScalarDistribution
insert_meet(dist, idx, rvs, rv_mode=None)
Returns a new distribution with the meet inserted at index idx.
The meet of the random variables in rvs is constructed and then inserted at index idx.
Returns: d – The new distribution with the meet at index idx.
Return type: Distribution
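Conversely, the meet of two independent variables is degenerate and so carries no information. A sketch, again assuming meet is importable from dit.algorithms.lattice:
>>> from dit.algorithms.lattice import meet   # import path assumed
>>> d = dit.example_dists.Xor()
>>> m = meet(d, [[0], [2]])                   # an input and the output are independent in the xor distribution
>>> dit.shannon.entropy(m)
0.0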
This method constructs the minimal sufficient statistic of \(X\) about \(Y\), \(X \mss Y\):
\[X \mss Y = \min \{ V : V \iless X, \I[V : Y] = \I[X : Y] \}\]
Again, \(\min\) is understood to be over entropies.
mss(dist, rvs, about=None, rv_mode=None, int_outcomes=True)
Returns: d – The distribution of the minimal sufficient statistic.
Return type: ScalarDistribution
Examples
>>> d = Xor()
>>> print(mss(d, [0], [1, 2]))
Class: ScalarDistribution
Alphabet: (0, 1)
Base: linear
x p(x)
0 0.5
1 0.5
insert_mss(dist, idx, rvs, about=None, rv_mode=None)
Inserts the minimal sufficient statistic of rvs about about into dist at index idx.
Returns: d – The distribution dist modified to contain the minimal sufficient statistic.
Return type: Distribution
Examples
>>> d = Xor()
>>> print(insert_mss(d, -1, [0], [1, 2]))
Class: Distribution
Alphabet: ('0', '1') for all rvs
Base: linear
Outcome Class: str
Outcome Length: 4
RV Names: None
x p(x)
0000 0.25
0110 0.25
1011 0.25
1101 0.25
dit supports many information measures, ranging from ones as standard as the Shannon entropy to ones as exotic as the Gács-Körner common information (with even more esoteric measures coming soon!). We organize these quantities into the following groups.
We first have the Shannon-like measures. These quantities are based on sums and differences of entropies, conditional entropies, or mutual informations of random variables:
The information on this page is drawn from the fantastic textbook Elements of Information Theory by Cover and Thomas [CT06]. Other good choices are Information Theory, Inference and Learning Algorithms by MacKay [Mac03] and Information Theory and Network Coding by Yeung [Yeu08].
The entropy measures how much information is in a random variable \(X\).
What do we mean by “how much information”? Basically, we mean the average number of yes-no questions one would have to ask to determine an outcome from the distribution. In the simplest case, consider a sure thing:
In [1]: d = dit.Distribution(['H'], [1])
In [2]: dit.shannon.entropy(d)
Out[2]: 0.0
So since we know that the outcome from our distribution will always be H, we have to ask zero questions to figure that out. If however we have a fair coin:
In [3]: d = dit.Distribution(['H', 'T'], [1/2, 1/2])
In [4]: dit.shannon.entropy(d)
Out[4]: 1.0
The entropy tells us that we must ask one question to determine whether an H or T was the outcome of the coin flip. Now what if there are three outcomes? Let’s consider the following situation:
In [5]: d = dit.Distribution(['A', 'B', 'C'], [1/2, 1/4, 1/4])
In [6]: dit.shannon.entropy(d)
Out[6]: 1.5
Here we find that the entropy is 1.5 bits. How do we ask one and a half questions on average? Well, if our first question is “was it A?” and it is true, then we are done, and that occurs half the time. The other half of the time we need to ask a follow up question: “was it B?”. So half the time we need to ask one question, and the other half of the time we need to ask two questions. In other words, we need to ask 1.5 questions on average.
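As a small sketch of this bookkeeping (the question counts below are the hypothetical ones from the strategy just described), the expected number of questions matches the entropy:
>>> questions = {'A': 1, 'B': 2, 'C': 2}   # hypothetical question counts for the strategy above
>>> probs = {'A': 1/2, 'B': 1/4, 'C': 1/4}
>>> sum(probs[x] * questions[x] for x in 'ABC')
1.5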
The entropy of multiple variables is computed in a similar manner:
Its intuition is also the same: the average number of binary questions required to identify a joint event from the distribution.
entropy(dist, rvs=None, rv_mode=None)
Returns the entropy H[X] over the random variables in rvs.
If the distribution represents linear probabilities, then the entropy is calculated with units of ‘bits’ (base-2). Otherwise, the entropy is calculated in whatever base that matches the distribution’s pmf.
Returns: H – The entropy of the distribution.
The conditional entropy is the amount of information in variable \(X\) beyond that which is in variable \(Y\):
\[\H[X|Y] = -\sum_{x \in \mathcal{X}, y \in \mathcal{Y}} p(x, y) \log_2 p(x|y)\]
As a simple example, consider two identical variables:
In [7]: d = dit.Distribution(['HH', 'TT'], [1/2, 1/2])
In [8]: dit.shannon.conditional_entropy(d, [0], [1])
Out[8]: 0.0
We see that knowing the second variable tells us everything about the first, leaving zero entropy. On the other end of the spectrum, two independent variables:
In [9]: d = dit.Distribution(['HH', 'HT', 'TH', 'TT'], [1/4]*4)
In [10]: dit.shannon.conditional_entropy(d, [0], [1])
Out[10]: 1.0
Here, the second variable tells us nothing about the first so we are left with the one bit of information a coin flip has.
conditional_entropy(dist, rvs_X, rvs_Y, rv_mode=None)
Returns the conditional entropy H[X|Y].
If the distribution represents linear probabilities, then the entropy is calculated with units of ‘bits’ (base-2).
Returns: H_XgY – The conditional entropy H[X|Y].
The mutual information is the amount of information shared by \(X\) and \(Y\):
\[\I[X:Y] = \H[X] + \H[Y] - \H[X, Y]\]
The mutual information is symmetric:
\[\I[X:Y] = \I[Y:X]\]
Meaning that the information that \(X\) carries about \(Y\) is equal to the information that \(Y\) carries about \(X\). The entropy of \(X\) can be decomposed into the information it shares with \(Y\) and the information it doesn’t:
\[\H[X] = \I[X:Y] + \H[X|Y]\]
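A quick numerical check of this decomposition (the correlated two-bit distribution below is just an arbitrary illustration):
>>> d = dit.Distribution(['00', '01', '10', '11'], [3/8, 1/8, 1/8, 3/8])
>>> H_X = dit.shannon.entropy(d, [0])
>>> I_XY = dit.shannon.mutual_information(d, [0], [1])
>>> H_XgY = dit.shannon.conditional_entropy(d, [0], [1])
>>> abs(H_X - (I_XY + H_XgY)) < 1e-10
True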
See also
The mutual information generalized to the multivariate case in three different ways:
mutual_information(dist, rvs_X, rvs_Y, rv_mode=None)
Returns the mutual information I[X:Y].
If the distribution represents linear probabilities, then the entropy is calculated with units of ‘bits’ (base-2).
Returns: I – The mutual information I[X:Y].
It has been shown that there is a correspondence between set-theoretic measures and information-theoretic measures. The entropy is equivalent to set cardinality, mutual information to set intersection, and conditional entropy to set difference. Because of this we can use Venn-like diagrams to represent the information in and shared between random variables. These diagrams are called information diagrams or i-diagrams for short.
This first image pictographically shades the area of the i-diagram which contains the information corresponding to \(\H[X_0]\).
Similarly, this one shades the information corresponding to \(\H[X_1]\).
This image shades the information corresponding to \(\H[X_0, X_1]\). Notice that it is the union of the prior two, and not their sum (e.g. that overlap region is not double-counted).
Next, the conditional entropy of \(X_0\) conditioned on \(X_1\), \(\H[X_0|X_1]\), is displayed. It consists of the area contained in the \(X_0\) circle but not contained in \(X_1\) circle.
In the same vein, here the conditional entropy \(\H[X_1|X_0]\) is shaded.
Finally, the mutual information between \(X_0\) and \(X_1\), \(I[X_0:X_1]\) is drawn. It is the region where the two circles overlap.
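These atoms can also be listed numerically with ShannonPartition (used again in the information-profiles section below); a small sketch for a correlated pair of bits:
>>> from dit.algorithms import ShannonPartition
>>> d = dit.Distribution(['00', '01', '10', '11'], [3/8, 1/8, 1/8, 3/8])
>>> print(ShannonPartition(d))   # lists the atoms H[0|1], H[1|0], and I[0:1] with their sizes in bits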
Multivariate measures of information generally attempt to capture some global property of a joint distribution. For example, they might attempt to quantify how much information is shared among the random variables, or quantify how “non-independent” the joint distribution is.
The first batch are all linear combinations of atoms from an I-diagram, and are therefore computable using only the entropy of subsets of the distribution.
The entropy measures the total amount of information contained in a set of random variables, \(X_{0:n}\), potentially excluding the information contained in others, \(Y_{0:m}\).
Let’s consider two coins that are interdependent: the first coin flips fairly, and if the first comes up heads, the other is fair, but if the first comes up tails the other is certainly tails:
In [1]: d = dit.Distribution(['HH', 'HT', 'TT'], [1/4, 1/4, 1/2])
We would expect that the entropy of the second coin conditioned on the first coin would be \(0.5\) bits, and sure enough that is what we find:
In [2]: from dit.multivariate import entropy
In [3]: entropy(d, [1], [0])
Out[3]: 0.49999999999999989
And since the first coin is fair, we would expect it to have an entropy of \(1\) bit:
In [4]: entropy(d, [0])
Out[4]: 1.0
Taken together, we would then expect the joint entropy to be \(1.5\) bits:
In [5]: entropy(d)
Out[5]: 1.5
Below we have a pictorial representation of the joint entropy for both 2 and 3 variable joint distributions.
entropy(dist, rvs=None, crvs=None, rv_mode=None)
Compute the conditional joint entropy.
Returns: H – The entropy.
Examples
Let’s construct a 3-variable distribution for the XOR logic gate and name the random variables X, Y, and Z.
>>> d = dit.example_dists.Xor()
>>> d.set_rv_names(['X', 'Y', 'Z'])
The joint entropy of H[X,Y,Z] is:
>>> dit.multivariate.entropy(d, 'XYZ')
2.0
We can do this using random variable indexes too.
>>> dit.multivariate.entropy(d, [0,1,2], rv_mode='indexes')
2.0
The joint entropy H[X,Z] is given by:
>>> dit.multivariate.entropy(d, 'XZ')
1.0
Conditional entropy can be calculated by passing in the conditional random variables. The conditional entropy H[Y|X] is:
>>> dit.multivariate.entropy(d, 'Y', 'X')
1.0
The co-information [Bel03] is one generalization of the mutual information to multiple variables. The co-information quantifies the amount of information that all variables participate in. It is defined via an inclusion/exclusion sum:
\[\I[X_{0:n}] = -\sum_{y \subseteq \{0, \ldots, n-1\}} (-1)^{|y|} \H[X_y]\]
It is clear that the co-information measures the “center-most” atom of the diagram only, which is the only atom to which every variable contributes. To exemplify this, consider “giant bit” distributions:
In [1]: from dit import Distribution as D
In [2]: from dit.multivariate import coinformation as I
In [3]: [ I(D(['0'*n, '1'*n], [1/2, 1/2])) for n in range(2, 6) ]
Out[3]: [1.0, 1.0, 1.0, 1.0]
This verifies intuition that the entire one bit of the distribution’s entropy is condensed in a single atom. One notable property of the co-information is that for \(n \geq 3\) it can be negative. For example:
In [4]: from dit.example_dists import Xor
In [5]: d = Xor()
In [6]: I(d)
Out[6]: -1.0
Based on these two examples one might get the impression that the co-information is positive for “redundant” distributions and negative for “synergistic” distributions. This however is not true — consider the four-variable parity distribution:
In [7]: from dit.example_dists import n_mod_m
In [8]: d = n_mod_m(4, 2)
In [9]: I(d)
Out[9]: 1.0
Meaning that the co-information is positive for both the most redundant distribution, the giant bit, and the most synergistic, the parity. Therefore the coinformation can not be used to measure redundancy or synergy.
Note
Correctly measuring redundancy and synergy is an ongoing problem. See [GCJ+13] and references therein for the current status of the problem.
The co-information can be visualized on an i-diagram as below, where only the centermost atom is shaded:
coinformation(dist, rvs=None, crvs=None, rv_mode=None)
Calculates the coinformation.
Returns: I – The coinformation.
Examples
Let’s construct a 3-variable distribution for the XOR logic gate and name the random variables X, Y, and Z.
>>> d = dit.example_dists.Xor()
>>> d.set_rv_names(['X', 'Y', 'Z'])
To calculate coinformations, recall that rvs specifies which groups of random variables are involved. For example, the 3-way mutual information I[X:Y:Z] is calculated as:
>>> dit.multivariate.coinformation(d, ['X', 'Y', 'Z'])
-1.0
It is a quirk of strings that each element of a string is also an iterable. So an equivalent way to calculate the 3-way mutual information I[X:Y:Z] is:
>>> dit.multivariate.coinformation(d, 'XYZ')
-1.0
The reason this works is that list(‘XYZ’) == [‘X’, ‘Y’, ‘Z’]. If we want to use random variable indexes, we need to have explicit groupings:
>>> dit.multivariate.coinformation(d, [[0], [1], [2]], rv_mode='indexes')
-1.0
To calculate the mutual information I[X, Y : Z], we use explicit groups:
>>> dit.multivariate.coinformation(d, ['XY', 'Z'])
1.0
Using indexes, this looks like:
>>> dit.multivariate.coinformation(d, [[0, 1], [2]], rv_mode='indexes')
1.0
The mutual information I[X:Z] is given by:
>>> dit.multivariate.coinformation(d, 'XZ')
0.0
Equivalently,
>>> dit.multivariate.coinformation(d, ['X', 'Z'])
0.0
Using indexes, this becomes:
>>> dit.multivariate.coinformation(d, [[0], [2]])
0.0
Conditional mutual informations can be calculated by passing in the conditional random variables. The conditional mutual information I[X:Y|Z] is:
>>> dit.multivariate.coinformation(d, 'XY', 'Z')
1.0
Using indexes, this becomes:
>>> rvs = [[0], [1]]
>>> crvs = [[2]] # broken
>>> dit.multivariate.coinformation(d, rvs, crvs, rv_mode='indexes')
1.0
For the conditional random variables, groupings have no effect, so you can also obtain this as:
>>> rvs = [[0], [1]]
>>> crvs = [2]
>>> dit.multivariate.coinformation(d, rvs, crvs, rv_mode='indexes')
1.0
Finally, note that entropy can also be calculated. The entropy H[Z|XY] is obtained as:
>>> rvs = [[2]]
>>> crvs = [[0], [1]] # broken
>>> dit.multivariate.coinformation(d, rvs, crvs, rv_mode='indexes')
0.0
>>> crvs = [[0, 1]] # broken
>>> dit.multivariate.coinformation(d, rvs, crvs, rv_mode='indexes')
0.0
>>> crvs = [0, 1]
>>> dit.multivariate.coinformation(d, rvs, crvs, rv_mode='indexes')
0.0
>>> rvs = 'Z'
>>> crvs = 'XY'
>>> dit.multivariate.coinformation(d, rvs, crvs, rv_mode='indexes')
0.0
Note that [[0], [1]] says to condition on two groups. But conditioning is a flat operation and doesn’t respect the groups, so it is equal to a single group of 2 random variables: [[0, 1]]. With random variable names, ‘XY’ is acceptable because list(‘XY’) == [‘X’, ‘Y’], which specifies two singleton groups. By the previous argument, this will be treated the same as [‘XY’].
The interaction information is equal in magnitude to the co-information, but has the opposite sign when taken over an odd number of variables:
\[\II[X_{0:n}] = (-1)^{n} \cdot \I[X_{0:n}]\]
Interaction information was first studied in the 3-variable case which, for \(X_{0:3} = X_0X_1X_2\), takes the following form:
\[\II[X_0 : X_1 : X_2] = \I[X_0 : X_1 | X_2] - \I[X_0 : X_1]\]
The extension to \(n > 3\) proceeds recursively. For example,
\[\II[X_0 : X_1 : X_2 : X_3] = \II[X_0 : X_1 : X_2 | X_3] - \II[X_0 : X_1 : X_2]\]
See also
For more information, see Co-Information.
interaction_information(dist, rvs=None, crvs=None, rv_mode=None)
Calculates the interaction information.
Returns: II – The interaction information.
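A hedged example (assuming interaction_information is importable from dit.multivariate like the other measures documented here): for the three-variable xor distribution, the interaction information is the negated co-information.
>>> from dit.multivariate import interaction_information, coinformation
>>> d = dit.example_dists.Xor()
>>> coinformation(d)
-1.0
>>> interaction_information(d)   # odd number of variables: opposite sign
1.0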
The total correlation [Wat60], denoted \(\T\), also known as the multi-information or integration, is one generalization of the mutual information. It is defined as the amount of information each individual variable carries above and beyond the joint entropy, e.g. the difference between the whole and the sum of its parts:
\[\T[X_{0:n}] = \sum_{i} \H[X_i] - \H[X_{0:n}]\]
Two nice features of the total correlation are that it is non-negative and that it is zero if and only if the random variables \(X_{0:n}\) are all independent. Some baseline behavior is good to note also. First its behavior when applied to “giant bit” distributions:
In [1]: from dit import Distribution as D
In [2]: from dit.multivariate import total_correlation as T
In [3]: [ T(D(['0'*n, '1'*n], [0.5, 0.5])) for n in range(2, 6) ]
Out[3]: [1.0, 2.0, 3.0, 4.0]
So we see that for giant bit distributions, the total correlation is equal to one less than the number of variables. The second type of distribution to consider is general parity distributions:
In [4]: from dit.example_dists import n_mod_m
In [5]: [ T(n_mod_m(n, 2)) for n in range(3, 6) ]
Out[5]: [1.0, 1.0, 1.0]
In [6]: [ T(n_mod_m(3, m)) for m in range(2, 5) ]
Out[6]: [1.0, 1.5849625007211565, 2.0]
Here we see that the total correlation is equal to \(\log_2{m}\) regardless of \(n\).
The total correlation follows a nice decomposition rule. Given two sets of (not necessarily independent) random variables, \(A\) and \(B\), the total correlation of \(A \cup B\) is:
\[\T[A \cup B] = \T[A] + \T[B] + \I[A : B]\]
In [7]: from dit.multivariate import coinformation as I
In [8]: d = n_mod_m(4, 3)
In [9]: T(d) == T(d, [[0], [1]]) + T(d, [[2], [3]]) + I(d, [[0, 1], [2, 3]])
Out[9]: True
The total correlation consists of all information that is shared among the variables, and weights each piece according to how many variables it is shared among.
total_correlation(dist, rvs=None, crvs=None, rv_mode=None)
Returns: T – The total correlation.
Examples
>>> d = dit.example_dists.Xor()
>>> dit.multivariate.total_correlation(d)
1.0
>>> dit.multivariate.total_correlation(d, rvs=[[0], [1]])
0.0
Raises: ditException – Raised if dist is not a joint distribution or if rvs or crvs contain non-existent random variables.
The binding information [AP12], or dual total correlation, is yet another generalization of the mutual information. It is the amount of information that is shared among the variables. It is defined as:
\[\B[X_{0:n}] = \H[X_{0:n}] - \sum_{i} \H[X_i | X_{\{0, \ldots, n-1\} \setminus \{i\}}]\]
In a sense the binding information captures the same information that the Total Correlation does, in that both measures are zero or non-zero together. However, the two measures take on very different quantitative values for different distributions. By way of example, the type of distribution that maximizes the total correlation is a “giant bit”:
In [1]: from dit.multivariate import binding_information, total_correlation
In [2]: d = dit.Distribution(['000', '111'], [1/2, 1/2])
In [3]: total_correlation(d)
Out[3]: 2.0
In [4]: binding_information(d)
Out[4]: 1.0
For the same distribution, the binding information takes on a relatively low value. On the other hand, the type of distribution that maximizes the binding information is a “parity” distribution:
In [5]: from dit.example_dists import n_mod_m
In [6]: d = n_mod_m(3, 2)
In [7]: total_correlation(d)
Out[7]: 1.0
In [8]: binding_information(d)
Out[8]: 2.0
The binding information obeys particular bounds related to both the Entropy and the Total Correlation:
The binding information, as seen below, consists of all the information shared among the variables, weighting each piece equally.
binding_information(dist, rvs=None, crvs=None, rv_mode=None)
Returns: B – The binding information.
The residual entropy, or erasure entropy, is a dual to the Binding Information. It is dual in the sense that together they form the entropy of the distribution.
The residual entropy was originally proposed in [VW08] to quantify the information lost by sporadic erasures in a channel. The idea here is that only the information uncorrelated with other random variables is lost if that variable is erased.
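A sketch of the duality stated above, checked numerically on the three-variable parity distribution (all functions used here are documented in this section):
>>> from dit.multivariate import binding_information, residual_entropy, entropy
>>> from dit.example_dists import n_mod_m
>>> d = n_mod_m(3, 2)
>>> abs(residual_entropy(d) + binding_information(d) - entropy(d)) < 1e-10
True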
If a joint distribution consists of independent random variables, the residual entropy is equal to the Entropy:
In [1]: from dit.multivariate import entropy, residual_entropy
In [2]: d = dit.uniform_distribution(3, 2)
In [3]: entropy(d) == residual_entropy(d)
Out[3]: True
Another simple example is a distribution where one random variable is independent of the others:
In [4]: d = dit.uniform(['000', '001', '110', '111'])
In [5]: residual_entropy(d)
Out[5]: 1.0
If we ask for the residual entropy of only the latter two random variables, the middle one is now independent of the others and so the residual entropy grows:
In [6]: residual_entropy(d, [[1], [2]])
Out[6]: 2.0
The residual entropy consists of all the unshared information in the distribution. That is, it is the information in each variable not overlapping with any other.
residual_entropy(dist, rvs=None, crvs=None, rv_mode=None)
Returns: R – The residual entropy.
The Tononi-Sporns-Edelman (TSE) complexity [TSE94] is a complexity measure for distributions. It is designed so that it is maximized by distributions where small subsets of random variables are loosely coupled but the overall distribution is tightly coupled.
Two distributions which might be considered tightly coupled are the “giant bit” and the “parity” distributions:
In [1]: from dit.multivariate import tse_complexity
In [2]: from dit.example_dists import Xor
In [3]: d1 = Xor()
In [4]: tse_complexity(d1)
Out[4]: 1.0
In [5]: d2 = dit.Distribution(['000', '111'], [1/2, 1/2])
In [6]: tse_complexity(d2)
Out[6]: 1.0
The TSE Complexity assigns them both a value of \(1.0\) bits, which is the maximal value the TSE takes over trivariate, binary alphabet distributions.
tse_complexity(dist, rvs=None, crvs=None, rv_mode=None)
Calculates the TSE complexity.
Returns: TSE – The TSE complexity.
The next set of measures are based on the definition of an auxiliary random variable, generally one which in some way captures the information common to the random variables.
The Gács-Körner common information [GacsKorner73] takes a very direct approach to the idea of common information. It extracts a random variable that is contained within each of the random variables under consideration.
Let’s play a game. We have an n-variable joint distribution, and one player for each variable. Each player is given the probability mass function of the joint distribution and then isolated from the others. Each round of the game, a joint outcome is generated from the distribution and each player is told the symbol that their particular variable took. The goal of the game is for the players to simultaneously write the same symbol on a piece of paper, and for the entropy of the players’ symbols to be maximized. They must do this using only their knowledge of the joint random variable and the particular outcome of their marginal variable. The matching symbols produced by the players are called the common random variable and the entropy of that variable is the Gács-Körner common information, \(\K\).
Consider a joint distribution over \(X_0\) and \(X_1\). Given any particular outcome from that joint, we want a function \(f(X_0)\) and a function \(g(X_1)\) such that for every joint outcome \(x_0 x_1\), \(f(x_0) = g(x_1) = v\). Of all possible pairs of functions \(f(X_0) = g(X_1) = V\), there exists a “largest” one, and it is known as the common random variable. The entropy of that common random variable is the Gács-Körner common information:
\[\K[X_0 : X_1] = \max_{V \iless X_0,\, V \iless X_1} \H[V] = \H[X_0 \meet X_1]\]
As a canonical example, consider the following:
In [1]: from dit import Distribution as D
In [2]: from dit.multivariate import gk_common_information as K
In [3]: outcomes = ['00', '01', '10', '11', '22', '33']
In [4]: pmf = [1/8, 1/8, 1/8, 1/8, 1/4, 1/4]
In [5]: d = D(outcomes, pmf, sample_space=outcomes)
In [6]: K(d)
Out[6]: 1.5
Note
It is important that we set the sample_space argument. If it is None, the Cartesian product of each alphabet is used, and in such a case the meet will trivially be degenerate.
So, the Gács-Körner common information is 1.5 bits. But what is the common random variable?
In [7]: from dit.algorithms import insert_meet
In [8]: crv = insert_meet(d, -1, [[0],[1]])
In [9]: print(crv)
Class: Distribution
Alphabet: (('0', '1', '2', '3'), ('0', '1', '2', '3'), ('0', '1', '2'))
Base: linear
Outcome Class: str
Outcome Length: 3
RV Names: None
x p(x)
002 0.125
012 0.125
102 0.125
112 0.125
220 0.25
331 0.25
Looking at the third index of the outcomes, we see that the common random variable maps 2 to 0 and 3 to 1, maintaining the information from those values. When \(X_0\) or \(X_1\) are either 0 or 1, however, it maps them to 2. This is because \(f\) and \(g\) must act independently: if \(x_0\) is a 0 or a 1, there is no way to know if \(x_1\) is a 0 or a 1 and vice versa. Therefore we aggregate 0s and 1s into 2.
The Gács-Körner common information is the largest “circle” that entirely fits within the mutual information’s “football”:
The Gács-Körner common information satisfies an important inequality:
\[0 \leq \K[X_0 : X_1] \leq \I[X_0 : X_1]\]
One usage of the common information is as a measure of redundancy [GCJ+13]. Consider a function that takes two inputs, \(X_0\) and \(X_1\), and produces a single output \(Y\). The output can be influenced redundantly by both inputs, uniquely from either one, or together they can synergistically influence the output. Determining how to compute the amount of redundancy is an open problem, but one proposal is:
\[\I[X_0 \meet X_1 : Y]\]
Which can be visualized as this:
This quantity can be computed easily using dit:
In [10]: from dit.example_dists import RdnXor
In [11]: from dit.shannon import mutual_information as I
In [12]: d = RdnXor()
In [13]: d = dit.pruned_samplespace(d)
In [14]: d = insert_meet(d, -1, [[0],[1]])
In [15]: I(d, [3], [2])
Out[15]: 1.0
With an arbitrary number of variables, the Gács-Körner common information is defined similarly:
\[\K[X_{0:n}] = \max_{V \iless X_i \;\forall i} \H[V]\]
The common information is a monotonically decreasing function in the number of variables:
\[\K[X_{0:n}] \geq \K[X_{0:n+1}]\]
The multivariate common information follows a similar inequality as the two variable version:
Here, as above, the Gács-Körner common information among three variables is the largest “circle”, this time fitting in the vaguely triangular Co-Information region.
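A brief sketch of these multivariate values: the three-variable giant bit has one bit of common information, while the xor distribution, in which any pair of variables is independent, has none.
>>> from dit.multivariate import gk_common_information as K
>>> K(dit.Distribution(['000', '111'], [1/2, 1/2]))
1.0
>>> K(dit.example_dists.Xor())
0.0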
gk_common_information(dist, rvs=None, crvs=None, rv_mode=None)
Returns the Gacs-Korner common information K[X1:X2...] over the random variables in rvs.
Returns: K – The Gacs-Korner common information of the distribution.
This next group of measures can not be represented on information diagrams, and can not really be directly compared to the measures above:
Other measures of information. These are generally based around alternatives to the Shannon entropy proposed for a variety of reasons.
The cumulative residual entropy [RCVW04] is an alternative to the differential Shannon entropy. The differential entropy has many issues, including that it can be negative even for simple distributions such as the uniform distribution, and that if one takes discrete estimates that limit to the continuous distribution, the discrete entropy does not limit to the differential (continuous) entropy. The cumulative residual entropy also attempts to provide meaningful differences between numerically different random variables, such as a die labeled [1, 2, 3, 4, 5, 6] and one labeled [1, 2, 3, 4, 5, 100].
Note
The Cumulative Residual Entropy is unrelated to Residual Entropy.
In [1]: from dit.other import cumulative_residual_entropy
In [2]: d1 = dit.ScalarDistribution([1, 2, 3, 4, 5, 6], [1/6]*6)
In [3]: d2 = dit.ScalarDistribution([1, 2, 3, 4, 5, 100], [1/6]*6)
In [4]: cumulative_residual_entropy(d1)
Out[4]: 2.0683182557028439
In [5]: cumulative_residual_entropy(d2)
Out[5]: 22.672680046016705
The generalized form of the cumulative residual entropy integrates over the entire set of reals rather than just the positive ones:
In [6]: from dit.other import generalized_cumulative_residual_entropy
In [7]: generalized_cumulative_residual_entropy(d1)
Out[7]: 2.0683182557028439
In [8]: d3 = dit.ScalarDistribution([-2, -1, 0, 1, 2], [1/5]*5)
In [9]: cumulative_residual_entropy(d3)
Out[9]: 0.90656497547719606
In [10]: generalized_cumulative_residual_entropy(d3)
Out[10]: 1.6928786893420307
The conditional cumulative residual entropy \(\CRE[X|Y]\) is a distribution with the same probability mass function as \(Y\), and the outcome associated with \(p(y)\) is equal to the cumulative residual entropy over probabilities conditioned on \(Y = y\). In this sense the conditional cumulative residual entropy is more akin to a distribution over \(\H[X|Y=y]\) than the single scalar quantity \(\H[X|Y]\).
Conceptually the conditional generalized cumulative residual entropy is the same as the non-generalized form, but integrated over the entire real line rather than just the positive:
cumulative_residual_entropy(dist, extract=False)
The cumulative residual entropy is an alternative to the Shannon differential entropy with several desirable properties including non-negativity.
Returns: CREs – The cumulative residual entropy for each index.
Return type: ndarray
Examples
>>> d1 = ScalarDistribution([1, 2, 3, 4, 5, 6], [1/6]*6)
>>> d2 = ScalarDistribution([1, 2, 3, 4, 5, 100], [1/6]*6)
>>> cumulative_residual_entropy(d1)
2.0683182557028439
>>> cumulative_residual_entropy(d2)
22.672680046016705
generalized_cumulative_residual_entropy(dist, extract=False)
The generalized cumulative residual entropy is a generalized form of the cumulative residual entropy. Rather than integrating from 0 to infinity over the absolute value of the CDF, it integrates over the entire real line.
Returns: GCREs – The generalized cumulative residual entropy for each index.
Return type: ndarray
Examples
>>> generalized_cumulative_residual_entropy(uniform(-2, 3))
1.6928786893420307
>>> generalized_cumulative_residual_entropy(uniform(0, 5))
1.6928786893420307
conditional_cumulative_residual_entropy(dist, rv, crvs=None, rv_mode=None)
Returns the conditional cumulative residual entropy.
Returns: CCRE – The conditional cumulative residual entropy.
Return type: ScalarDistribution
Examples
>>> from itertools import product
>>> events = [ (a, b) for a, b, in product(range(5), range(5)) if a <= b ]
>>> probs = [ 1/(5-a)/5 for a, b in events ]
>>> d = Distribution(events, probs)
>>> print(conditional_cumulative_residual_entropy(d, 1, [0]))
Class: ScalarDistribution
Alphabet: (-0.0, 0.5, 0.91829583405448956, 1.3112781244591329, 1.6928786893420307)
Base: linear
x                p(x)
-0.0             0.2
0.5              0.2
0.918295834054   0.2
1.31127812446    0.2
1.69287868934    0.2
conditional_generalized_cumulative_residual_entropy(dist, rv, crvs=None, rv_mode=None)
Returns the conditional generalized cumulative residual entropy.
Returns: CCRE – The conditional generalized cumulative residual entropy.
Return type: ScalarDistribution
Examples
>>> from itertools import product
>>> events = [ (a-2, b-2) for a, b, in product(range(5), range(5)) if a <= b ]
>>> probs = [ 1/(3-a)/5 for a, b in events ]
>>> d = Distribution(events, probs)
>>> print(conditional_generalized_cumulative_residual_entropy(d, 1, [0]))
Class: ScalarDistribution
Alphabet: (-0.0, 0.5, 0.91829583405448956, 1.3112781244591329, 1.6928786893420307)
Base: linear
x                p(x)
-0.0             0.2
0.5              0.2
0.918295834054   0.2
1.31127812446    0.2
1.69287868934    0.2
The Rényi entropy is a spectrum of generalizations to the Shannon Entropy:
\[\H_{\alpha}[X] = \frac{1}{1 - \alpha} \log_2 \left( \sum_{x \in \mathcal{X}} p(x)^{\alpha} \right)\]
In [1]: from dit.other import renyi_entropy
In [2]: from dit.example_dists import binomial
In [3]: d = binomial(15, 0.4)
In [4]: renyi_entropy(d, 3)
Out[4]: 2.6611840717104625
For several values of \(\alpha\), the Rényi entropy takes on particular values.
When \(\alpha = 0\) the Rényi entropy becomes what is known as the Hartley entropy:
\[\H_{0}[X] = \log_2 |\mathcal{X}|\]
In [5]: renyi_entropy(d, 0)
Out[5]: 4.0
When \(\alpha = 1\) the Rényi entropy becomes the standard Shannon entropy:
\[\H_{1}[X] = \H[X]\]
In [6]: renyi_entropy(d, 1)
Out[6]: 2.9688513169509618
When \(\alpha = 2\), the Rényi entropy becomes what is known as the collision entropy:
\[\H_{2}[X] = -\log_2 p(X = Y)\]
where \(Y\) is an IID copy of \(X\). This is basically the surprisal of “rolling doubles”.
In [7]: renyi_entropy(d, 2)
Out[7]: 2.7607270851693615
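A hedged numerical check of the collision-entropy identity, recomputing it directly from the pmf:
>>> import numpy as np
>>> from dit.other import renyi_entropy
>>> from dit.example_dists import binomial
>>> d = binomial(15, 0.4)
>>> np.isclose(renyi_entropy(d, 2), -np.log2(np.sum(d.pmf ** 2)))
True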
Finally, when \(\alpha = \infty\) the Rényi entropy picks out the probability of the most-probable event:
\[\H_{\infty}[X] = -\log_2 \max_{x \in \mathcal{X}} p(x)\]
In [8]: renyi_entropy(d, np.inf)
Out[8]: 2.275104563096674
In general, the Rényi entropy is a monotonically decreasing function in \(\alpha\):
Further, the following inequality holds in the other direction:
renyi_entropy(dist, order, rvs=None, rv_mode=None)
Compute the Renyi entropy of order order.
Returns: H_a – The Renyi entropy.
The Tsallis entropy is a generalization of the Shannon (or Boltzmann-Gibbs) entropy to the case where entropy is nonextensive. It is given by:
\[S_{q}[X] = \frac{1 - \sum_{x \in \mathcal{X}} p(x)^{q}}{q - 1}\]
In [1]: from dit.other import tsallis_entropy
In [2]: from dit.example_dists import n_mod_m
In [3]: d = n_mod_m(4, 3)
In [4]: tsallis_entropy(d, 4)
Out[4]: 0.33331639824552489
One interesting property of the Tsallis entropy is the relationship between the joint Tsallis entropy of two independent systems and the Tsallis entropy of those subsystems:
\[S_{q}[X, Y] = S_{q}[X] + S_{q}[Y] + (1 - q) S_{q}[X] S_{q}[Y]\]
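A sketch of this pseudo-additivity for two independent bits (assuming, as with the other measures, that the rvs argument selects marginals):
>>> from dit.other import tsallis_entropy
>>> d = dit.uniform_distribution(2, 2)   # two independent uniform bits
>>> q = 2
>>> S_X = tsallis_entropy(d, q, [0])
>>> S_Y = tsallis_entropy(d, q, [1])
>>> S_XY = tsallis_entropy(d, q)
>>> abs(S_XY - (S_X + S_Y + (1 - q) * S_X * S_Y)) < 1e-10
True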
tsallis_entropy(dist, order, rvs=None, rv_mode=None)
Compute the Tsallis entropy of order order.
Returns: S_q – The Tsallis entropy.
The perplexity is a trivial measure to make the Entropy more intuitive:
\[\P[X] = 2^{\H[X]}\]
The perplexity of a random variable is the size of a uniform distribution that would have the same entropy. For example, a distribution with 2 bits of entropy has a perplexity of 4, and so could be said to be “as random” as a four-sided die.
The conditional perplexity is defined in the natural way:
\[\P[X|Y] = 2^{\H[X|Y]}\]
We can see that the xor distribution is “4-way” perplexed:
In [1]: from dit.other import perplexity
In [2]: from dit.example_dists import Xor
In [3]: perplexity(Xor())
Out[3]: 4.0
perplexity(dist, rvs=None, crvs=None, rv_mode=None)
Returns: P – The perplexity.
The extropy [LSAgro11] is a dual to the Entropy. It is defined by:
\[\J[X] = -\sum_{x \in \mathcal{X}} (1 - p(x)) \log_2 (1 - p(x))\]
The entropy and the extropy satisfy the following relationship:
\[\H[X] + \J[X] = \sum_{x \in \mathcal{X}} \H\left[ \{ p(x), 1 - p(x) \} \right]\]
Unfortunately, the extropy does not yet have any intuitive interpretation.
In [1]: from dit.other import extropy
In [2]: from dit.example_dists import Xor
In [3]: extropy(Xor())
Out[3]: 1.2451124978365313
In [4]: extropy(Xor(), [0])
Out[4]: 1.0
extropy(dist, rvs=None, rv_mode=None)
Returns the extropy J[X] over the random variables in rvs.
If the distribution represents linear probabilities, then the extropy is calculated with units of ‘bits’ (base-2).
Returns: J – The extropy of the distribution.
There are also measures of “distance” or divergence between two (and in some cases, more) distributions:
Divergences are measures of comparison between distributions:
The cross entropy between two distributions \(p(x)\) and \(q(x)\) is given by:
\[\xH[p || q] = -\sum_{x \in \mathcal{X}} p(x) \log_2 q(x)\]
This quantifies the average cost of representing a distribution defined by the probabilities \(p(x)\) using the probabilities \(q(x)\). For example, the cross entropy of a distribution with itself is the entropy of that distribution, because the entropy quantifies the average cost of representing a distribution:
In [1]: from dit.divergences import cross_entropy
In [2]: p = dit.Distribution(['0', '1'], [1/2, 1/2])
In [3]: cross_entropy(p, p)
Out[3]: 1.0
If, however, we attempted to model a fair coin with a biased one, we could compute this mismatch with the cross entropy:
In [4]: q = dit.Distribution(['0', '1'], [3/4, 1/4])
In [5]: cross_entropy(p, q)
Out[5]: 1.207518749639422
Meaning, we will on average use about \(1.2\) bits to represent the flips of a fair coin. Turning things around, what if we had a biased coin that we attempted to represent with a fair coin:
In [6]: cross_entropy(q, p)
Out[6]: 1.0
So although the entropy of \(q\) is less than \(1\), we will use a full bit to represent its outcomes. Both of these results can easily be seen by considering the following identity:
\[\xH[p || q] = \H[p] + \DKL[p || q]\]
So in representing \(p\) using \(q\), we of course must at least use \(\H[p]\) bits – the minimum required to represent \(p\) – plus the Kullback-Leibler divergence of \(q\) from \(p\).
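A quick numerical check of this identity with the fair and biased coins from above (kullback_leibler_divergence is documented in the next subsection):
>>> from dit.divergences import cross_entropy, kullback_leibler_divergence
>>> p = dit.Distribution(['0', '1'], [1/2, 1/2])
>>> q = dit.Distribution(['0', '1'], [3/4, 1/4])
>>> lhs = cross_entropy(p, q)
>>> rhs = dit.shannon.entropy(p) + kullback_leibler_divergence(p, q)
>>> abs(lhs - rhs) < 1e-10
True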
cross_entropy(dist1, dist2, rvs=None, crvs=None, rv_mode=None)
The cross entropy between dist1 and dist2.
Returns: xh – The cross entropy between dist1 and dist2.
The Kullback-Leibler divergence, sometimes also called the relative entropy, of a distribution \(p\) from a distribution \(q\) is defined as:
\[\DKL[p || q] = \sum_{x \in \mathcal{X}} p(x) \log_2 \frac{p(x)}{q(x)}\]
The Kullback-Leibler divergence quantifies the average number of extra bits required to represent a distribution \(p\) when using an arbitrary distribution \(q\). This can be seen through the following identity:
\[\DKL[p || q] = \xH[p || q] - \H[p]\]
Where the Cross Entropy quantifies the total cost of encoding \(p\) using \(q\), and the Entropy quantifies the true, minimum cost of encoding \(p\). For example, let’s consider the cost of representing a biased coin by a fair one:
In [1]: from dit.divergences import kullback_leibler_divergence
In [2]: p = dit.Distribution(['0', '1'], [3/4, 1/4])
In [3]: q = dit.Distribution(['0', '1'], [1/2, 1/2])
In [4]: kullback_leibler_divergence(p, q)
Out[4]: 0.18872187554086717
That is, it costs us \(0.1887\) bits of wasted overhead by using a mismatched distribution.
Although the Kullback-Leibler divergence is often used to see how “different” two distributions are, it is not a metric. Importantly, it is neither symmetric nor does it obey the triangle inequality. It does, however, have the following property:
\[\DKL[p || q] \geq 0\]
with equality if and only if \(p = q\). This makes it a premetric.
kullback_leibler_divergence(dist1, dist2, rvs=None, crvs=None, rv_mode=None)
The Kullback-Leibler divergence between dist1 and dist2.
Returns: dkl – The Kullback-Leibler divergence between dist1 and dist2.
The Jensen-Shannon divergence is a principled divergence measure which is always finite for finite random variables. It quantifies how “distinguishable” two or more distributions are from each other. In its basic form it is:
\[\JSD[X || Y] = \H\left[ \frac{X + Y}{2} \right] - \frac{\H[X] + \H[Y]}{2}\]
That is, it is the entropy of the mixture minus the mixture of the entropy. This can be generalized to an arbitrary number of random variables with arbitrary weights:
\[\JSD[X_{0:n}] = \H\left[ \sum_i w_i X_i \right] - \sum_i w_i \H[X_i]\]
In [1]: from dit.divergences import jensen_shannon_divergence
In [2]: X = dit.ScalarDistribution(['red', 'blue'], [1/2, 1/2])
In [3]: Y = dit.ScalarDistribution(['blue', 'green'], [1/2, 1/2])
In [4]: jensen_shannon_divergence([X, Y])
Out[4]: 0.5
In [5]: jensen_shannon_divergence([X, Y], [3/4, 1/4])
Out[5]: 0.40563906222956647
In [6]: Z = dit.ScalarDistribution(['blue', 'yellow'], [1/2, 1/2])
In [7]: jensen_shannon_divergence([X, Y, Z])
Out[7]: 0.79248125036057782
In [8]: jensen_shannon_divergence([X, Y, Z], [1/2, 1/4, 1/4])
Out[8]: 0.75
Where does this equation come from? Consider Jensen’s inequality:
\[\Psi\left( \mathbb{E}[x] \right) \geq \mathbb{E}\left[ \Psi(x) \right]\]
where \(\Psi\) is a concave function. If we consider the divergence of the left and right side we find:
\[\Psi\left( \mathbb{E}[x] \right) - \mathbb{E}\left[ \Psi(x) \right] \geq 0\]
If we make that concave function \(\Psi\) the Shannon entropy \(\H\), we get the Jensen-Shannon divergence. Jensen from Jensen’s inequality, and Shannon from the use of the Shannon entropy.
Note
Some people look at the Jensen-Rényi divergence (where \(\Psi\) is the Rényi Entropy) and the Jensen-Tsallis divergence (where \(\Psi\) is the Tsallis Entropy).
The square root of the Jensen-Shannon divergence, \(\sqrt{\JSD}\), is a true metric between distributions.
The Jensen-Shannon divergence can be derived from other, more well known information measures; notably the Kullback-Leibler divergence and the mutual information.
The Jensen-Shannon divergence is the average Kullback-Leibler divergence of \(X\) and \(Y\) from their mixture distribution, \(M\):
\[\JSD[X || Y] = \frac{1}{2} \left( \DKL[X || M] + \DKL[Y || M] \right), \quad M = \frac{X + Y}{2}\]
It is also equal to a mutual information:
\[\JSD[X || Y] = \I[Z : M]\]
where \(M\) is the mixture distribution as before, and \(Z\) is an indicator variable over \(X\) and \(Y\). In essence, if \(X\) and \(Y\) are each an urn containing colored balls, and I randomly select one of the urns and draw a ball from it, then the Jensen-Shannon divergence is the mutual information between which urn I drew the ball from and the color of the ball drawn.
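A sketch of the urn picture using the two distributions defined above (tuple outcomes are assumed to behave like the string outcomes used elsewhere in these docs):
>>> urn_color = dit.Distribution([('X', 'red'), ('X', 'blue'), ('Y', 'blue'), ('Y', 'green')], [1/4]*4)
>>> dit.shannon.mutual_information(urn_color, [0], [1])   # equals jensen_shannon_divergence([X, Y]) above
0.5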
jensen_shannon_divergence(dists, weights=None)
The Jensen-Shannon Divergence: H(sum(w_i*P_i)) - sum(w_i*H(P_i)).
The square root of the Jensen-Shannon divergence is a distance metric.
Returns: jsd – The Jensen-Shannon Divergence.
While the cross entropy and the Kullback-Leibler divergence are not true metrics, the square root of the Jensen-Shannon divergence is.
There are several ways to decompose the information contained in a joint distribution. Here, we will demonstrate their behavior using four examples drawn from [ASBY14]:
In [1]: from dit.profiles import *
In [2]: ex1 = dit.Distribution(['000', '001', '010', '011', '100', '101', '110', '111'], [1/8]*8)
In [3]: ex2 = dit.Distribution(['000', '111'], [1/2]*2)
In [4]: ex3 = dit.Distribution(['000', '001', '110', '111'], [1/4]*4)
In [5]: ex4 = dit.Distribution(['000', '011', '101', '110'], [1/4]*4)
The I-diagrams for these four examples can be computed as follows:
In [6]: from dit.algorithms import ShannonPartition
In [7]: print(ShannonPartition(ex1))
+----------+--------+
| measure | bits |
+----------+--------+
| H[2|0,1] | 1.000 |
| H[1|0,2] | 1.000 |
| H[0|1,2] | 1.000 |
| I[1:2|0] | 0.000 |
| I[0:1|2] | 0.000 |
| I[0:2|1] | 0.000 |
| I[0:1:2] | 0.000 |
+----------+--------+
In [8]: print(ShannonPartition(ex2))
+----------+--------+
| measure | bits |
+----------+--------+
| H[2|0,1] | 0.000 |
| H[1|0,2] | 0.000 |
| H[0|1,2] | 0.000 |
| I[1:2|0] | 0.000 |
| I[0:1|2] | 0.000 |
| I[0:2|1] | 0.000 |
| I[0:1:2] | 1.000 |
+----------+--------+
In [9]: print(ShannonPartition(ex3))
+----------+--------+
| measure | bits |
+----------+--------+
| H[2|0,1] | 1.000 |
| H[1|0,2] | 0.000 |
| H[0|1,2] | 0.000 |
| I[1:2|0] | 0.000 |
| I[0:1|2] | 1.000 |
| I[0:2|1] | 0.000 |
| I[0:1:2] | 0.000 |
+----------+--------+
In [10]: print(ShannonPartition(ex4))
+----------+--------+
| measure | bits |
+----------+--------+
| H[2|0,1] | 0.000 |
| H[1|0,2] | 0.000 |
| H[0|1,2] | 0.000 |
| I[1:2|0] | 1.000 |
| I[0:1|2] | 1.000 |
| I[0:2|1] | 1.000 |
| I[0:1:2] | -1.000 |
+----------+--------+
The complexity profile is simply the amount of information at scale \(\geq k\) of each “layer” of the I-diagram [BY04].
Consider example 1, which contains three independent bits. Each of these bits is in the outermost “layer” of the i-diagram, and so the information in the complexity profile is all at layer 1:
In [11]: ComplexityProfile(ex1).draw()
Whereas in example 2, all the information is in the center, and so each scale of the complexity profile picks up that one bit:
In [12]: ComplexityProfile(ex2).draw()
Both bits in example 3 are at a scale of at least 1, but only the shared bit persists to scale 2:
In [13]: ComplexityProfile(ex3).draw()
Finally, example 4 (where each variable is the exclusive or of the other two):
In [14]: ComplexityProfile(ex4).draw()
The marginal utility of information (MUI) [ASBY14] takes a different approach. It asks, given an amount of information \(\I[d : \{X\}] = y\), what is the maximum amount of information one can extract using an auxiliary variable \(d\) as measured by the sum of the pairwise mutual informations, \(\sum \I[d : X_i]\). The MUI is then the rate of this maximum as a function of \(y\).
For the first example, each bit is independent and so basically must be extracted independently. Thus, as one increases \(y\) the maximum amount extracted grows equally:
In [15]: MUIProfile(ex1).draw()
In the second example, there is only one bit total to be extracted, but it is shared by each pairwise mutual information. Therefore, for each increase in \(y\) we get a threefold increase in the amount extracted:
In [16]: MUIProfile(ex2).draw()
For the third example, for the first one bit of \(y\) we can pull from the shared bit, but after that one must pull from the independent bit, so we see a step in the MUI profile:
In [17]: MUIProfile(ex3).draw()
Lastly, the xor example:
In [18]: MUIProfile(ex4).draw()
Also known as the connected information or network informations, the Schneidman profile exposes how much information is learned about the distribution when considering \(k\)-way dependencies [SSB+03]. In all the following examples, each individual marginal is already uniformly distributed, and so the connected information at scale 1 is 0.
In the first example, all the random variables are independent already, so fixing marginals above \(k=1\) does not result in any change to the inferred distribution:
In [19]: SchneidmanProfile(ex1).draw()
In the second example, by learning the pairwise marginals, we reduce the entropy of the distribution by two bits (from three independent bits, to one giant bit):
In [20]: SchneidmanProfile(ex2).draw()
For the third example, learning pairwise marginals only reduces the entropy by one bit:
In [21]: SchneidmanProfile(ex3).draw()
And for the xor, all bits appear independent until fixing the three-way marginals at which point one bit about the distribution is learned:
In [22]: SchneidmanProfile(ex4).draw()
[AP12] Samer A Abdallah and Mark D Plumbley. A measure of statistical complexity based on predictive information with application to finite spin systems. Physics Letters A, 376(4):275–281, January 2012. doi:10.1016/j.physleta.2011.10.066.
[ASBY14] B. Allen, B. C. Stacey, and Y. Bar-Yam. An information-theoretic formalism for multiscale structure in complex systems. arXiv preprint arXiv:1409.4708, 2014.
[BY04] Y. Bar-Yam. Multiscale complexity/entropy. Advances in Complex Systems, 7(01):47–63, 2004.
[Bel03] Anthony J Bell. The Co-Information Lattice. Springer, New York, ica 2003 edition, 2003.
[Cal02] Cristian Calude. Information and Randomness: An Algorithmic Perspective. Springer-Verlag New York, Inc., Secaucus, NJ, USA, 2nd edition, 2002. ISBN 3540434666.
[CT06] Thomas M. Cover and Joy A. Thomas. Elements of Information Theory. Wiley-Interscience, New York, second edition, 2006. ISBN 0471241954.
[GCJ+13] Virgil Griffith, Edwin KP Chong, Ryan G James, Christopher J Ellison, and James P Crutchfield. Intersection information based on common randomness. arXiv preprint arXiv:1310.1538, 2013.
[GacsKorner73] Peter Gács and János Körner. Common information is far less than mutual information. Problems of Control and Information Theory, 2(2):149–162, 1973.
[Han80] Te Sun Han. Multiple mutual informations and multiple interactions in frequency data. Information and Control, 46(1):26–45, July 1980. doi:10.1016/S0019-9958(80)90478-7.
[LSAgro11] Frank Lad, Giuseppe Sanfilippo, and Gianna Agrò. Extropy: a complementary dual of entropy. arXiv preprint arXiv:1109.6440, 2011.
[Mac03] David JC MacKay. Information theory, inference and learning algorithms. Cambridge University Press, 2003.
[RCVW04] Murali Rao, Yunmei Chen, Baba C Vemuri, and Fei Wang. Cumulative residual entropy: a new measure of information. Information Theory, IEEE Transactions on, 50(6):1220–1228, 2004.
[SSB+03] E. Schneidman, S. Still, M. J. Berry, W. Bialek, and others. Network information and connected correlations. Phys. Rev. Lett., 91(23):238701, 2003.
[TSE94] Giulio Tononi, Olaf Sporns, and Gerald M Edelman. A measure for brain complexity: relating functional segregation and integration in the nervous system. Proceedings of the National Academy of Sciences, 91(11):5033–5037, 1994.
[VW08] Sergio Verdu and Tsachy Weissman. The information lost in erasures. Information Theory, IEEE Transactions on, 54(11):5030–5058, 2008.
[Wat60] Satosi Watanabe. Information Theoretical Analysis of Multivariate Correlation. IBM Journal of Research and Development, 4(1):66–82, January 1960. doi:10.1147/rd.41.0066.
[Yeu08] Raymond W Yeung. Information theory and network coding. Springer, 2008.
six for Python 2/3 compatibility.