Welcome to histlite!

histlite is a histogram calculation and plotting library that is “lite” on data structures but rich in statistics and visualization features.

About this project

Introduction

histlite is a histogram calculation and plotting library that tries to be “lite” on data structures but rich in statistics and visualization features. So far, development has taken place during my (Mike Richman) time as a graduate student and post-doctoral researcher in the field of particle astrophysics — specifically, working with the IceCube Neutrino Observatory. Histlite is intended both to facilitate high-paced exploratory data analysis and to serve as a building block for potentially very complex maximum likelihood data analysis implementations.

The core design considerations are:

  • It must be trivial to work with and interchange between 1D, 2D, or ND histograms.
  • It should be as simple as possible to perform bin-wise arithmetic operations on one or more histograms; to perform sums, integrals, etc. (and thus normalizations) along one or more axes simultaneously; and to perform spline or user-defined functional fits.
  • It should be as simple as possible to achieve publication-quality plots.

The primary histogramming functionality consists of a thin wrapper around numpy.histogramdd(). Statistical tools leverage scipy but include custom solutions for some use cases. (Importantly, error propagation is currently handled manually but may be migrated to the uncertainties package in the future.) Plotting is done using matplotlib.
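
For instance, the bin contents produced by histlite.hist() should match a direct numpy.histogramdd() call (a minimal sketch, not taken from the histlite documentation):

# sketch: hl.hist() bin contents vs. a direct numpy.histogramdd() call
import numpy as np
import histlite as hl

x = np.random.normal (size=1000)
y = np.random.normal (size=1000)
h = hl.hist ((x, y), bins=10, range=((-3, 3), (-3, 3)))
counts, edges = np.histogramdd (np.transpose ([x, y]), bins=10, range=((-3, 3), (-3, 3)))
print (np.allclose (h.values, counts))  # expect True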

Motivation

Many problems in statistics involve understanding the behavior of distributions of observables. Histograms are one of the best-understood and most commonly used tools for discerning the properties of those distributions. A key reason for preferring histograms over other tools such as KDE is that per-bin error (or uncertainty) estimates are robust and straightforward to reason about.

Numpy is a fantastic and extremely general numerical library. Scipy provides many excellent statistical methods. Matplotlib offers beautiful and highly flexible visualizations suitable for publication-quality plots. However, as far as I know, to date there is no standard, widely used histogramming package for synthesizing functionality from these baseline tools.

Numpy provides reasonable (if not particularly optimal) methods for producing histograms in the form of (values,edges). Matplotlib is capable of plotting histograms, but that is about as far as it goes. For applications that require publication-quality plots only after non-trivial histogram combinations or manipulations, matplotlib offers little support. Scipy methods are happy to operate on bin centers and weights, but you’ll have to keep track of those yourself.

This package is my attempt to simplify histogram calculation and plotting operations, primarily by bringing together functionality that is currently spread across the scientific Python software stack.

Alternatives

Alternatives include at least the following:

  • plain numpy/matplotlib/scipy: You can get done what you want to get done, but you’ll almost certainly violate the DRY (don’t repeat yourself) guideline.
  • physt: I only recently became aware of this library, which seeks to solve essentially the same problems. Perhaps someday we will join forces. What I gather from the physt documentation is that it places much more emphasis on flexible (even optionally automatically-optimized) binning, but much less emphasis on working with histograms as probability density functions (PDFs) as required by much of my work.
  • ROOT: Many particle physicists use this CERN product for data analysis, but… to be brief, even with its Python bindings it will remain unacceptable for serious work, especially but not only outside of particle physics, until it is rewritten or replaced in its entirety.

Features

  • ND histograms: all histograms are N-dimensional, which keeps things simple and generic in the case of operations that involve mixed dimensionality (the same as numpy broadcasting).
  • easy, flexible plotting: we don’t cover every use case, but you have sane defaults and fine-grained control over 1D and 2D histogram presentation.
  • histogram arithmetic: histograms can be added, subtracted, multiplied, divided, log’d, sqrt’d, raised to powers, etc.
  • normalizations and projections: histograms can be normalized to integrate to 1; integrate to 1 accounting for varying bin volume (density=True); or sum to 1 (integrate=False). Histograms can be projected to lower dimensionality by integrating or summing over one or more axes; by computing quantiles along a specified axis; or by finding a specified containment interval along a specified axis, centered on the median and expressed as errorbars (a short sketch follows this list).
  • error/uncertainty tracking: per-bin errors are tracked in all but a handful of cases where their meaning is unclear (e.g. uncertainty on a quantile computed from some other histogram). Errors are propagated in the standard way under histogram arithmetic.
  • simple (“lite”) data model: a histlite.Hist is really just any curve with both values and errors defined spanning some specified bins. You don’t need to think about whether (h1 - h2) / (h1 + h2) is a histogram or some other thing; all the same methods, operators, and plotting functions will work the same.
  • fitting interface: 1D and 2D (ND still a work in progress) splines can be computed easily, producing histlite.SplineFit functors that know whether the underlying scipy splines were fitted in linear or log space on each axis. For convenience, arbitrary functions can be fit analogously to scipy.optimize.curve_fit(), using that function internally.
  • evaluation interface: histlite.Hist’s can be generated by evaluating arbitrary functions, or instances of histlite.Hist or histlite.SplineFit, on a specified grid. The resulting histlite.Hist’s can then of course be used in histogram arithmetic or plotting methods.
  • smoothing: histograms can be smoothed using histlite.Hist.gaussian_filter() or histlite.Hist.gaussian_filter1d(). Smooth “histograms” can be used to approximate Kernel Density Estimation (KDE) using histlite.kde().
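
As a quick teaser of a few of these features (a hedged sketch; the tutorials below cover each in detail):

# sketch: arithmetic, normalization, and projection in a few lines
import numpy as np
import histlite as hl

a = np.random.normal (size=10**4)
b = np.random.normal (size=10**4)
h_a = hl.hist (a, bins=40, range=(-4, 4))
h_b = hl.hist (b, bins=h_a.bins)
ratio = h_a / (h_a + h_b)       # bin-wise arithmetic with error propagation
pdf = h_a.normalize ()          # normalized to integrate to 1
hh = hl.hist ((a, b), bins=25, range=((-4, 4), (-4, 4)))
h_x = hh.project ([0])          # 2D -> 1D by summing over the other axis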

Installing histlite

As of June 2019, histlite is hosted on GitHub, with releases accessible via PyPI. For a basic user installation, simply use:

# install from PyPI
pip install --user histlite

For a development install (i.e. you want to modify histlite yourself), navigate to an appropriate development directory (e.g. $HOME/src), and then:

# using SSH
git clone git@github.com:zgana/histlite.git
# using HTTPS
git clone https://github.com/zgana/histlite.git
# finally, install
pip install --user --editable ./histlite

Tutorials

Getting Started With Histlite

In this brief tutorial, we introduce the basics of creating, manipulating, and plotting histograms using histlite.

[1]:
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
import histlite as hl

%matplotlib inline
[2]:
# I prefer a reordered color cycle
from cycler import cycler
mpl_colors_orig = np.array (plt.matplotlib.rcParamsDefault['axes.prop_cycle'].by_key()['color'])
mpl_colors = mpl_colors_orig[[0, 3, 2, 1, 4, 5, 6, 7]]
plt.rc ('axes', prop_cycle=cycler ('color', mpl_colors))

First histogram

For our first histogram, let’s use some normally-distributed random data and default binning in one dimension:

[3]:
x = np.random.normal (size=10**4)
[4]:
h = hl.hist (x)
[5]:
fig, ax = plt.subplots ()
hl.plot1d (ax, h)
_images/getting_started_8_0.png

We immediately notice that for this type of data, it may be useful to select symmetric binning including a central bin at 0:

[6]:
h = hl.hist (x, bins=51, range=(-5, 5))
[7]:
fig, ax = plt.subplots ()
hl.plot1d (ax, h)
_images/getting_started_11_0.png

Histlite keeps track of per-bin error estimates, and we can display them with many different styles. Let’s try plotting with errorbars, errorbands, errorbands with smooth lines rather than steps, and crosses:

[8]:
fig, ax = plt.subplots (2, 2, figsize=(8, 8))
hl.plot1d (ax[0][0], h, errorbars=True)
hl.plot1d (ax[0][1], h, errorbands=True)
hl.plot1d (ax[1][0], h, errorbands=True, drawstyle='default')
hl.plot1d (ax[1][1], h, crosses=True)
_images/getting_started_13_0.png

You have probably noticed that the plotting style differs from matplotlib’s standard:

[9]:
fig, ax = plt.subplots (1, 2, figsize=(8, 4))
hl.plot1d (ax[0], h)
ax[1].hist (x, bins=h.bins[0]);
_images/getting_started_15_0.png

As we will see, much of the design of histlite is centered on simply dealing with binned rather than smooth curves. However, the classic matplotlib style (and more) can still be obtained:

[10]:
fig, ax = plt.subplots (1, 2, figsize=(8, 4))
hl.fill_between (ax[0], 0, h)
ax[0].set_ylim (0)
ax[1].hist (x, bins=h.bins[0]);
_images/getting_started_17_0.png

Data model

The “lite” in histlite refers to the deliberately simple data structure. A histlite histogram is essentially just a bundle of bins, values, and error estimates. These are the required initialization arguments, which are determined from data and other arguments by factory functions such as histlite.hist(). They are also read-only properties of each Hist:

[11]:
print (h.bins)
print (h.values)
print (h.errors)
[array([-5.        , -4.80392157, -4.60784314, -4.41176471, -4.21568627,
       -4.01960784, -3.82352941, -3.62745098, -3.43137255, -3.23529412,
       -3.03921569, -2.84313725, -2.64705882, -2.45098039, -2.25490196,
       -2.05882353, -1.8627451 , -1.66666667, -1.47058824, -1.2745098 ,
       -1.07843137, -0.88235294, -0.68627451, -0.49019608, -0.29411765,
       -0.09803922,  0.09803922,  0.29411765,  0.49019608,  0.68627451,
        0.88235294,  1.07843137,  1.2745098 ,  1.47058824,  1.66666667,
        1.8627451 ,  2.05882353,  2.25490196,  2.45098039,  2.64705882,
        2.84313725,  3.03921569,  3.23529412,  3.43137255,  3.62745098,
        3.82352941,  4.01960784,  4.21568627,  4.41176471,  4.60784314,
        4.80392157,  5.        ])]
[  0.   0.   0.   0.   0.   1.   1.   1.   4.  11.  15.  27.  24.  38.
  75. 132. 159. 230. 313. 386. 461. 542. 641. 706. 777. 788. 818. 700.
 666. 619. 448. 381. 325. 240. 166. 108.  85.  39.  29.  23.  10.   4.
   3.   3.   0.   0.   1.   0.   0.   0.   0.]
[ 0.          0.          0.          0.          0.          1.
  1.          1.          2.          3.31662479  3.87298335  5.19615242
  4.89897949  6.164414    8.66025404 11.48912529 12.60952021 15.16575089
 17.69180601 19.6468827  21.47091055 23.28089345 25.3179778  26.57066051
 27.87471973 28.0713377  28.60069929 26.45751311 25.8069758  24.87971061
 21.16601049 19.5192213  18.02775638 15.49193338 12.88409873 10.39230485
  9.21954446  6.244998    5.38516481  4.79583152  3.16227766  2.
  1.73205081  1.73205081  0.          0.          1.          0.
  0.          0.          0.        ]

Note that bins is a list of per-axis ndarrays. Here we have a 1D histogram, and so a single axis, but in general histlite histograms are N-dimensional. Note also that the errors are \(\sqrt{N}\) per bin for unweighted histograms; this is the special case of the general rule that the error in each bin is \(\sqrt{\sum_i w_i^2}\), summed over the events \(\{i\}\) falling in that bin of a weighted histogram.
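
We can quickly check this error definition against a manual calculation (a small sketch using an arbitrary set of random weights):

# sketch: per-bin errors of a weighted histogram are sqrt(sum of squared weights)
w = np.random.uniform (0.5, 1.5, size=x.size)
hw = hl.hist (x, weights=w, bins=h.bins)
manual_errors = np.sqrt (np.histogram (x, bins=h.bins[0], weights=w**2)[0])
print (np.allclose (hw.errors, manual_errors))  # expect True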

On construction, it is determined whether bins are log-spaced (which can be conveniently requested of histlite.hist(), as we will see later). This information is exposed as an array of per-axis bools:

[12]:
print (h.log)
[False]
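
Log-spaced binning can be requested directly from histlite.hist() via the log argument (a brief sketch, using some assumed positive-valued data):

# sketch: requesting logarithmic bin edges
e = 10**np.random.normal (size=1000)   # positive, roughly log-normally distributed data
h_log = hl.hist (e, bins=30, log=True)
print (h_log.log)                      # -> [True]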

We also have access to the per-axis number of bins, binning ranges, and bin centers:

[13]:
print (h.n_bins)
print (h.range)
print (h.centers)
[51]
[(-5.0, 5.0)]
[array([-4.90196078, -4.70588235, -4.50980392, -4.31372549, -4.11764706,
       -3.92156863, -3.7254902 , -3.52941176, -3.33333333, -3.1372549 ,
       -2.94117647, -2.74509804, -2.54901961, -2.35294118, -2.15686275,
       -1.96078431, -1.76470588, -1.56862745, -1.37254902, -1.17647059,
       -0.98039216, -0.78431373, -0.58823529, -0.39215686, -0.19607843,
        0.        ,  0.19607843,  0.39215686,  0.58823529,  0.78431373,
        0.98039216,  1.17647059,  1.37254902,  1.56862745,  1.76470588,
        1.96078431,  2.15686275,  2.35294118,  2.54901961,  2.74509804,
        2.94117647,  3.1372549 ,  3.33333333,  3.52941176,  3.7254902 ,
        3.92156863,  4.11764706,  4.31372549,  4.50980392,  4.70588235,
        4.90196078])]
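
Although factory functions like histlite.hist() are the usual entry point, a Hist can also be built directly from its constituent pieces (a minimal sketch):

# sketch: constructing a Hist directly from bins, values, and errors
edges = np.linspace (0, 1, 11)   # 10 bins spanning [0, 1]
values = np.arange (10.0)
h_manual = hl.Hist ([edges], values, errors=np.sqrt (values))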

Working with multiple histograms

Let’s generate data for a second histogram:

[14]:
x2 = np.random.normal (loc=1, scale=2, size=int (x.size/2))

We would like a second histogram, h2, to be used with the original one, h. For this, we require the binning to match. One way this can be achieved is to bin x2 using h.bins:

[15]:
h2 = hl.hist (x2, bins=h.bins)
[16]:
fig, ax = plt.subplots ()
hl.plot1d (ax, h)
hl.plot1d (ax, h2)
_images/getting_started_31_0.png

This can also be expressed as follows:

[17]:
h2 = hl.hist_like (h, x2)
[18]:
fig, ax = plt.subplots ()
hl.plot1d (ax, h)
hl.plot1d (ax, h2)
_images/getting_started_34_0.png

We can perform a number of intuitive operations using these two histograms. The most obvious application is to sum them:

[19]:
fig, axs = plt.subplots (1, 2, figsize=(8, 4))

ax = axs[0]
hl.plot1d (ax, h)
hl.plot1d (ax, h2)
hl.plot1d (ax, h + h2, color='k')

ax = axs[1]
hl.stack1d (ax, [h, h2], colors='C0 C1'.split (), alpha=.67, lw=0)
hl.plot1d (ax, h + h2, color='k')
_images/getting_started_36_0.png

We can also create more elaborate constructs — and standard error propagation will be applied throughout:

[20]:
fig, ax = plt.subplots ()
hl.plot1d (ax, h / (h + h2), crosses=True)
_images/getting_started_38_0.png

An observant reader may notice that in this particular case, the errors on h and h + h2 are correlated, and thus naive error propagation overestimates the errors. Histlite provides a convenient way to obtain error estimates with proper coverage for the special case of \(\text{denominator} = \text{numerator} + \text{something else}\):

[21]:
fig, ax = plt.subplots ()
hl.plot1d (ax, h.efficiency (h + h2), crosses=True)
_images/getting_started_40_0.png

We also see from the above examples that a histlite.Hist is not strictly required to be meaningfully interpreted as a “histogram” in the classic sense. For example, it is perfectly happy with negative values:

[22]:
fig, ax = plt.subplots ()
hl.plot1d (ax, h - h2, crosses=True)
_images/getting_started_42_0.png

One can imagine constructing a highly descriptive class hierarchy to represent all manner of data structures that boil down to “values and errors per bin”. For histlite, we made the design decision to represent all of these with histlite.Hist to avoid the sort of complexity we see in, for example, integer to float promotion under division.

First 2D histogram

As mentioned above, histlite is fully generalized to N-dimensions throughout, with no special-casing for specific numbers of dimensions. Let’s try constructing a 2D histogram:

[23]:
y = np.random.normal (loc=0, scale=2, size=x.size)
[24]:
hh = hl.hist ((x, y), bins=51, range=(-5, 5))

We can plot this histogram, hh, easily like so:

[25]:
fig, ax = plt.subplots ()
ax.set_aspect ('equal')
hl.plot2d (ax, hh)
[25]:
<matplotlib.collections.QuadMesh at 0x7fbb357285c0>
_images/getting_started_49_1.png

We have a number of options for customizing this plot. Let’s try adding a colorbar, log-scaling the colormap, and specifying the color limits:

[26]:
fig, ax = plt.subplots ()
ax.set_aspect ('equal')
hl.plot2d (ax, hh, cbar=True, log=True, vmin=1, vmax=1e2)
[26]:
{'colormesh': <matplotlib.collections.QuadMesh at 0x7fbb35681358>,
 'colorbar': <matplotlib.colorbar.Colorbar at 0x7fbb356a7940>}
_images/getting_started_51_1.png

Unlike the histlite.plot1d() case, it is often convenient to customize the underlying matplotlib objects after histlite.plot2d() is finished. For example, we may want to label the colorbar. We can do this by digging into the dictionary returned by histlite.plot2d():

[27]:
fig, ax = plt.subplots ()
ax.set_aspect ('equal')
d = hl.plot2d (ax, hh, cbar=True, log=True, vmin=1, vmax=1e2)
d['colorbar'].set_label ('counts')
_images/getting_started_53_0.png

Brief survey of additional features

In this tutorial, we try to cover the basics that will let you start making plots as quickly as possible. Histlite has much more to offer. We will cover the details in other notebooks, but here we offer a teaser with little to no explanation.

Normalization
[28]:
fig, ax = plt.subplots ()
hn = h.normalize ()
print (hn)
print ('integral of hn is', hn.integrate ().values)
hl.plot1d (ax, hn, crosses=True)
Hist(51 bins in [-5.0,5.0], with sum 5.1, 11 empty bins, and 0 non-finite values)
integral of hn is 1.0
_images/getting_started_57_1.png
[29]:
fig, ax = plt.subplots ()
ax.set_aspect ('equal')
hl.plot2d (ax, hh.normalize (), cbar=True)
[29]:
{'colormesh': <matplotlib.collections.QuadMesh at 0x7fbb35c7a9e8>,
 'colorbar': <matplotlib.colorbar.Colorbar at 0x7fbb35937320>}
_images/getting_started_58_1.png
Pseudo-KDE (kernel density estimation)
[30]:
fig, (ax1, ax2) = plt.subplots (1, 2, figsize=(8, 4), gridspec_kw=dict (width_ratios=(2,1)))
hl.plot1d (ax1, hl.kde (x))
hl.plot2d (ax2, hl.kde ((x, y)), cbar=True)
ax2.set_aspect ('equal')
plt.tight_layout ()
_images/getting_started_60_0.png
Axis transformation
[31]:
fig, ax = plt.subplots ()
h_exp = h.transform_bins (lambda x: 10**x)
ax.semilogx ()
hl.plot1d (ax, h_exp, crosses=True)
_images/getting_started_62_0.png
Cumulative sums
[32]:
fig, (ax1, ax2) = plt.subplots (1, 2, figsize=(8, 4))
hl.plot1d (ax1, h.cumsum () / h.sum ())
hl.plot2d (ax2, hh.cumsum (), cbar=True)
plt.tight_layout ()
_images/getting_started_64_0.png
Projections
[33]:
fig, (ax1, ax2, ax3) = plt.subplots (1, 3, figsize=(12,4))
hl.plot1d (ax1, hh.project ([0]))
hl.plot1d (ax1, h, ls=':') # slightly more elements since some y's were under/overflow

for frac in (.16, .5, .84):
    hl.plot1d (ax2, hh.contain (-1, frac))

hl.plot1d (ax3, hh.contain_project (-1),  # 1σ errors by default
           drawstyle='default', errorbands=True)
_images/getting_started_66_0.png
[34]:
fig, ax = plt.subplots ()
x3 = np.random.uniform (-5, 5, size=10**4)
y3 = np.random.normal (loc=x3**2, scale=1 + np.abs (x3))
hl.plot1d (ax, hl.hist ((x3, y3), bins=100).contain_project (-1),
           errorbands=True, drawstyle='default')
_images/getting_started_67_0.png
Values transformation
[35]:
fig, ax = plt.subplots ()
kw = dict (crosses=True)
hl.plot1d (ax, h**2, label='squared', **kw)
hl.plot1d (ax, h, label='original', color='k', lw=3, **kw)
hl.plot1d (ax, h.sqrt (), label='sqrt', **kw)
hl.plot1d (ax, h.ln (), label='natural log', **kw)
hl.plot1d (ax, h.log10 (), label='log10', **kw)
ax.semilogy (nonposy='clip')
ax.legend ();
_images/getting_started_69_0.png
Transpose
[36]:
fig, ax = plt.subplots (1, 2, figsize=(8, 8))
hl.plot2d (ax[0], hh)
hl.plot2d (ax[1], hh.T)
for a in ax:
    a.set_aspect ('equal')
_images/getting_started_71_0.png
Smoothing
[37]:
fig, ax = plt.subplots (1, 2, figsize=(8, 8))
hl.plot2d (ax[0], hh)
hl.plot2d (ax[1], hh.gaussian_filter (1))
for a in ax:
    a.set_aspect ('equal')
_images/getting_started_73_0.png
Spline Fit
[38]:
fig, ax = plt.subplots ()
hl.plot1d (ax, h, crosses=True)

X = np.linspace (-5, 5, 1000)
s = h.spline_fit ()
ax.plot (X, s(X))  # naturally, overshoots in region without data...
[38]:
[<matplotlib.lines.Line2D at 0x7fbb3580d710>]
_images/getting_started_75_1.png
Function fit
[39]:
fig, ax = plt.subplots ()
hl.plot1d (ax, hn, crosses=True)

params, cov = hn.curve_fit (lambda x, loc, scale: stats.norm.pdf (x, loc, scale))
ax.plot (X, stats.norm.pdf (X, *params))
[39]:
[<matplotlib.lines.Line2D at 0x7fbb358a2630>]
_images/getting_started_77_1.png
Slicing in binning units
[40]:
fig, ax = plt.subplots (1, 2, figsize=(8, 4))
hl.plot1d (ax[0], h[0:])
hl.plot1d (ax[0], hh[0])
hl.plot2d (ax[1], hh[0:,0:])
[40]:
<matplotlib.collections.QuadMesh at 0x7fbb35849160>
_images/getting_started_79_1.png

PDFs and Normalization

In this tutorial, we discuss the treatment of histograms as probability density functions (PDFs).

[1]:
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
import histlite as hl

%matplotlib inline
[2]:
# I prefer a reordered color cycle
from cycler import cycler
mpl_colors_orig = np.array (plt.matplotlib.rcParamsDefault['axes.prop_cycle'].by_key()['color'])
mpl_colors = mpl_colors_orig[[0, 3, 2, 1, 4, 5, 6, 7]]
plt.rc ('axes', prop_cycle=cycler ('color', mpl_colors))

Histogram as PDF

We begin by designing artificial “data” distributions with which to test various features. First, a non-trivial 1D distribution:

[3]:
props1d = dict(loc=np.r_[20, 33, 60], scale=np.r_[5, 8, 13], norm=np.r_[.2, .3, 0.5])
[4]:
def pdf1d(x):
    pdf0 = props1d['norm'][0] * stats.norm.pdf(x, props1d['loc'][0], props1d['scale'][0])
    pdf1 = props1d['norm'][1] * stats.norm.pdf(x, props1d['loc'][1], props1d['scale'][1])
    pdf2 = props1d['norm'][2] * stats.norm.pdf(x, props1d['loc'][2], props1d['scale'][2])
    return pdf0 + pdf1 + pdf2
[5]:
fig, ax = plt.subplots()
x = np.linspace(0, 100, 300)
ax.plot(x, pdf1d(x), 'k:')
ax.set_xlabel('$x$')
ax.set_ylabel('PDF');
_images/pdfs_and_normalization_8_0.png

We can obtain random numbers distributed according to this distribution like so:

[6]:
def random1d(N):
    loc = props1d['loc']
    scale = props1d['scale']
    norm = props1d['norm']
    which = np.random.choice(3, p=norm, size=N)
    return np.random.normal(loc[which], scale[which])
[7]:
np.random.seed(1)
X = random1d(10000)

The histogram of these values looks like so:

[8]:
h1d = hl.hist(X, bins=50, range=(0, 100))
h1d
[8]:
Hist(50 bins in [0.0,100.0], with sum 9993.0, 1 empty bins, and 0 non-finite values)
[9]:
fig, ax = plt.subplots()
hl.plot1d(h1d, crosses=True)
ax.set_xlabel('$x$')
ax.set_ylabel('counts per bin');
_images/pdfs_and_normalization_14_0.png

We can obtain a normalized histogram like so:

[10]:
p1d = h1d.normalize()
[11]:
fig, ax = plt.subplots()
hl.plot1d(p1d, crosses=True)
ax.set_xlabel('$x$')
ax.set_ylabel('PDF');
_images/pdfs_and_normalization_17_0.png

The difference between h1d and p1d here is simply a scalar factor; p1d represents the PDF \(p(x)\). We can compare the integrals of each histogram:

[12]:
print(h1d.integrate())
print(p1d.integrate())
Hist(with value 19986.0)
Hist(with value 1.0)

Notice that integrating p1d yields a value of 1. Notice also that the result of the integral is still a Hist instance, similar to certain dimensionality-agnostic numpy operations which yield 0-dimensional ndarray instances. This generality is useful because, as we will see below, integration and normalization can be performed along specified axes.

We can also compute the probability mass function (PMF) by normalizing such that the sum, rather than integral, will be unity:

[13]:
pmf1d = h1d.normalize(integrate=False)
pmf1d
[13]:
Hist(50 bins in [0.0,100.0], with sum 1.0, 1 empty bins, and 0 non-finite values)

So while the value for each bin in p1d represents the probability density across that bin, \(p(x)\), the values of pmf1d represent the total probability within each bin, \(\text{pmf}(i)=\int_{x_i}^{x_{i+1}}p(x)\,dx\).
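
With the uniform binning used here, each PMF value is simply the density times the bin width, which we can verify directly (a small sketch):

# sketch: pmf values = density values * bin widths
widths = np.diff(p1d.bins[0])
print(np.allclose(pmf1d.values, p1d.values * widths))  # expect True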

We can see that the normalized histogram is a decent representation of the underlying distribution:

[14]:
fig, ax = plt.subplots()
x = np.linspace(0, 100, 300)
ax.plot(x, pdf1d(x), 'k:', label='True PDF')
hl.plot1d(p1d, crosses=True, label='50 bins')
ax.set_xlabel('$x$')
ax.set_ylabel('PDF')
ax.legend();
_images/pdfs_and_normalization_25_0.png

PDF Modeling

We might be interested in inferring a smooth PDF given a histogram of some real (in this case, “real”) data. One simple solution is to fit a spline:

[15]:
s1d = p1d.spline_fit()
[16]:
fig, ax = plt.subplots()
x = np.linspace(0, 100, 300)
ax.plot(x, pdf1d(x), 'k:', label='True PDF')
hl.plot1d(p1d, crosses=True, label='50 bins')
ax.plot(x, s1d(x), label='Spline Fit')
ax.set_xlabel('$x$')
ax.set_ylabel('PDF')
ax.legend();
_images/pdfs_and_normalization_29_0.png

What is s1d?

[17]:
type(s1d)
[17]:
histlite.SplineFit

histlite spline fits are thin wrappers around scipy.interpolate objects. In this case, we have a UnivariateSpline:

[18]:
s1d.spline
[18]:
<scipy.interpolate.fitpack2.UnivariateSpline at 0x7f9b7a537278>

One nice feature of this wrapper is that it is readily interoperable with ordinary Hist objects:

[19]:
fig, ax = plt.subplots()
hl.plot1d(s1d.eval(bins=p1d.bins) / p1d, crosses=True)
ax.axhline(1, color='k', ls=':', zorder=-1)
ax.set_xlabel('$x$')
ax.set_ylabel('spline/hist');
_images/pdfs_and_normalization_35_0.png

It can sometimes be convenient to fit the spline in log-value space, e.g. when we know the underlying function is strictly positive. Meanwhile, we also have access to the constructor kwargs for the internally-used scipy.interpolate object:

[20]:
s1d_pos = p1d.spline_fit(log=True, s=30)
[21]:
fig, ax = plt.subplots()
x = np.linspace(0, 100, 300)
ax.plot(x, pdf1d(x), 'k:', label='True PDF')
hl.plot1d(p1d, crosses=True, label='50 bins')
ax.plot(x, s1d_pos(x), label='Spline Fit (log-value fit)')
ax.set_xlabel('$x$')
ax.set_ylabel('PDF')
ax.legend();
_images/pdfs_and_normalization_38_0.png

We might, however, notice (or perhaps know a priori) that the function is in fact a combination of three Gaussians. Then we can perform a fit to an analytic model:

[22]:
def model_p1d(x, m0, m1, m2, s0, s1, s2, f0, f1):
    f2 = 1 - f0 - f1
    pdf0 = f0 * stats.norm.pdf(x, m0, s0)
    pdf1 = f1 * stats.norm.pdf(x, m1, s1)
    pdf2 = f2 * stats.norm.pdf(x, m2, s2)
    return pdf0 + pdf1 + pdf2
[23]:
params1d, cov = p1d.curve_fit(model_p1d, p0=(20, 40, 60, 5, 5, 5, .3, .3))
params1d
[23]:
array([19.70170103, 32.76340277, 60.96181126,  4.83091852,  9.26921554,
       12.88940782,  0.16090333,  0.36418666])

How do these compare with the true values?

[24]:
print(params1d)
true_params1d = np.r_[props1d['loc'], props1d['scale'], props1d['norm'][:-1]]
print(true_params1d)
print(params1d / true_params1d)
[19.70170103 32.76340277 60.96181126  4.83091852  9.26921554 12.88940782
  0.16090333  0.36418666]
[20.  33.  60.   5.   8.  13.   0.2  0.3]
[0.98508505 0.99283039 1.01603019 0.9661837  1.15865194 0.99149291
 0.80451667 1.21395554]

And how does the model fit actually look?

[25]:
fig, ax = plt.subplots()
x = np.linspace(0, 100, 300)
ax.plot(x, pdf1d(x), 'k:', label='True PDF')
hl.plot1d(p1d, crosses=True, label='50 bins')
ax.plot(x, model_p1d(x, *params1d), label='Model Fit')
ax.set_xlabel('$x$')
ax.set_ylabel('PDF')
ax.legend();
_images/pdfs_and_normalization_45_0.png

Finally, note that we can often get a reasonable representation of a distribution using a sort of approximate KDE provided by histlite:

[26]:
# kernel expressed as fraction of axis range
k1d = hl.kde(X, kernel=.02)
k1d
[26]:
Hist(1000 bins in [2.820370949977928,106.97260109208092], with sum 9.601330654520044, 0 empty bins, and 0 non-finite values)
[27]:
fig, ax = plt.subplots()
x = np.linspace(0, 100, 300)
ax.plot(x, pdf1d(x), 'k:', label='True PDF')
hl.plot1d(p1d, crosses=True, label='50 bins')
hl.plot1d(k1d, label='KDE')
ax.set_xlabel('$x$')
ax.set_ylabel('PDF')
ax.legend();
_images/pdfs_and_normalization_48_0.png

The approximate KDE implemented by histlite is achieved by starting with a densely binned histogram and then using scipy.ndimage to apply a Gaussian smoothing with a kernel \(\sigma\) expressed as a fraction of the axis range — and then finally (by default) normalized to integrate to unity.
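
Roughly the same result can be assembled by hand (a sketch of the idea; the exact defaults and kernel-width handling inside histlite.kde() may differ):

# sketch: densely binned histogram + Gaussian smoothing, then normalize
dense = hl.hist(X, bins=1000)
sigma_bins = 0.02 * dense.n_bins[0]   # kernel sigma of 2% of the axis range, converted to bin units
# (assumes the sigma argument is passed through to scipy.ndimage.gaussian_filter1d)
k1d_manual = dense.gaussian_filter1d(sigma_bins).normalize()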

It is up to you, of course, to determine the best way to model your distributions. Let’s wrap up this section by comparing the various models we came up with — by the way, notice that Hist instances are callable, accepting one positional argument per dimension, returning the value of the corresponding bin(s):

[28]:
fig, (ax, rax) = plt.subplots(2, 1, figsize=(6,5), sharex=True, gridspec_kw=dict(height_ratios=[2,1]))
x = np.linspace(0, 100, 300)

# with multiple axes in play, we need to specify ax (so, not rax) here
hl.plot1d(ax, p1d, color='k', crosses=True, label='50 bins')
ax.plot(x, s1d(x), label='Spline Fit')
ax.plot(x, s1d_pos(x), label='Spline Fit (log-value fit)')
ax.plot(x, model_p1d(x, *params1d), label='Model Fit')
ax.plot(x, k1d(x), label='KDE')
ax.plot(x, pdf1d(x), 'k:', label='True PDF')

rax.plot(x, s1d(x) / pdf1d(x))
rax.plot(x, model_p1d(x, *params1d) / pdf1d(x))
rax.plot(x, k1d(x) / pdf1d(x))
rax.axhline(1, color='k', ls=':', zorder=-1)

ax.legend(ncol=2)
ax.set_ylim(-0.001, .035)
rax.set_ylim(.5, 1.5)
ax.set_ylabel('PDF')
rax.set_xlabel('$x$')
rax.set_ylabel(r'ratio wrt true');
_images/pdfs_and_normalization_51_0.png

PDF Sampling

For many applications, it is useful to be able to randomly sample from a histogram. You may specify a seed for the random number generator (this uses a local random state and does not affect the behavior of np.random.* functions). The histogram need not, but may, already be normalized in some way:

[29]:
N, seed = 10, 1
print(h1d.sample(N, seed=seed))
print(p1d.sample(N, seed=seed))
print(pmf1d.sample(N, seed=seed))
[array([36.83838903, 59.370439  ,  2.4089045 , 31.75623487, 20.05477519,
       19.34093502, 22.8346096 , 33.11737966, 34.28077388, 44.39620298])]
[array([36.83838903, 59.370439  ,  2.4089045 , 31.75623487, 20.05477519,
       19.34093502, 22.8346096 , 33.11737966, 34.28077388, 44.39620298])]
[array([36.83838903, 59.370439  ,  2.4089045 , 31.75623487, 20.05477519,
       19.34093502, 22.8346096 , 33.11737966, 34.28077388, 44.39620298])]

If you already have a np.random.RandomState instance, it can also be used:

[30]:
random = np.random.RandomState(seed)
print(h1d.sample(N, random=random))
[array([36.83838903, 59.370439  ,  2.4089045 , 31.75623487, 20.05477519,
       19.34093502, 22.8346096 , 33.11737966, 34.28077388, 44.39620298])]

Notice that, similar to Hist.bins, the return here is a sequence; in the case of higher dimensionality histograms, the result is one array per dimension.

At present, there is no implementation for randomly sampling directly from a SplineFit. However, sampling is of course possible after evaluating the spline on a grid:

[31]:
print(s1d_pos.eval(bins=1e3).sample(N, seed=seed))
[array([36.74191945, 58.16852195,  2.42044522, 29.98781174, 21.60273876,
       18.76704675, 23.64173048, 32.55586898, 35.51403869, 44.71981015])]

And of course sampling also works for the densely binned pseudo-KDE:

[32]:
print(k1d.sample(N, seed=seed))
[array([36.81765802, 58.19657229,  3.1541218 , 30.09556091, 21.57062485,
       18.72134062, 23.59012797, 32.56194533, 35.53879283, 44.81435241])]

You may have noticed that the random samples are very similar to each other, even though the details of the distributions differ, as long as the same random state is used. This is because, internally, inverse transform sampling is used, with only a single random.uniform(0, 1, N) call driving the randomness. In this case, all the CDFs are very similar.
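
The idea can be sketched by hand for the 1D case (an illustration of the general approach, not necessarily histlite’s exact internals):

# sketch: inverse transform sampling from a 1D histogram
cdf = np.cumsum(pmf1d.values)
cdf = cdf / cdf[-1]                               # make sure the CDF ends exactly at 1
u = np.random.RandomState(seed).uniform(0, 1, N)  # a single uniform draw drives everything
samples = pmf1d.centers[0][np.searchsorted(cdf, u)]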

Speaking of CDFs, these are straightforward to compute and plot as well:

[33]:
fig, ax = plt.subplots()
hl.plot1d(pmf1d.cumsum(), color='k', label='50 bins')
hl.plot1d(k1d.cumsum(normalize=True), label='KDE')
ax.set_xlabel(r'$x$')
ax.set_ylabel('CDF')
ax.legend();
_images/pdfs_and_normalization_64_0.png

2D Histograms

histlite provides all of the above functionality for 2D histograms — and then some! Let’s walk through each feature, using a 2D elliptical Gaussian as our example:

[35]:
np.random.seed(1)
N = 10000
X0 = np.random.normal(0, np.exp(1), N)
Y0 = np.random.normal(0, 1, N)
theta = np.radians(30)
X = X0 * np.cos(theta) - Y0 * np.sin(theta)
Y = X0 * np.sin(theta) + Y0 * np.cos(theta)
[36]:
h2d = hl.hist((X, Y), bins=50, range=((-6,6), (-6,6)))
h2d
[36]:
Hist(50 bins in [-6.0,6.0], 50 bins in [-6.0,6.0], with sum 9876.0, 1382 empty bins, and 0 non-finite values)
[37]:
fig, ax = plt.subplots()
hl.plot2d(h2d, cbar=True, clabel='counts per bin')
ax.set_xlabel('$x$')
ax.set_ylabel('$y$');
_images/pdfs_and_normalization_70_0.png

Joint and Conditional PDFs

We can compute the joint PDF \(p(x,y)\):

[38]:
p2d = h2d.normalize()
[39]:
fig, ax = plt.subplots()
hl.plot2d(p2d, cbar=True, clabel='$p(x,y)$')
ax.set_xlabel('$x$')
ax.set_ylabel('$y$');
_images/pdfs_and_normalization_74_0.png

Or the joint PMF:

[40]:
pmf2d = p2d.normalize(integrate=False)
[41]:
fig, ax = plt.subplots()
hl.plot2d(pmf2d, cbar=True, clabel='$\mathsf{pmf}(x,y)$')
ax.set_xlabel('$x$')
ax.set_ylabel('$y$');
_images/pdfs_and_normalization_77_0.png

For multi-dimensional histograms, we can also work with conditional PDFs:

[42]:
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10,4))
hl.plot2d(ax1, h2d.normalize(0), cbar=True, vmax=.5, clabel='$p(x|y)$')
hl.plot2d(ax2, h2d.normalize(1), cbar=True, vmax=.5, clabel='$p(y|x)$')
for ax in (ax1, ax2):
    ax.set_xlabel('$x$')
    ax.set_ylabel('$y$')
plt.tight_layout();
_images/pdfs_and_normalization_79_0.png

Depending on the application, the presence of empty bins in \(p(x|y)\) may be a problem. One way to address these is with the to_finite() method:

[43]:
fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(15,4))
hl.plot2d(ax1, h2d.normalize(0).to_finite(), cbar=True, vmax=.5,
          clabel='$p(x|y)$ (nan $\to$ 0)')
hl.plot2d(ax2, h2d.normalize(0).to_finite(nan=np.min), cbar=True, vmax=.5,
          clabel='$p(x|y)$ (nan $\to$ min of finite values)')
hl.plot2d(ax3, h2d.normalize(0).to_finite(nan=np.mean), cbar=True, vmax=.5,
          clabel='$p(x|y)$ (nan $\to$ mean of finite values)')
for ax in (ax1, ax2, ax3):
    ax.set_xlabel('$x$')
    ax.set_ylabel('$y$')
plt.tight_layout();
_images/pdfs_and_normalization_81_0.png

Alternatively, if you are sure you know what you’re doing, you can modify the values directly:

[44]:
fig, ax = plt.subplots()
p2d_x_y = h2d.normalize(0)
p2d_x_y.values[~np.isfinite(p2d_x_y.values)] = 0.2
hl.plot2d(p2d_x_y, cbar=True, vmax=.5,
          clabel='$p(x|y)$ (nan $\to$ 0.2)')
ax.set_xlabel('$x$')
ax.set_ylabel('$y$');
_images/pdfs_and_normalization_83_0.png

Modeling

We have spline, curve fitting, and KDE here as well. Here’s a spline fit:

[45]:
s2d = p2d.spline_fit(s=.03, kx=2, ky=2)
[46]:
fig, ax = plt.subplots()
hl.plot2d(s2d.eval(bins=200), cbar=True, clabel='$p(x,y)$: spline')
ax.set_xlabel('$x$')
ax.set_ylabel('$y$');
_images/pdfs_and_normalization_87_0.png

KDE:

[47]:
k2d = hl.kde((X,Y), range=p2d.range, kernel=.03)
[48]:
fig, ax = plt.subplots()
hl.plot2d(k2d, cbar=True, clabel='$p(x,y)$: KDE')
ax.set_xlabel('$x$')
ax.set_ylabel('$y$');
_images/pdfs_and_normalization_90_0.png

Analytical model:

[49]:
def model_p2d(XY, loc1, scale1, loc2, scale2, theta):
    X, Y = XY
    X0 = X * np.cos(theta) + Y * np.sin(theta)
    Y0 = -X * np.sin(theta) + Y * np.cos(theta)
    return stats.norm.pdf(X0, loc1, scale1) * stats.norm.pdf(Y0, loc2, scale2)
[50]:
params2d, cov = p2d.curve_fit(model_p2d, p0=(0, 3, 0, 1, np.radians(10)))
params2d
[50]:
array([0.09721215, 2.64869243, 0.01828913, 0.99518638, 0.5387843 ])
[51]:
np.degrees(params2d[-1])
[51]:
30.870066424450698

Now might be a good time to point out that we can also construct a histogram from a callable function. Let’s try it with this model, and then plot it:

[52]:
h_model_p2d = hl.hist_from_eval(
    (lambda x, y: model_p2d((x,y), *params2d)),
    vectorize=False, bins=k2d.bins)
[53]:
fig, ax = plt.subplots()
hl.plot2d(h_model_p2d, cbar=True, clabel='$p(x,y)$: model fit')
ax.set_xlabel('$x$')
ax.set_ylabel('$y$');
_images/pdfs_and_normalization_97_0.png

We can use this to plot the “truth” as well:

[54]:
h_truth_p2d = hl.hist_from_eval(
    (lambda x, y: model_p2d((x,y), 0, np.exp(1), 0, 1, np.radians(30))),
    vectorize=False, bins=k2d.bins)
[55]:
fig, ax = plt.subplots()
hl.plot2d(h_truth_p2d, cbar=True, clabel='$p(x,y)$: truth')
ax.set_xlabel('$x$')
ax.set_ylabel('$y$');
_images/pdfs_and_normalization_100_0.png

Let’s sum this up by plotting the various parameterizations:

[56]:
fig, axs = plt.subplots(2, 3, figsize=(15, 8))
axs = np.ravel(axs)
kw = dict(cbar=True, vmax=.06)
bins = p2d.bins
hl.plot2d(axs[0], p2d,
          clabel='$p(x,y)$: hist', **kw)
hl.plot2d(axs[1], k2d,
          clabel='$p(x,y)$: KDE', **kw)
hl.plot2d(axs[3], s2d.eval(bins=200),
          clabel='$p(x,y)$: spline fit', **kw)
hl.plot2d(axs[4], h_model_p2d,
          clabel='$p(x,y)$: model fit', **kw)
hl.plot2d(axs[5], h_truth_p2d,
          clabel='$p(x,y)$: truth', **kw)
for ax in axs:
    ax.set_xlabel(r'$x$')
    ax.set_ylabel(r'$y$')
axs[2].set_visible(False)
_images/pdfs_and_normalization_102_0.png

and their ratios with respect to the truth:

[57]:
fig, axs = plt.subplots(2, 2, figsize=(10, 8))
axs = np.ravel(axs)
kw = dict(cbar=True, log=True, vmin=1e-1, vmax=1e1, cmap='RdBu_r')
bins = p2d.bins
hl.plot2d(axs[0], p2d.eval(bins=bins) / h_truth_p2d.eval(bins=bins),
          clabel='$p(x,y)/\mathsf{truth}$: hist', **kw)
hl.plot2d(axs[1], k2d.eval(bins=bins) / h_truth_p2d.eval(bins=bins),
          clabel='$p(x,y)/\mathsf{truth}$: KDE', **kw)
hl.plot2d(axs[2], s2d.eval(bins=bins) / h_truth_p2d.eval(bins=bins),
          clabel='$p(x,y)/\mathsf{truth}$: spline fit', **kw)
hl.plot2d(axs[3], h_model_p2d.eval(bins=bins) / h_truth_p2d.eval(bins=bins),
          clabel='$p(x,y)/\mathsf{truth}$: model fit', **kw)
for ax in axs:
    ax.set_xlabel(r'$x$')
    ax.set_ylabel(r'$y$')
_images/pdfs_and_normalization_104_0.png

Whoa, the outliers here look pretty bad. Comparing by difference is a little less pessimistic:

[58]:
fig, axs = plt.subplots(2, 2, figsize=(10, 8))
axs = np.ravel(axs)
kw = dict(cbar=True, vmin=-.03, vmax=.03, cmap='RdBu_r')
bins = p2d.bins
ref = h_truth_p2d.eval(bins=bins)
hl.plot2d(axs[0], p2d.eval(bins=bins) - ref,
          clabel='$p(x,y)-\mathsf{truth}$: hist', **kw)
hl.plot2d(axs[1], k2d.eval(bins=bins) - ref,
          clabel='$p(x,y)-\mathsf{truth}$: KDE', **kw)
hl.plot2d(axs[2], s2d.eval(bins=bins) - ref,
          clabel='$p(x,y)-\mathsf{truth}$: spline fit', **kw)
hl.plot2d(axs[3], h_model_p2d.eval(bins=bins) - ref,
          clabel='$p(x,y)-\mathsf{truth}$: model fit', **kw)
for ax in axs:
    ax.set_xlabel(r'$x$')
    ax.set_ylabel(r'$y$')
_images/pdfs_and_normalization_106_0.png

Random Sampling

Histlite allows random sampling in any number of dimensions. As an example, consider:

[59]:
N, seed = 5, 1
print(h2d.sample(N, seed=seed))
print(p2d.sample(N, seed=seed))
print(s2d.eval(bins=100).sample(N, seed=seed))
[array([-0.45783874,  1.24470245, -5.91706543, -1.34477581, -2.51068398]), array([-2.05939332,  1.60445268, -4.03093146,  2.37074818,  0.24657302])]
[array([-0.45783874,  1.24470245, -5.91706543, -1.34477581, -2.51068398]), array([-2.05939332,  1.60445268, -4.03093146,  2.37074818,  0.24657302])]
[array([-0.46891937,  1.34235123, -5.95853271, -1.1523879 , -2.45534199]), array([-2.94969666,  0.68222634, -3.69546573, -3.01462591, -0.59671349])]

Note, however, that with more dimensions, considerably more statistics are required in the input data for the sampling to be stable with respect to modeling choices.

It is also possible to sample from a subset of dimensions, after specifying some of the parameters:

[60]:
for x in [-3, 0, 3]:
    print(h2d.sample(N, x, seed=seed))
[array([-2.25825073, -1.30681426, -1.95666546, -0.71195175, -3.20477583])]
[array([-0.49287041, -0.57318781,  0.07838039, -0.0958898 ,  0.03371492])]
[array([0.88990424, 3.48293473, 2.40766102, 2.91407529, 2.77436452])]
[61]:
for y in [-3, 0, 3]:
    print(h2d.T.sample(N, y, seed=seed))
[array([-2.0405207 , -0.4614405 , -2.43781501, -1.76153886, -4.42869307])]
[array([-3.41787792, -3.53784644, -2.12928192,  1.98962526, -0.6510304 ])]
[array([4.864174  , 4.7018449 , 2.37746693, 4.06817974, 4.8371715 ])]

Conclusions

One of the primary design goals of histlite is to serve as a convenient foundation for various higher-level analyses — especially maximum likelihood analysis. It was originally written to support searches for astrophysical neutrino sources with IceCube. More recently, histlite is also being investigated as a building block in neutrinoless double-beta (0νββ) decay searches by nEXO.

Here we’ve surveyed features supporting 1D and 2D PDFs. It’s no problem to work with higher dimensional PDFs, with the exception that SciPy spline routines become extremely RAM-intensive as the dimensionality increases.

What’s next? Contact me if you have any requests — I think the next tutorial will cover slicing, projections, and more details on histogram arithmetic.

API reference

Calculate and plot histograms, easily.

class histlite.Hist(bins, values, errors=None, data=None, weights=None)[source]

A histogram.

T

A transposed version of the Hist.

bins

A list of bin edge arrays, one for each dimension.

centers

A list of bin center arrays for each dimension.

contain(axis, frac=0.6826894921370859)[source]

Project the histogram onto a subset of its dimensions by taking the containment interval along axis.

Parameters:
  • axis (int) – the axis along which to measure containment and thus the dimensions no longer present in the resulting Hist.
  • frac (float) – the containment interval, measured from self.range[axis][0] and moving to the “right”
Returns:

Hist

contain_project(axis, frac=0.6826894921370859, n_sigma=None)[source]

Project the histogram taking median along one dimension, with errorbars reflecting the eliminated axis.

Parameters:
  • axis (int) – the axis along which to measure containment and thus the dimensions no longer present in the resulting Hist.
  • frac (float) – the containment fraction, which will be centered on the median value along the given axis
  • n_sigma (float) – the containment fraction, specified as a number of sigmas
Returns:

Hist

If given, n_sigma overrides frac.

cumsum(axes=[-1], normalize=False)[source]

Calculate the cumulative sum along specified axes (in order).

curve_fit(func, **kw)[source]

Fit a function to the histogram.

Parameters:func (function) – model function as in scipy.optimize.curve_fit().
Returns:popt, pcov as in scipy.optimize.curve_fit()

This function unravels the values and bin centers into n_dim 1D arrays which are then passed, along with any keyword arguments, to scipy.optimize.curve_fit().

data

The data used to construct the histogram (if given upon construction).

efficiency(base_hist)[source]

Get an efficiency Hist for this Hist divided by base_hist.

Parameters:base_hist (Hist) – The base histogram, of which this one should be a subset.

This method differs from __div__ in the way that errors are propagated.

errors

An nD array of bin errors (sqrt(sum(squares of weights))) in each bin.

gaussian_filter(*a, **kw)[source]

Smooth both values and errors with scipy.ndimage.gaussian_filter().

gaussian_filter1d(*a, **kw)[source]

Smooth both values and errors with scipy.ndimage.gaussian_filter1d().

get_error(*xs)[source]

Get the error value at the specified coordinates.

get_errors(*xs)[source]

Get the error values at the specified lists of coordinates.

get_value(*xs)[source]

Get the counts value at the specified coordinates.

get_values(*xs)[source]

Get the counts values at the specified lists of coordinates.

index(x, axis=0)[source]

The bin index for value x in dimension axis.

indices(*xs)[source]

Get the indices for the specified coordinates.

integrate(axes=None)[source]

Project the histogram onto a subset of its dimensions by integrating over axes.

Parameters:axes (sequence of int) – the axes along which to integrate and thus the dimensions no longer present in the resulting Hist.
Returns:Hist

log

A list of bools describing whether each dimension is binned linearly or logarithmically.

matches(other)[source]

True if self and other have the same binning.

median(axis)[source]

Project the histogram onto a subset of its dimensions by taking the median along axis.

Parameters:axis (int) – the axis along which to find the median and thus the dimensions no longer present in the resulting Hist.
Returns:Hist

n_bins

A list of the number of bins in each dimension.

n_dim

The number of dimensions in the histogram.

normalize(axes=None, integrate=True, density=False)[source]

Return a histogram normalized so the sums (or integrals) over the given axes are unity.

Parameters:
  • axes (sequence of int, optional) – the axes that will sum (or integrate) to unity for the normalized histogram
  • integrate (bool) – if True, normalize so the integral is unity; otherwise, normalize so the sum is unity
  • density (bool) – if True, normalize so the integral is unity, but as though the binning were linspaced, even if it is actually not. This option supersedes the integrate argument.
Returns:

Hist

The norm is found by summing over all axes other than the ones specified, or by summing over all axes if axes is not given. Note that setting density=True should obtain equivalent behavior to numpy.histogram(..., density=True).
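
For example (a small sketch of the density=True equivalence, not taken from the histlite documentation):

# sketch: density=True should match numpy.histogram(..., density=True)
import numpy as np
import histlite as hl
x = np.random.normal(size=1000)
h = hl.hist(x, bins=20, range=(-4, 4))
np_density, _ = np.histogram(x, bins=20, range=(-4, 4), density=True)
print(np.allclose(h.normalize(density=True).values, np_density))  # expect True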

project(axes=[-1], integrate=False)[source]

Project the histogram onto a subset of its dimensions by summing over axes other than those listed in axes.

Parameters:axes (sequence of int) – the axes along which NOT to sum or integrate, and thus the dimensions still present in the resulting Hist.
Returns:Hist

range

A list of (min_value,max_value) tuples for each dimension.

ravel_indices(*xs)[source]

Get the indices into values.ravel() for the specified coordinates.

Index of -1 indicates out-of-bounds

rebin(axis, bins, tol=0.0001)[source]

Produce coarser binning along the given axis.

Parameters:
  • axis (int) – the axis along which to rebin
  • bins (sequence) – the new bin edges
  • tol (float) – the absolute error between the given bins and ones found in the original histogram
Returns:

Hist

Each bin in bins should be contained in the existing bins, and the endpoints should match. Tolerance for bin agreement is given as an absolute error by tol.
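
For example (a small sketch, merging pairs of adjacent bins):

# sketch: coarsen a histogram by keeping every other original bin edge
import numpy as np
import histlite as hl
h = hl.hist(np.random.normal(size=1000), bins=20, range=(-4, 4))
h_coarse = h.rebin(0, h.bins[0][::2])   # new edges are a subset of the old edges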

sample(n_samples=1, *values, **kw)[source]

Draw n samples from the data.

Parameters:n_samples (int) – the number of samples

Any given values select bins in the first len(values) dimensions, such that sampling is done only from the remaining dimensions.

Returns:tuple of arrays of length n_dim

spline_fit(log=False, floor=None, *a, **kw)[source]

Get a scipy spline fit to the histogram.

Parameters:
  • log (bool) – whether to fit in log-value or linear-value
  • floor (float) – 10**floor is the arbitrary small number to stand in for zeros when log is true. If not set, log10 of 0.1 times the smallest nonzero value will be used.
Returns:

SplineFit

This method produces a spline fit to the histogram values for 1D or 2D histograms. The domain of the fitted spline will be the same as that of the histogram.

sum(axes=None, integrate=False)[source]

Project the histogram onto a subset of its dimensions by summing over axes.

Parameters:
  • axes (sequence of int) – the axes along which to sum and thus the dimensions no longer present in the resulting Hist.
  • integrate (bool) – if True, evaluate sum of (value * width) rather than just value.
Returns:

Hist

values

An nD array of bin values (sum of weights in each bin).

volumes

An nD array of bin volumes (product of bin widths in each dimension).

weights

The weights used to construct the histogram (if given upon construction).

widths

A list of bin width arrays for each dimension.

class histlite.LineStyle(line=True, marker='None', errorbars=False, errorcaps=False, errorbands=False, xerrorbars=False, crosses=False, poisson=False, **kwargs)[source]

Style object for 1D Hist s.

copy(**kwargs)[source]

Get a copy of this LineStyle, updating the given keyword args.

All arguments accepted by the LineStyle constructor may be given, including line, markers, errorbars, errorcaps, and arbitrary matplotlib arguments.

errorbands

Whether to draw error bands.

errorbars

Whether to draw error bars.

errorcaps

Whether to draw error bar caps.

kwargs

Keyword args for matplotlib.axes.Axes.errorbar().

line

Whether to draw a line.

marker

The marker to use.

markers

Whether to draw point markers.

poisson

Whether to draw Poisson error bars.

update(**kwargs)[source]

Update the keyword args with the given values.

xerrorbars

Whether to draw horizontal (x) error bars.

class histlite.SplineFit(hist, spline, bin_logs, log, floor)[source]

Wrapper for spline fits to histograms.

histlite.fill_between(ax, h1, h2, interpolate=False, drawstyle='steps', **kw)[source]

Fill the region between histograms h1 and h2.

Parameters:
  • ax (matplotlib Axes) – the axes on which to plot
  • h1 (number or Hist) – a number or the first histogram
  • h2 (number or Hist) – a number or the second histogram
  • where – see pyplot.fill_between().
  • interpolate – see pyplot.fill_between().
  • drawstyle (str) – if ‘line’, plot smooth curves; otherwise, plot with histogram steps

Other keyword arguments are passed to pyplot.fill_between().

histlite.hist(data, weights=None, bins=10, range=None, log=False, round_int_bins=False, keep_data=False)[source]

Factory function for Hist.

Parameters:
  • data (array-like or tuple of array-like) – the source data for the histogram: a tuple of array-like each of length (number of samples), or just a single array of that length
  • weights (array-like) – the weights of the source datapoints
  • bins (sequence or int, optional) – a numpy.histogramdd() bins specification
  • range (sequence, optional) – a numpy.histogramdd() range specification
  • log (sequence or bool) – if bins gives bin counts rather than edges, log=True causes logarithmic bin edges to be chosen (can be given per-dimension)
  • keep_data (bool) – whether to keep the data and weights for the histogram
Returns:

the Hist

histlite.hist_bootstrap(N, data, weights=None, *args, **kwargs)[source]

Like hist(), but for N iterations of sampling from data with replacement.

Parameters:
  • stacked (bool) – if True, output is an ndim+1 dimensional Hist where the first dimension has n_bins=N and values are sorted along this first dimension. Otherwise, output is ndim dimensional.
  • errors (str) – ‘original’ to obtain errorbars from standard (non-bootstrapped) Hist; ‘bootstrap’ to obtain errors from containment interval along the Hist that would have been returned if stacked=True. default is ‘bootstrap’
  • frac (float in [0,1]) – fraction of samples to use in each iteration. default is 1.0
  • slide_Ns (float or sequence of float) – if given, use hist_slide() instead of hist(), with the given number(s) of sliding binnings

histlite.hist_direct(data, weights=None, bins=None, range=None, keep_data=False)[source]

Fast factory function for Hist.

Parameters:
  • data (array-like) – the data to be histogrammed. For multidimensional histograms, data will be transposed for input into np.histogramdd().
  • weights (array-like) – the weights
  • bins (sequence or int, optional) – a np.histogramdd() bins specification
  • range – a np.histogramdd() range specification
  • keep_data (bool) – whether to keep the data and weights for the histogram
Returns:

the Hist

This method creates a Hist by calling np.histogramdd() as directly as possible. Note that this requires a slightly more constrained format for data compared with hist().

histlite.hist_from_eval(f, vectorize=True, err_f=None, ndim=None, **kwargs)[source]

Create a Hist by evaluating a function.

Parameters:
  • f (callable) – the function to evaluate
  • ndim (int) – number of arguments to function
  • vectorize (bool) – whether numpy.vectorize is needed to evaluate f over many sets of values
  • err_f (callable) – if given, function to evaluate to obtain “errors”

All other keyword arguments define the binning the same as for hist().

This function supersedes hist_from_function().

histlite.hist_from_function(bins, func, *args, **kwargs)[source]

[Deprecated] Create a Hist by evaluating func at the centers of specified bins.

Parameters:
  • bins (list of np.array) – the bin edges
  • func (function) – the function to evaluate. it should return floats and respect numpy broadcasting rules. if splat is true, it should take len(bins) (that is, n_dim) arguments; otherwise, it should accept a single argument containing all values of all independent variables
  • err_func (function) – if given, the function that returns the uncertainty at each part of the parameter space
  • splat (bool) – determines the signature of func as described above (default: True)
Returns:

the Hist

Any additional positional or keyword arguments are passed to func.

Note: This function is now deprecated; use hist_from_eval() instead.

histlite.hist_like(other, data, weights=None, keep_data=False)[source]

Create a Hist using the same binning as other.

Parameters:
  • other (Hist) – the other Hist
  • data (array-like) – the data to be histogrammed. For multidimensional histograms, data will be transposed for input into np.histogramdd().
  • weights (array-like) – the weights
Returns:

the Hist

histlite.hist_like_indices(other, ravel_indices, weights=None)[source]

Create a Hist using pre-computed per-sample indices from other.

Parameters:
  • other (Hist) – the pre-existing histogram which defines the binning
  • ravel_indices (array-like) – result of a Hist.ravel_indices() call for the samples.
  • weights (array-like) – the weights of the source datapoints

histlite.hist_slide(Ns, data, weights=None, *args, **kwargs)[source]

Construct a “histogram” from N partially-overlapping Hist’s.

Parameters:
  • Ns (int or sequence of int) – number of sliding iterations, optionally per-axis
  • data (array-like or tuple of array-like) – the source data for the histogram: a tuple of array-like each of length (number of samples), or just a single array of that length
  • weights (array-like) – the weights of the source datapoints
  • indices (sequence of array-like) – ravel_indices for each individual histogram, obtained from a previous hist_slide call
  • get_indices (bool) – if True, obtain ravel_indices values for use in later hist_slide calls

histlite.plot1d(ax, h=None, style=None, **kwargs)[source]

Plot 1D Hist h on matplotlib Axes ax.

Parameters:
  • ax (matplotlib Axes) – if given, the axes on which to plot
  • h (Hist) – the 1D histogram
  • style (LineStyle) – the line style

Other keyword arguments are propagated to pyplot.errorbar() and pyplot.plot() as appropriate.

histlite.plot2d(ax, h=None, log=False, cbar=False, levels=None, **kwargs)[source]

Plot 2D Hist h on ax on a color scale.

Parameters:
  • ax (matplotlib Axes) – The main axes on which to plot.
  • h (Hist) – The 2D histogram to plot.
  • log (bool) – Whether to use a log color scale
  • cbar (bool) – If true, draw colorbar.
  • levels (int or array of float) – if given, plot with contourf rather than pcolormesh. if a number is given, automatically select that many levels between vmin and vmax.
  • zmin (float) – Minimum value to plot with color; bins below the minimum value will be white.
Returns:

If cbar, a dict containing a matplotlib.collection.QuadMesh and a matplotlib.colorbar.Colorbar as values; otherwise, just the QuadMesh.

Other keyword arguments are passed to ax.pcolormesh().

histlite.reindex(a, order)[source]

Rearrange the axes of a multidimensional array.

Parameters:
  • a (ndarray) – the input array
  • order (sequence of int) – the axis that should wind up in each ordinal position
Returns:

reindexed array

Note: this is useful for implementing Hist.sum(), etc., but you probably should prefer np.swapaxes, possibly using multiple applications, instead.

histlite.stack1d(ax, hs, colors=None, labels=None, kws=None, interpolate=False, drawstyle='steps', ymin=0, **morekw)[source]

Stack histograms hs using fill_between().

Parameters:
  • ax (matplotlib Axes) – the axes on which to plot
  • hs (sequence of Hist) – the histograms
  • colors (sequence) – the fill colors
  • labels (sequence of str) – the labels
  • kws (sequence of str to value mappings) – keyword arguments for individual fills
  • interpolate – see pyplot.fill_between()
  • drawstyle (str) – if ‘line’, plot smooth curves; otherwise, plot with histogram steps
  • ymin (number) – minimum value (useful for semilogy plots)

Other keyword arguments are passed to each fill_between() call.

histlite.unreindex(a, order)[source]

Reverse the effects of reindex().

Parameters:
  • a (ndarray) – the already reindexed array
  • order (sequence of int) – order previously applied to reindex()
Returns:

unreindexed array

Note: this is useful for implementing Hist.sum(), etc., but you probably should prefer np.swapaxes, possibly using multiple applications, instead.