# Introduction

Measurements in High Energy Physics (HEP) rely on determining the
compatibility of observed collision events with theoretical predictions.
The relationship between them is often formalised in a statistical *model*
\(f(\bm{x}|\fullset)\) describing the probability of data
\(\bm{x}\) given model parameters \(\fullset\). Given observed
data, the *likelihood* \(\mathcal{L}(\fullset)\) then serves as the basis to test
hypotheses on the parameters \(\fullset\). For measurements based
on binned data (*histograms*), the \(\HiFa{}\) family of statistical models has been widely used
in both Standard Model measurements [4] as
well as searches for new
physics [5]. In this package, a
declarative, plain-text format for describing \(\HiFa{}\)-based likelihoods is
presented that is targeted for reinterpretation and long-term
preservation in analysis data repositories such as
HEPData [3].

## HistFactory

Statistical models described using \(\HiFa{}\) [2]
center around the simultaneous measurement of disjoint binned
distributions (*channels*) observed as event counts \(\channelcounts\). For
each channel, the overall expected event rate 1 is the sum over a
number of physics processes (*samples*). The sample rates may be subject to
parametrised variations, both to express the effect of *free parameters*
\(\freeset\) 2 and to account for systematic uncertainties as a
function of *constrained parameters* \(\constrset\). The degree to which the latter can cause
a deviation of the expected event rates from the nominal rates is
limited by *constraint terms*. In a frequentist framework these constraint terms can be
viewed as *auxiliary measurements* with additional global observable data \(\auxdata\), which
paired with the channel data \(\channelcounts\) completes the
observation \(\bm{x} =
(\channelcounts,\auxdata)\). In addition to the partition of the full
parameter set into free and constrained parameters \(\fullset =
(\freeset,\constrset)\), a separate partition \(\fullset =
(\poiset,\nuisset)\) will be useful in the context of hypothesis testing,
where a subset of the parameters are declared *parameters of interest* \(\poiset\) and the
remaining ones as *nuisance parameters* \(\nuisset\).

Thus, the overall structure of a \(\HiFa{}\) probability model is the product of the analysis-specific model term describing the measurements of the channels and the analysis-independent set of constraint terms:

\[
f(\bm{x} \mid \fullset) = \prod_{c\,\in\,\mathrm{channels}} \prod_{b\,\in\,\mathrm{bins}_c} \mathrm{Pois}\left(n_{cb} \mid \nu_{cb}(\freeset,\constrset)\right) \cdot \prod_{\singleconstr\,\in\,\constrset} c_\singleconstr(a_\singleconstr \mid \singleconstr),
\]

where within a certain integrated luminosity we observe \(n_{cb}\) events given the expected rate of events \(\nu_{cb}(\freeset,\constrset)\) as a function of unconstrained parameters \(\freeset\) and constrained parameters \(\constrset\). The latter have corresponding one-dimensional constraint terms \(c_\singleconstr(a_\singleconstr \mid \singleconstr)\) with auxiliary data \(a_\singleconstr\) constraining the parameter \(\singleconstr\). The event rates \(\nu_{cb}\) are defined as

\[
\nu_{cb}(\freeset,\constrset) = \sum_{s\,\in\,\mathrm{samples}} \nu_{scb}(\freeset,\constrset) = \sum_{s\,\in\,\mathrm{samples}} \left(\prod_{\kappa\,\in\,\bm{\kappa}} \kappa_{scb}(\freeset,\constrset)\right) \left(\nu_{scb}^0 + \sum_{\Delta\,\in\,\bm{\Delta}} \Delta_{scb}(\freeset,\constrset)\right).
\]

The total rates are the sum over sample rates \(\nu_{scb}\), each
determined from a *nominal rate* \(\nu_{scb}^0\) and a set of multiplicative and
additive *rate modifiers*, denoted \(\bm{\kappa}(\fullset)\) and
\(\bm{\Delta}(\fullset)\). These modifiers are functions of (usually
a single) model parameter. Starting from constant nominal rates, one
can derive the per-bin event rate modification by iterating over all
sample rate modifications as shown in (3).
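The multiplicative/additive combination described above can be sketched numerically; the numbers here are made up for illustration and are not taken from the text:

```python
import numpy as np

# Hypothetical numbers: one channel, one sample, two bins.
nominal = np.array([50.0, 60.0])          # nominal rates nu^0_{sb}
multiplicative = [np.array([1.1, 1.1])]   # kappa-type modifiers
additive = [np.array([0.0, -2.0])]        # Delta-type modifiers

# nu_b = (product of kappas) * (nominal + sum of Deltas), applied bin-wise
rate = np.prod(multiplicative, axis=0) * (nominal + np.sum(additive, axis=0))
```

Each additional modifier simply contributes one more factor or one more summand per bin.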

As summarised in Modifiers and Constraints, rate modifications are defined in \(\HiFa{}\) for bin \(b\), sample \(s\), channel \(c\). Each modifier is represented by a parameter \(\phi \in \{\gamma, \alpha, \lambda, \mu\}\). By convention bin-wise parameters are denoted with \(\gamma\) and interpolation parameters with \(\alpha\). The luminosity \(\lambda\) and scale factors \(\mu\) affect all bins equally. For constrained modifiers, the implied constraint term is given as well as the necessary input data required to construct it. \(\sigma_b\) corresponds to the relative uncertainty of the event rate, whereas \(\delta_b\) is the event rate uncertainty of the sample relative to the total event rate \(\nu_b = \sum_s \nu^0_{sb}\).

Modifiers implementing uncertainties are paired with a corresponding default constraint term on the parameter limiting the rate modification. The available modifiers may affect only the total number of expected events of a sample within a given channel, i.e. only change its normalisation, while holding the distribution of events across the bins of a channel, i.e. its "shape", invariant. Alternatively, modifiers may change the sample shapes. Here \(\HiFa{}\) supports correlated and uncorrelated bin-by-bin shape modifications. In the former, a single nuisance parameter affects the expected sample rates within the bins of a given channel, while the latter introduces one nuisance parameter for each bin, each with their own constraint term. For the correlated shape and normalisation uncertainties, \(\HiFa{}\) makes use of interpolating functions, \(f_p\) and \(g_p\), constructed from a small number of evaluations of the expected rate at fixed values of the parameter \(\alpha\) 3. For the remaining modifiers, the parameter directly affects the rate.

| Description | Modification | Constraint Term \(c_\singleconstr\) | Input |
|---|---|---|---|
| Uncorrelated Shape | \(\kappa_{scb}(\gamma_b) = \gamma_b\) | \(\prod_b \mathrm{Pois}\left(r_b = \sigma_b^{-2}\middle\vert\,\rho_b = \sigma_b^{-2}\gamma_b\right)\) | \(\sigma_{b}\) |
| Correlated Shape | \(\Delta_{scb}(\alpha) = f_p\left(\alpha\middle\vert\,\Delta_{scb,\alpha=-1},\Delta_{scb,\alpha = 1}\right)\) | \(\mathrm{Gaus}\left(a = 0\middle\vert\,\alpha,\sigma = 1\right)\) | \(\Delta_{scb,\alpha=\pm1}\) |
| Normalisation Unc. | \(\kappa_{scb}(\alpha) = g_p\left(\alpha\middle\vert\,\kappa_{scb,\alpha=-1},\kappa_{scb,\alpha=1}\right)\) | \(\mathrm{Gaus}\left(a = 0\middle\vert\,\alpha,\sigma = 1\right)\) | \(\kappa_{scb,\alpha=\pm1}\) |
| MC Stat. Uncertainty | \(\kappa_{scb}(\gamma_b) = \gamma_b\) | \(\prod_b \mathrm{Gaus}\left(a_{\gamma_b} = 1\middle\vert\,\gamma_b,\delta_b\right)\) | \(\delta_b^2 = \sum_s\delta^2_{sb}\) |
| Luminosity | \(\kappa_{scb}(\lambda) = \lambda\) | \(\mathrm{Gaus}\left(l = \lambda_0\middle\vert\,\lambda,\sigma_\lambda\right)\) | \(\lambda_0,\sigma_\lambda\) |
| Normalisation | \(\kappa_{scb}(\mu_b) = \mu_b\) | | |
| Data-driven Shape | \(\kappa_{scb}(\gamma_b) = \gamma_b\) | | |

Given the likelihood \(\mathcal{L}(\fullset)\), constructed from
observed data in all channels and the implied auxiliary data, *measurements* in the
form of point and interval estimates can be defined. The majority of the
parameters are *nuisance parameters*, i.e. parameters that are not the main target of the
measurement but are necessary to correctly model the data. A small
subset of the unconstrained parameters may be declared as *parameters of interest*, for which
measurements and hypothesis tests are performed, e.g. with profile likelihood
methods [1]. The Symbol Notation table provides a summary of all the
notation introduced in this documentation.

| Symbol | Name |
|---|---|
| \(f(\bm{x} \mid \fullset)\) | model |
| \(\mathcal{L}(\fullset)\) | likelihood |
| \(\bm{x} = \{\channelcounts, \auxdata\}\) | full dataset (including auxiliary data) |
| \(\channelcounts\) | channel data (or event counts) |
| \(\auxdata\) | auxiliary data |
| \(\nu(\fullset)\) | calculated event rates |
| \(\fullset = \{\freeset, \constrset\} = \{\poiset, \nuisset\}\) | all parameters |
| \(\freeset\) | free parameters |
| \(\constrset\) | constrained parameters |
| \(\poiset\) | parameters of interest |
| \(\nuisset\) | nuisance parameters |
| \(\bm{\kappa}(\fullset)\) | multiplicative rate modifier |
| \(\bm{\Delta}(\fullset)\) | additive rate modifier |
| \(c_\singleconstr(a_\singleconstr \mid \singleconstr)\) | constraint term for constrained parameter \(\singleconstr\) |
| \(\sigma_\singleconstr\) | relative uncertainty in the constrained parameter |

## Declarative Formats

While flexible enough to describe a wide range of LHC measurements, the
design of the \(\HiFa{}\) specification is sufficiently simple to admit a *declarative format* that fully
encodes the statistical model of the analysis. This format defines the
channels, all associated samples, their parameterised rate modifiers and
implied constraint terms as well as the measurements. Additionally, the
format represents the mathematical model, leaving the implementation of
the likelihood minimisation to be analysis-dependent and/or
language-dependent. Originally XML was chosen as a specification
language to define the structure of the model while introducing a
dependence on \(\Root{}\) to encode the nominal rates and required input data of the
constraint terms [2]. Using this
specification, a model can be constructed and evaluated within the
\(\RooFit{}\) framework.

This package introduces an updated form of the specification based on
the ubiquitous plain-text JSON format and its schema-language *JSON Schema*.
Described in more detail in Likelihood Specification, this schema fully specifies both structure
and necessary constrained data in a single document and thus is
implementation independent.

## Additional Material

### Footnotes

- 1
Here rate refers to the number of events expected to be observed within a given data-taking interval defined through its integrated luminosity. It often appears as the input parameter to the Poisson distribution, hence the name "rate".

- 2
These *free parameters* frequently include the signal strength of a given process, i.e. its cross-section normalised to a particular reference cross-section such as that expected from the Standard Model or a given BSM scenario.

- 3
This is usually constructed from the nominal rate and measurements of the event rate at \(\alpha=\pm1\), where the value of the modifier at \(\alpha=\pm1\) must be provided and the value at \(\alpha=0\) corresponds to the identity operation of the modifier, i.e. \(f_{p}(\alpha=0) = 0\) and \(g_{p}(\alpha = 0)=1\) for additive and multiplicative modifiers respectively. See Section 4.1 in [2].

### Bibliography

- 1
Glen Cowan, Kyle Cranmer, Eilam Gross, and Ofer Vitells. Asymptotic formulae for likelihood-based tests of new physics. *Eur. Phys. J. C*, 71:1554, 2011. arXiv:1007.1727, doi:10.1140/epjc/s10052-011-1554-0.

- 2
Kyle Cranmer, George Lewis, Lorenzo Moneta, Akira Shibata, and Wouter Verkerke. HistFactory: A tool for creating statistical models for use with RooFit and RooStats. Technical Report CERN-OPEN-2012-016, New York U., New York, Jan 2012. URL: https://cds.cern.ch/record/1456844.

- 3
Eamonn Maguire, Lukas Heinrich, and Graeme Watt. HEPData: a repository for high energy physics data. *J. Phys. Conf. Ser.*, 898(10):102006, 2017. arXiv:1704.05473, doi:10.1088/1742-6596/898/10/102006.

- 4
ATLAS Collaboration. Measurements of Higgs boson production and couplings in diboson final states with the ATLAS detector at the LHC. *Phys. Lett. B*, 726:88, 2013. arXiv:1307.1427, doi:10.1016/j.physletb.2014.05.011.

- 5
ATLAS Collaboration. Search for supersymmetry in final states with missing transverse momentum and multiple \(b\)-jets in proton-proton collisions at \(\sqrt s = 13\) \(\TeV\) with the ATLAS detector. ATLAS-CONF-2018-041, 2018. URL: https://cds.cern.ch/record/2632347.

# Likelihood Specification

The structure of the JSON specification of models follows closely the original XML-based specification [2].

## Workspace

```
{
  "$schema": "http://json-schema.org/draft-06/schema#",
  "$id": "https://diana-hep.org/pyhf/schemas/1.0.0/workspace.json",
  "type": "object",
  "properties": {
    "channels": { "type": "array", "items": { "$ref": "defs.json#/definitions/channel" } },
    "measurements": { "type": "array", "items": { "$ref": "defs.json#/definitions/measurement" } },
    "observations": { "type": "array", "items": { "$ref": "defs.json#/definitions/observation" } },
    "version": { "const": "1.0.0" }
  },
  "additionalProperties": false,
  "required": ["channels", "measurements", "observations", "version"]
}
```

The overall document in the above code snippet describes a *workspace*, which includes

- **channels**: The channels in the model, which include a description of the samples within each channel and their possible parametrised modifiers.
- **measurements**: A set of measurements, which define among other things the parameters of interest for a given statistical analysis objective.
- **observations**: The observed data, with which a likelihood can be constructed from the model.

A workspace consists of the channels and one set of observed data, but can include multiple measurements. Given a JSON file, one can quickly check that it conforms to the provided workspace specification as follows:

```
import json

import jsonschema
import requests

with open('/path/to/analysis_workspace.json') as read_file:
    workspace = json.load(read_file)
schema = requests.get('https://diana-hep.org/pyhf/schemas/1.0.0/workspace.json').json()
# validate() raises a jsonschema.ValidationError if the instance is invalid;
# if no exception is raised, the workspace conforms to the schema
jsonschema.validate(instance=workspace, schema=schema)
```

## Channel

A channel is defined by a channel name and a list of samples [1].

```
{
  "channel": {
    "type": "object",
    "properties": {
      "name": { "type": "string" },
      "samples": { "type": "array", "items": { "$ref": "#/definitions/sample" }, "minItems": 1 }
    },
    "required": ["name", "samples"],
    "additionalProperties": false
  }
}
```

The Channel specification consists of a list of channel descriptions.
Each channel, an analysis region encompassing one or more measurement
bins, consists of a `name` field and a `samples` field (see Channel), which
holds a list of sample definitions (see Sample). Each sample definition in
turn has a `name` field, a `data` field for the nominal event rates
for all bins in the channel, and a `modifiers` field holding the list of
modifiers for the sample.

## Sample

A sample is defined by a sample name, the sample event rate, and a list of modifiers [1].

```
{
  "sample": {
    "type": "object",
    "properties": {
      "name": { "type": "string" },
      "data": { "type": "array", "items": { "type": "number" }, "minItems": 1 },
      "modifiers": {
        "type": "array",
        "items": {
          "anyOf": [
            { "$ref": "#/definitions/modifier/histosys" },
            { "$ref": "#/definitions/modifier/lumi" },
            { "$ref": "#/definitions/modifier/normfactor" },
            { "$ref": "#/definitions/modifier/normsys" },
            { "$ref": "#/definitions/modifier/shapefactor" },
            { "$ref": "#/definitions/modifier/shapesys" },
            { "$ref": "#/definitions/modifier/staterror" }
          ]
        }
      }
    },
    "required": ["name", "data", "modifiers"],
    "additionalProperties": false
  }
}
```

## Modifiers

The modifiers that are applicable for a given sample are encoded as a list of JSON objects with three fields: a `name` field, a `type` field denoting the class of the modifier, and a `data` field which provides the necessary input data as denoted in Modifiers and Constraints.

Based on the declared modifiers, the set of parameters and their
constraint terms are derived implicitly, as each type of modifier
unambiguously defines the constraint terms it requires. Correlated shape
modifiers and normalisation uncertainties have compatible constraint
terms, and thus modifiers can be declared that *share* parameters by
re-using a name 1 for multiple modifiers. That is, a variation of a
single parameter causes a shift of the sample rates due to both shape
and normalisation variations.
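As a sketch of parameter sharing, assuming the field layout shown in the schema above (the modifier names and data values here are made up for illustration), two modifiers re-using the same name are controlled by a single parameter:

```python
# Hypothetical modifiers for one sample: a correlated-shape (histosys) and a
# normalisation (normsys) variation deliberately sharing the name "jes".
modifiers = [
    {"name": "jes", "type": "histosys",
     "data": {"hi_data": [55.0], "lo_data": [45.0]}},
    {"name": "jes", "type": "normsys",
     "data": {"hi": 1.05, "lo": 0.95}},
]

# The distinct modifier names determine the parameter sets of the model:
# two modifiers, but only one shared parameter.
parameter_sets = {m["name"] for m in modifiers}
```

A variation of the single `jes` parameter would then shift the sample rates through both the shape and the normalisation variation simultaneously.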

We review the structure of each modifier type below.

### Normalisation Uncertainty (normsys)

The normalisation uncertainty modifies the sample rate by an overall factor \(\kappa(\alpha)\) constructed as the interpolation between downward ("lo") and upward ("hi") as well as the nominal setting, i.e. \(\kappa(-1) = \kappa_{\alpha=-1}\), \(\kappa(0) = 1\) and \(\kappa(+1) = \kappa_{\alpha=+1}\). In the modifier definition we record \(\kappa_{\alpha=+1}\) and \(\kappa_{\alpha=-1}\) as floats. An example is shown below:

```
{ "name": "mod_name", "type": "normsys", "data": {"hi": 1.1, "lo": 0.9} }
```

An example of a normalisation uncertainty modifier with scale factors recorded for the up/down variations of an \(n\)-bin channel.
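One possible choice for the interpolating function \(g_p\) is exponential interpolation, which matches the recorded factors at \(\alpha=\pm1\) and keeps \(\kappa\) positive. The sketch below is an illustrative choice, not necessarily the interpolation code an implementation uses:

```python
def kappa(alpha, lo, hi):
    # Exponential interpolation: kappa(-1) = lo, kappa(0) = 1, kappa(+1) = hi.
    # This is one possible g_p; implementations may use other interpolation codes.
    return hi ** alpha if alpha >= 0 else lo ** (-alpha)
```

With the example data above, `kappa(0.5, 0.9, 1.1)` lies between 1 and the upward factor 1.1.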

### MC Statistical Uncertainty (staterror)Â¶

As the sample counts are often derived from Monte Carlo (MC) datasets, they necessarily carry an uncertainty due to the finite sample size of the datasets. As explained in detail in [2], adding uncertainties for each sample would yield a very large number of nuisance parameters with limited utility. Therefore a set of bin-wise scale factors \(\gamma_b\) is introduced to model the overall uncertainty in the bin due to MC statistics. The constraint term is constructed as a set of Gaussian constraints with a central value equal to unity for each bin in the channel. The scales \(\sigma_b\) of the constraint are computed from the individual uncertainties of samples defined within the channel relative to the total event rate of all samples: \(\delta_{csb} = \sigma_{csb}/\sum_s \nu^0_{scb}\). As not all samples within a channel are estimated from MC simulations, only the samples with a declared statistical uncertainty modifier enter the sum. An example is shown below:

```
{ "name": "mod_name", "type": "staterror", "data": [0.1] }
```

An example of a statistical uncertainty modifier.
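The relative uncertainty entering the constraint can be sketched as follows, with hypothetical per-sample absolute MC uncertainties for a single bin (in a real model, only samples declaring a `staterror` modifier would enter the sums):

```python
import numpy as np

# Hypothetical absolute MC uncertainties and nominal rates for one bin
mc_uncertainty = {"bkg1": np.array([3.0]), "bkg2": np.array([4.0])}
nominal_rate = {"bkg1": np.array([50.0]), "bkg2": np.array([50.0])}

nu_b = sum(nominal_rate.values())  # total nominal rate in the bin
# quadrature sum of the per-sample uncertainties relative to the total rate
delta_b = np.sqrt(sum(u ** 2 for u in mc_uncertainty.values())) / nu_b
```

Here the quadrature sum is 5.0 over a total rate of 100, i.e. a 5% relative uncertainty for the shared \(\gamma_b\) constraint.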

### Luminosity (lumi)

Sample rates derived from theory calculations, as opposed to data-driven estimates, are scaled to the integrated luminosity corresponding to the observed data. As the luminosity measurement is itself subject to an uncertainty, it must be reflected in the rate estimates of such samples. As this modifier is of global nature, no additional per-sample information is required and thus the data field is nulled. This uncertainty is relevant, in particular, when the parameter of interest is a signal cross-section. The luminosity uncertainty \(\sigma_\lambda\) is provided as part of the parameter configuration included in the measurement specification discussed in Measurements. An example is shown below:

```
{ "name": "mod_name", "type": "lumi", "data": null }
```

An example of a luminosity modifier.

### Unconstrained Normalisation (normfactor)

The unconstrained normalisation modifier scales the event rates of a sample by a free parameter \(\mu\). Common use cases are the signal rate of a possible BSM signal or simultaneous in-situ measurements of background samples. Such parameters are frequently the parameters of interest of a given measurement. No additional per-sample data is required. An example is shown below:

```
{ "name": "mod_name", "type": "normfactor", "data": null }
```

An example of a normalisation modifier.

### Data-driven Shape (shapefactor)

In order to support data-driven estimation of sample rates (e.g. for multijet backgrounds), the data-driven shape modifier adds free, bin-wise multiplicative parameters. Similarly to the normalisation factors, no additional data is required as no constraint is defined. An example is shown below:

```
{ "name": "mod_name", "type": "shapefactor", "data": null }
```

An example of an uncorrelated shape modifier.

## Data

The data provided by the analysis are the observed data for each channel (or region). This data is provided as a mapping from channel name to an array of floats, which provide the observed rates in each bin of the channel. The auxiliary data is not included as it is an input to the likelihood that does not need to be archived and can be determined automatically from the specification. An example is shown below:

```
{ "chan_name_one": [10, 20], "chan_name_two": [4, 0]}
```

An example of channel data.

## Measurements

Given the data and the model definitions, a measurement can be defined. In the current schema, the measurement defines the name of the parameter of interest as well as parameter set configurations. 2 Here, the remaining information not covered through the channel definition is provided, e.g. for the luminosity parameter. For all modifiers, the default settings can be overridden where possible:

- **inits**: Initial value of the parameter.
- **bounds**: Interval bounds of the parameter.
- **auxdata**: Auxiliary data for the associated constraint term.
- **sigmas**: Associated uncertainty of the parameter.

An example is shown below:

```
{
  "name": "MyMeasurement",
  "config": {
    "poi": "SignalCrossSection",
    "parameters": [
      { "name": "lumi", "auxdata": [1.0], "sigmas": [0.017], "bounds": [[0.915, 1.085]], "inits": [1.0] },
      { "name": "mu_ttbar", "bounds": [[0, 5]] },
      { "name": "rw_1CR", "fixed": true }
    ]
  }
}
```

An example of a measurement. This measurement, which scans over the parameter of interest `SignalCrossSection`, sets configurations for the luminosity modifier, changes the default bounds for the normfactor modifier named `mu_ttbar`, and specifies that the modifier `rw_1CR` is held constant (`fixed`).

## Observations

This is the data against which hypothesis tests are evaluated, e.g. to determine the compatibility of the signal-plus-background hypothesis with the background-only hypothesis. It is specified as a list of objects, with each object structured as

- **name**: the channel for which the observations are recorded
- **data**: the bin-by-bin observations for the named channel

An example is shown below:

```
{
  "name": "channel1",
  "data": [110.0, 120.0]
}
```

An example of an observation. This observation, recorded for a 2-bin channel `channel1`, has values `110.0` and `120.0`.

## Toy Example

```
{
  "channels": [
    {
      "name": "singlechannel",
      "samples": [
        {
          "name": "signal",
          "data": [5.0, 10.0],
          "modifiers": [ { "name": "mu", "type": "normfactor", "data": null } ]
        },
        {
          "name": "background",
          "data": [50.0, 60.0],
          "modifiers": [ { "name": "uncorr_bkguncrt", "type": "shapesys", "data": [5.0, 12.0] } ]
        }
      ]
    }
  ],
  "observations": [
    { "name": "singlechannel", "data": [50.0, 60.0] }
  ],
  "measurements": [
    { "name": "Measurement", "config": { "poi": "mu", "parameters": [] } }
  ],
  "version": "1.0.0"
}
```

In the above example, we demonstrate a simple measurement of a single two-bin channel with two samples: a signal sample and a background sample. The signal sample has an unconstrained normalisation factor \(\mu\), while the background sample carries an uncorrelated shape systematic controlled by parameters \(\gamma_1\) and \(\gamma_2\). The background uncertainty for the bins is 10% and 20% respectively.
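Under this toy model the expected rates follow directly from the sum of the scaled samples. A small sketch (the helper name is ours, not part of the specification):

```python
import numpy as np

# Sample rates from the toy workspace above
signal = np.array([5.0, 10.0])
background = np.array([50.0, 60.0])

def expected_rates(mu, gammas):
    # nu_b = mu * s_b + gamma_b * b_b for the two bins of "singlechannel"
    return mu * signal + np.asarray(gammas) * background
```

At the nominal parameter point (\(\mu=1\), \(\gamma_b=1\)) this gives expected rates of 55 and 70 events in the two bins.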

## Additional Material

### Footnotes

- 1
The name of a modifier specifies the parameter set it is controlled by. Modifiers with the same name share parameter sets.

- 2
In this context a parameter set corresponds to a named lower-dimensional subspace of the full parameters \(\fullset\). In many cases these are one-dimensional subspaces, e.g. a specific interpolation parameter \(\alpha\) or the luminosity parameter \(\lambda\). For multi-bin channels, however, e.g. all bin-wise nuisance parameters of the uncorrelated shape modifiers are grouped under a single name. Therefore in general a parameter set definition provides arrays of initial values, bounds, etc.

### Bibliography

- 1
HistFactory definitions schema. Accessed: 2019-06-20. URL: https://diana-hep.org/pyhf/schemas/1.0.0/defs.json.

- 2
Kyle Cranmer, George Lewis, Lorenzo Moneta, Akira Shibata, and Wouter Verkerke. HistFactory: A tool for creating statistical models for use with RooFit and RooStats. Technical Report CERN-OPEN-2012-016, New York U., New York, Jan 2012. URL: https://cds.cern.ch/record/1456844.

# Fundamentals


```
[1]:
```

```
%pylab inline
from ipywidgets import interact
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
```

```
Populating the interactive namespace from numpy and matplotlib
```

## Piecewise Linear Interpolation

References: https://cds.cern.ch/record/1456844/files/CERN-OPEN-2012-016.pdf

We wish to understand interpolation using the piecewise linear function. This is `interpcode=0` in the above reference. This function is defined as (nb: vector notation denotes bold)

\[
\eta_s(\vec{\alpha}) = \sigma_{sb}^0 + \sum_{p\,\in\,\text{Syst}} I_\text{lin.}\left(\alpha_p; \sigma_{sb}^0, \sigma_{psb}^+, \sigma_{psb}^-\right)
\]

with

\[
I_\text{lin.}(\alpha; I^0, I^+, I^-) = \begin{cases} \alpha(I^+ - I^0) & \alpha \geq 0\\ \alpha(I^0 - I^-) & \alpha < 0 \end{cases}
\]

In this notebook, we'll demonstrate the technical implementation of these interpolations, starting from simple dimensionality and increasing the dimensions as we go along. In all situations, we'll consider a single systematic that we wish to interpolate, such as the Jet Energy Scale (JES).

Let's define the interpolate function. This function will produce the deltas we would like to calculate and sum with the nominal measurement to determine the interpolated measurement's value.

```
[2]:
```

```
def interpolate_deltas(down, nom, up, alpha):
    delta_up = up - nom
    delta_down = nom - down
    if alpha > 0:
        return delta_up * alpha
    else:
        return delta_down * alpha
```

Why are we calculating deltas? This is some additional foresight that you, the reader, may not have yet. Multiple interpolation schemes exist but they all rely on calculating the change with respect to the nominal measurement (the delta).
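Anticipating the tensorized treatment later on, the same deltas can also be computed for many \(\alpha\) values at once using broadcasting. This is our own sketch, not the notebook's code:

```python
import numpy as np

def interpolate_deltas_vec(down, nom, up, alphas):
    # Broadcast a column of alphas against the bin axis of the histograms:
    # positive alphas scale the upward delta, non-positive the downward one.
    down, nom, up = (np.asarray(a, dtype=float) for a in (down, nom, up))
    alphas = np.asarray(alphas, dtype=float)[:, None]
    return np.where(alphas > 0, (up - nom) * alphas, (nom - down) * alphas)
```

Each row of the result corresponds to one \(\alpha\) value and each column to a bin, which is exactly the 2-dimensional tensor built up step by step below.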

### Case 1: The Single-binned Histogram

Let's start by considering the evaluation of the total number of events after applying JES corrections. This is the single-bin case. Code that runs through event selection will vary the JES parameter and provide three histograms, each with a single bin. These three histograms represent the nominal, up, and down variations of the JES nuisance parameter.

When processing, we find that there are 10 events nominally, and when we vary the JES parameter downwards, we only measure 8 events. When varying upwards, we measure 15 events.

```
[3]:
```

```
down_1 = np.array([8])
nom_1 = np.array([10])
up_1 = np.array([15])
```

We would like to generate a function \(f(\alpha_\text{JES})\) that linearly interpolates the number of events for us so we can scan the phase-space for calculating PDFs. The `interpolate_deltas()` function defined above does this for us.

```
[4]:
```

```
alphas = np.linspace(-1.,1.)
deltas = [interpolate_deltas(down_1, nom_1, up_1, alpha) for alpha in alphas]
deltas[:5]
```

```
[4]:
```

```
[array([-2.]),
array([-1.91836735]),
array([-1.83673469]),
array([-1.75510204]),
array([-1.67346939])]
```

So now that we've generated the deltas from the nominal measurement, we can plot this to see how the linear interpolation works in the single-bin case, where we plot the measured values in black and the interpolation as a dashed, blue line.

```
[5]:
```

```
plt.plot(alphas,[nom_1+delta for delta in deltas], linestyle='--')
plt.scatter((-1,0,1),(down_1,nom_1,up_1), color='k')
plt.xlabel(r'$\alpha_\mathrm{JES}$')
plt.ylabel(r'Events')
```

```
[5]:
```

```
Text(0,0.5,'Events')
```

Here, we can imagine building a 1-dimensional tensor (column-vector) of measurements as a function of \(\alpha_\text{JES}\) with each row in the column vector corresponding to a given \(\alpha_\text{JES}\) value.
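That mental picture can be made concrete with a small sketch of our own, reusing the single-bin numbers from above:

```python
import numpy as np

def interpolate_deltas(down, nom, up, alpha):
    # same piecewise linear rule as defined earlier in the notebook
    return (up - nom) * alpha if alpha > 0 else (nom - down) * alpha

alphas = np.linspace(-1.0, 1.0, 5)
# one row per alpha value; the single bin sits in the lone column
column = np.array([[10.0 + interpolate_deltas(8.0, 10.0, 15.0, a)] for a in alphas])
```

The column vector runs from the down variation (8 events) through the nominal (10) up to the up variation (15) as \(\alpha_\text{JES}\) scans from \(-1\) to \(+1\).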

### Case 2: The Multi-binned Histogram

Now, let's increase the computational difficulty a little by increasing the dimensionality. Assume instead of a single-bin measurement, we have more measurements! We are good physicists after all. Imagine continuing on the previous example, where we add more bins, perhaps because we got more data. Imagine that this was binned by collection year, where we observed 10 events in the first year, 10.3 the next year, and so on…

```
[6]:
```

```
down_hist = np.linspace(8,10,11)
nom_hist = np.linspace(10,13,11)
up_hist = np.linspace(15,20,11)
```

Now, we still need to interpolate. Just like before, we have varied JES upwards and downwards to determine the corresponding histograms of variations. In order to interpolate, we need to interpolate by bin for each bin in the three histograms we have here (or three measurements if you prefer).

Let's go ahead and plot these histograms as a function of the bin index, with black as the nominal measurements and red and blue as the down and up variations respectively. The black points are the measurements we have, and for each bin we would like to interpolate to get an interpolated histogram that represents the measurement as a function of \(\alpha_\text{JES}\).

```
[7]:
```

```
def plot_measurements(down_hist, nom_hist, up_hist):
    bincenters = np.arange(len(nom_hist))
    for i, h in enumerate(zip(up_hist, nom_hist, down_hist)):
        plt.scatter([i] * len(h), h, color='k', alpha=0.5)
    for c, h in zip(['r', 'k', 'b'], [down_hist, nom_hist, up_hist]):
        plt.plot(bincenters, h, color=c, linestyle='-', alpha=0.5)
    plt.xlabel('Bin index in histogram')
    plt.ylabel('Events')


plot_measurements(down_hist, nom_hist, up_hist)
```

What does this look like if we evaluate at a single \(\alpha_\text{JES} = 0.5\)? We'll write a function that interpolates and then plots the interpolated values as a function of bin index, as a green, dashed line.

```
[8]:
```

```
def plot_interpolated_histogram(alpha, down_hist, nom_hist, up_hist):
    bincenters = np.arange(len(nom_hist))
    interpolated_vals = [
        nominal + interpolate_deltas(down, nominal, up, alpha)
        for down, nominal, up in zip(down_hist, nom_hist, up_hist)
    ]
    plot_measurements(down_hist, nom_hist, up_hist)
    plt.plot(bincenters, interpolated_vals, color='g', linestyle='--')


plot_interpolated_histogram(0.5, down_hist, nom_hist, up_hist)
```

We can go one step further in visualization and see what it looks like for different \(\alpha_\text{JES}\) using ipywidgets' interactivity. Change the slider to get an idea of how the interpolation works.

```
[9]:
```

```
x = interact(lambda alpha: plot_interpolated_histogram(alpha, down_hist, nom_hist, up_hist), alpha = (-1,1,0.1))
```

The magic in `plot_interpolated_histogram()` happens to be that for a given \(\alpha_\text{JES}\), we iterate over all measurements bin-by-bin to calculate the interpolated value

```
[nominal + interpolate_deltas(down, nominal, up, alpha) for down, nominal, up in zip(...hists...)]
```

So you can imagine that we're building up a 2-dimensional tensor with each row corresponding to a different \(\alpha_\text{JES}\) and each column corresponding to the bin index of the histograms (or measurements). Let's go ahead and build a 3-dimensional representation of our understanding so far!

```
[10]:
```

```
def interpolate_alpha_range(alphas, down_hist, nom_hist, up_hist):
    at_alphas = []
    for alpha in alphas:
        interpolated_hist_at_alpha = [
            nominal + interpolate_deltas(down, nominal, up, alpha)
            for down, nominal, up in zip(down_hist, nom_hist, up_hist)
        ]
        at_alphas.append(interpolated_hist_at_alpha)
    return np.array(at_alphas)
```

And then with this, we are interpolating over all histograms bin-by-bin and producing a 2-dimensional tensor with each row corresponding to a specific value of \(\alpha_\text{JES}\).

```
[11]:
```

```
alphas = np.linspace(-1,1,11)
interpolated_vals_at_alphas = interpolate_alpha_range(alphas, down_hist, nom_hist, up_hist)
print(interpolated_vals_at_alphas[alphas==-1])
print(interpolated_vals_at_alphas[alphas==0])
print(interpolated_vals_at_alphas[alphas==1])
```

```
[[ 8. 8.2 8.4 8.6 8.8 9. 9.2 9.4 9.6 9.8 10. ]]
[[10. 10.3 10.6 10.9 11.2 11.5 11.8 12.1 12.4 12.7 13. ]]
[[15. 15.5 16. 16.5 17. 17.5 18. 18.5 19. 19.5 20. ]]
```

We have a way to generate the 2-dimensional tensor. Let's go ahead and add in all dimensions. Additionally, we'll add in some extra code to show the projection of the 2-d plots that we made earlier to help understand the 3-d plot a bit better. Like before, let's plot specifically colored lines for \(\alpha_\text{JES}=0.5\) as well as provide an interactive session.

```
[13]:
```

```
def plot_wire(alpha):
    alphas = np.linspace(-1, 1, 51)
    at_alphas = interpolate_alpha_range(alphas, down_hist, nom_hist, up_hist)
    bincenters = np.arange(len(nom_hist))
    x, y = np.meshgrid(bincenters, alphas)
    z = np.asarray(at_alphas)
    bottom = np.zeros_like(x)
    fig = plt.figure(figsize=(10, 10))
    ax1 = fig.add_subplot(111, projection='3d')
    ax1.plot_wireframe(x, y, z, alpha=0.3)
    x, y = np.meshgrid(bincenters, [alpha])
    z = interpolate_alpha_range([alpha], down_hist, nom_hist, up_hist)
    ax1.plot_wireframe(x, y, z, edgecolor='g', linestyle='--')
    ax1.set_xlim(0, 10)
    ax1.set_ylim(-1.0, 1.5)
    ax1.set_zlim(0, 25)
    ax1.view_init(azim=-125)
    ax1.set_xlabel('Bin Index')
    ax1.set_ylabel(r'$\alpha_\mathrm{JES}$')
    ax1.set_zlabel('Events')
    # add in 2D plot goodness
    for c, h, zs in zip(['r', 'k', 'b'], [down_hist, nom_hist, up_hist], [-1.0, 0.0, 1.0]):
        ax1.plot(bincenters, h, color=c, linestyle='-', alpha=0.5, zdir='y', zs=zs)
        ax1.plot(bincenters, h, color=c, linestyle='-', alpha=0.25, zdir='y', zs=1.5)
    ax1.plot(bincenters, z.T, color='g', linestyle='--', zdir='y', zs=alpha)
    ax1.plot(bincenters, z.T, color='g', linestyle='--', alpha=0.5, zdir='y', zs=1.5)
    plt.show()


plot_wire(0.5)
interact(plot_wire, alpha=(-1, 1, 0.1))
```

```
[13]:
```

```
<function __main__.plot_wire>
```

## Tensorizing Interpolators¶

This notebook will introduce some tensor algebra concepts about being able to convert from calculations inside for-loops into a single calculation over the entire tensor. It is assumed that you have some familiarity with what interpolation functions are used for in `pyhf`.

To get started, we'll load up some functions we wrote whose job is to generate sets of histograms and alphas that we will compute interpolations for. This allows us to generate random, structured input data that we can use to test the tensorized form of the interpolation function against the original one we wrote. For now, we will consider only the `numpy` backend for simplicity, but one can replace `np` with `pyhf.tensorlib` to achieve identical functionality.

The function `random_histosets_alphasets_pair` will produce a pair `(histogramssets, alphasets)` of histograms and alphas for those histograms that represents the type of input we wish to interpolate on.

```
[1]:
```

```
import numpy as np
def random_histosets_alphasets_pair(nsysts = 150, nhistos_per_syst_upto = 300, nalphas = 1, nbins_upto = 1):
    def generate_shapes(histogramssets,alphasets):
        h_shape = [len(histogramssets),0,0,0]
        a_shape = (len(alphasets),max(map(len,alphasets)))
        for hs in histogramssets:
            h_shape[1] = max(h_shape[1],len(hs))
            for h in hs:
                h_shape[2] = max(h_shape[2],len(h))
                for sh in h:
                    h_shape[3] = max(h_shape[3],len(sh))
        return tuple(h_shape),a_shape

    def filled_shapes(histogramssets,alphasets):
        # pad our shapes with NaNs
        histos, alphas = generate_shapes(histogramssets,alphasets)
        histos, alphas = np.ones(histos) * np.nan, np.ones(alphas) * np.nan
        for i,syst in enumerate(histogramssets):
            for j,sample in enumerate(syst):
                for k,variation in enumerate(sample):
                    histos[i,j,k,:len(variation)] = variation
        for i,alphaset in enumerate(alphasets):
            alphas[i,:len(alphaset)] = alphaset
        return histos,alphas

    nsyst_histos = np.random.randint(1, 1+nhistos_per_syst_upto, size=nsysts)
    nhistograms = [np.random.randint(1, nbins_upto+1, size=n) for n in nsyst_histos]
    random_alphas = [np.random.uniform(-1, 1,size=nalphas) for n in nsyst_histos]
    random_histogramssets = [
        [# all histos affected by systematic $nh
            [# sample $i, systematic $nh
                np.random.uniform(10*i+j,10*i+j+1, size = nbin).tolist() for j in range(3)
            ] for i,nbin in enumerate(nh)
        ] for nh in nhistograms
    ]
    h,a = filled_shapes(random_histogramssets,random_alphas)
    return h,a
```
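The NaN padding inside `filled_shapes` is worth seeing in isolation: ragged per-systematic lists get embedded into one rectangular array, with NaN marking the unused slots so later comparisons can mask them out. A minimal standalone sketch (the `pad_ragged` helper is ours, purely for illustration):

```python
import numpy as np

def pad_ragged(rows, width):
    """Embed a ragged list of lists into a rectangular array, filling unused slots with NaN."""
    out = np.full((len(rows), width), np.nan)
    for i, row in enumerate(rows):
        out[i, : len(row)] = row
    return out

padded = pad_ragged([[1.0], [2.0, 3.0]], width=2)
# padded[0] is [1., nan]; padded[1] is [2., 3.]
```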

### The (slow) interpolations¶

In all cases, the way we do interpolations is as follows:

1. Loop over both the `histogramssets` and `alphasets` simultaneously (e.g. using python's `zip()`)
2. Loop over all histogram sets in the set of histogram sets that correspond to the histograms affected by a given systematic
3. Loop over all of the alphas in the set of alphas
4. Loop over all the bins in the histogram sets simultaneously (e.g. using python's `zip()`)
5. Apply the interpolation across the same bin index

This is already exhausting to think about, so let's put this in code form. Depending on the kind of interpolation being done, we'll pass in `func` as an argument to the top-level interpolation loop to switch between linear (`interpcode=0`) and non-linear (`interpcode=1`).

```
[2]:
```

```
def interpolation_looper(histogramssets, alphasets, func):
    all_results = []
    for histoset, alphaset in zip(histogramssets, alphasets):
        all_results.append([])
        set_result = all_results[-1]
        for histo in histoset:
            set_result.append([])
            histo_result = set_result[-1]
            for alpha in alphaset:
                alpha_result = []
                for down,nom,up in zip(histo[0],histo[1],histo[2]):
                    v = func(down, nom, up, alpha)
                    alpha_result.append(v)
                histo_result.append(alpha_result)
    return all_results
```

And we can also define our linear and non-linear interpolations we'll consider in this notebook that we wish to tensorize.

```
[3]:
```

```
def interpolation_linear(histogramssets,alphasets):
    def summand(down, nom, up, alpha):
        delta_up = up - nom
        delta_down = nom - down
        if alpha > 0:
            delta = delta_up*alpha
        else:
            delta = delta_down*alpha
        return nom + delta
    return interpolation_looper(histogramssets, alphasets, summand)

def interpolation_nonlinear(histogramssets,alphasets):
    def product(down, nom, up, alpha):
        delta_up = up/nom
        delta_down = down/nom
        if alpha > 0:
            delta = delta_up**alpha
        else:
            delta = delta_down**(-alpha)
        return nom*delta
    return interpolation_looper(histogramssets, alphasets, product)
```

We will also define a helper function that allows us to pass in two functions we wish to compare the outputs for:

```
[4]:
```

```
def compare_fns(func1, func2):
    h,a = random_histosets_alphasets_pair()

    def _func_runner(func, histssets, alphasets):
        return np.asarray(func(histssets,alphasets))

    old = _func_runner(func1, h, a)
    new = _func_runner(func2, h, a)
    return (np.all(old[~np.isnan(old)] == new[~np.isnan(new)]), (h,a))
```

For the rest of the notebook, we will detail in explicit form how the linear interpolator gets tensorized, step-by-step. The same sequence of steps will be shown for the non-linear interpolator, but it is left up to the reader to understand the steps.

### Tensorizing the Linear Interpolator¶

#### Step 0¶

Step 0 requires converting the innermost conditional check on `alpha > 0` into something tensorizable. This also means the calculation itself is going to become tensorized. So we will convert from

```
if alpha > 0:
    delta = delta_up*alpha
else:
    delta = delta_down*alpha
```

to

```
delta = np.where(alpha > 0, delta_up*alpha, delta_down*alpha)
```
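Before moving on, we can convince ourselves that the two forms agree for any sign of `alpha` with a small standalone check (the two helper functions here are ours, defined only for this comparison):

```python
import numpy as np

def delta_branching(delta_up, delta_down, alpha):
    # the original for-loop form
    if alpha > 0:
        return delta_up * alpha
    return delta_down * alpha

def delta_where(delta_up, delta_down, alpha):
    # the tensorizable form
    return np.where(alpha > 0, delta_up * alpha, delta_down * alpha)

for alpha in (-0.7, 0.0, 0.7):
    assert delta_where(5.0, 3.0, alpha) == delta_branching(5.0, 3.0, alpha)
```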

Let's make that change now, and let's check to make sure we still do the calculation correctly.

```
[5]:
```

```
# get the internal calculation to use tensorlib backend
def new_interpolation_linear_step0(histogramssets,alphasets):
    all_results = []
    for histoset, alphaset in zip(histogramssets,alphasets):
        all_results.append([])
        set_result = all_results[-1]
        for histo in histoset:
            set_result.append([])
            histo_result = set_result[-1]
            for alpha in alphaset:
                alpha_result = []
                for down,nom,up in zip(histo[0],histo[1],histo[2]):
                    delta_up = up - nom
                    delta_down = nom - down
                    delta = np.where(alpha > 0, delta_up*alpha, delta_down*alpha)
                    v = nom + delta
                    alpha_result.append(v)
                histo_result.append(alpha_result)
    return all_results
```

And does the calculation still match?

```
[6]:
```

```
result, (h,a) = compare_fns(interpolation_linear, new_interpolation_linear_step0)
print(result)
```

```
True
```

```
[7]:
```

```
%%timeit
interpolation_linear(h,a)
```

```
189 ms ± 6.14 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```

```
[8]:
```

```
%%timeit
new_interpolation_linear_step0(h,a)
```

```
255 ms ± 11.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```

Great! We're a little bit slower right now, but that's expected. We're just getting started.

#### Step 1¶

In this step, we would like to remove the innermost `zip()` call over the histogram bins by calculating the interpolation between the histograms in one fell swoop. This means, instead of writing something like

```
for down,nom,up in zip(histo[0],histo[1],histo[2]):
    delta_up = up - nom
    ...
```

one can instead write

```
delta_up = histo[2] - histo[1]
...
```

taking advantage of the automatic broadcasting of operations on input tensors. This sort of feature of the tensor backends allows us to speed up code, such as interpolation.
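As a quick sanity check of the broadcasting claim, here is a toy histogram (the numbers are made up for illustration) where the vectorized subtraction reproduces the bin-by-bin `zip()` loop exactly:

```python
import numpy as np

# a toy histo: rows are the down, nominal, and up variations over 3 bins
histo = np.array([
    [8.0, 8.5, 9.0],     # down
    [10.0, 11.0, 12.0],  # nominal
    [15.0, 16.0, 17.0],  # up
])

# loop form: one bin at a time
loop_deltas = [up - nom for down, nom, up in zip(histo[0], histo[1], histo[2])]

# broadcast form: the whole histogram at once
vector_deltas = histo[2] - histo[1]

assert np.array_equal(vector_deltas, np.asarray(loop_deltas))
```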

```
[9]:
```

```
# update the delta variations to remove the zip() call and remove most-nested loop
def new_interpolation_linear_step1(histogramssets,alphasets):
    all_results = []
    for histoset, alphaset in zip(histogramssets,alphasets):
        all_results.append([])
        set_result = all_results[-1]
        for histo in histoset:
            set_result.append([])
            histo_result = set_result[-1]
            for alpha in alphaset:
                alpha_result = []
                deltas_up = histo[2]-histo[1]
                deltas_dn = histo[1]-histo[0]
                calc_deltas = np.where(alpha > 0, deltas_up*alpha, deltas_dn*alpha)
                v = histo[1] + calc_deltas
                alpha_result.append(v)
                histo_result.append(alpha_result)
    return all_results
```

And does the calculation still match?

```
[10]:
```

```
result, (h,a) = compare_fns(interpolation_linear, new_interpolation_linear_step1)
print(result)
```

```
True
```

```
[11]:
```

```
%%timeit
interpolation_linear(h,a)
```

```
188 ms ± 7.14 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```

```
[12]:
```

```
%%timeit
new_interpolation_linear_step1(h,a)
```

```
492 ms ± 42.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```

Great!

#### Step 2¶

In this step, we would like to move the giant array of the deltas calculated to the beginning, outside of all loops, and then only take a subset of it for the calculation itself. This allows us to figure out the entire structure of the input for the rest of the calculations as we slowly move towards including `einsum()` calls (Einstein summation). This means we would like to go from

```
for histo in histoset:
    delta_up = histo[2] - histo[1]
    ...
```

to

```
all_deltas = ...
for nh, histo in enumerate(histoset):
    deltas = all_deltas[nh]
    ...
```

Again, we are taking advantage of the automatic broadcasting of operations on input tensors to calculate all the deltas in a single action.
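The hoisting can be checked on a small hypothetical `(set, histogram, variation, bin)` tensor: slicing out every up-delta in one shot agrees with the per-histogram computation done inside the loops (the sizes and values here are arbitrary):

```python
import numpy as np

# hypothetical input: 2 sets, 2 histograms, 3 variations (down/nom/up), 2 bins
histogramssets = np.arange(2 * 2 * 3 * 2, dtype=float).reshape(2, 2, 3, 2)

# all up-deltas at once, for every set and histogram
all_deltas_up = histogramssets[:, :, 2] - histogramssets[:, :, 1]

# the nested loops recover exactly the same rows
for nset in range(2):
    for nh in range(2):
        histo = histogramssets[nset, nh]
        assert np.array_equal(all_deltas_up[nset, nh], histo[2] - histo[1])
```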

```
[13]:
```

```
# figure out the giant array of all deltas at the beginning and only take subsets of it for the calculation
def new_interpolation_linear_step2(histogramssets,alphasets):
    all_results = []
    allset_all_histo_deltas_up = histogramssets[:,:,2] - histogramssets[:,:,1]
    allset_all_histo_deltas_dn = histogramssets[:,:,1] - histogramssets[:,:,0]
    for nset,(histoset, alphaset) in enumerate(zip(histogramssets,alphasets)):
        set_result = []
        all_histo_deltas_up = allset_all_histo_deltas_up[nset]
        all_histo_deltas_dn = allset_all_histo_deltas_dn[nset]
        for nh,histo in enumerate(histoset):
            alpha_deltas = []
            for alpha in alphaset:
                alpha_result = []
                deltas_up = all_histo_deltas_up[nh]
                deltas_dn = all_histo_deltas_dn[nh]
                calc_deltas = np.where(alpha > 0, deltas_up*alpha, deltas_dn*alpha)
                alpha_deltas.append(calc_deltas)
            set_result.append([histo[1]+ d for d in alpha_deltas])
        all_results.append(set_result)
    return all_results
```

And does the calculation still match?

```
[14]:
```

```
result, (h,a) = compare_fns(interpolation_linear, new_interpolation_linear_step2)
print(result)
```

```
True
```

```
[15]:
```

```
%%timeit
interpolation_linear(h,a)
```

```
179 ms ± 12.4 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
```

```
[16]:
```

```
%%timeit
new_interpolation_linear_step2(h,a)
```

```
409 ms ± 20.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```

Great!

#### Step 3¶

In this step, we get to introduce Einstein summation to generalize the calculations we perform across many dimensions in a more concise, straightforward way. In short, Einstein summation notation allows us to express many kinds of common tensor operations, such as dot products, transposes, and outer products, in a much more elegant way. This step is generally the hardest, as one needs to figure out the corresponding `einsum` that keeps the calculation preserved (and matching). To some extent it requires a lot of trial and error until you get a feel for how Einstein summation notation works.
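One way to build that feel is to verify small cases by hand. For instance, the signature `'i,j->ij'` used below is just an outer product, equivalent to `np.outer` (the values here are arbitrary):

```python
import numpy as np

alphaset = np.array([-1.0, 0.5, 1.0])
deltas = np.array([2.0, 4.0])

# 'i,j->ij': every alpha multiplied against every bin delta
outer = np.einsum('i,j->ij', alphaset, deltas)

assert outer.shape == (3, 2)
assert np.array_equal(outer, np.outer(alphaset, deltas))
```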

As a concrete example of a conversion, we wish to go from something like

```
for nh,histo in enumerate(histoset):
    for alpha in alphaset:
        deltas_up = all_histo_deltas_up[nh]
        deltas_dn = all_histo_deltas_dn[nh]
        calc_deltas = np.where(alpha > 0, deltas_up*alpha, deltas_dn*alpha)
        ...
```

to get rid of the loop over `alpha`

```
for nh,histo in enumerate(histoset):
    alphas_times_deltas_up = np.einsum('i,j->ij',alphaset,all_histo_deltas_up[nh])
    alphas_times_deltas_dn = np.einsum('i,j->ij',alphaset,all_histo_deltas_dn[nh])
    masks = np.einsum('i,j->ij',alphaset > 0,np.ones_like(all_histo_deltas_dn[nh]))
    alpha_deltas = np.where(masks,alphas_times_deltas_up, alphas_times_deltas_dn)
    ...
```

In this particular case, we need an outer product that multiplies the `alphaset` against the corresponding `histoset` for the up/down variations. Then we just need to select from either the up variation calculation or the down variation calculation based on the sign of alpha. Try to convince yourself that the Einstein summation does what the for-loop does, but a little bit more concisely, and perhaps more clearly! How does the function look now?
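Here is a tiny self-contained version of that pattern, with made-up numbers, checked against the explicit loop it replaces:

```python
import numpy as np

alphaset = np.array([-1.0, 0.5])
deltas_up = np.array([3.0, 6.0])
deltas_dn = np.array([2.0, 4.0])

# outer products, plus a row mask on the sign of alpha
up_terms = np.einsum('i,j->ij', alphaset, deltas_up)
dn_terms = np.einsum('i,j->ij', alphaset, deltas_dn)
masks = np.einsum('i,j->ij', alphaset > 0, np.ones_like(deltas_dn))
alpha_deltas = np.where(masks, up_terms, dn_terms)

# the explicit loop produces the same tensor
expected = np.array([
    [(du if a > 0 else dd) * a for du, dd in zip(deltas_up, deltas_dn)]
    for a in alphaset
])
assert np.array_equal(alpha_deltas, expected)
```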

```
[17]:
```

```
# remove the loop over alphas, starts using einsum to help generalize to more dimensions
def new_interpolation_linear_step3(histogramssets,alphasets):
    all_results = []
    allset_all_histo_deltas_up = histogramssets[:,:,2] - histogramssets[:,:,1]
    allset_all_histo_deltas_dn = histogramssets[:,:,1] - histogramssets[:,:,0]
    for nset,(histoset, alphaset) in enumerate(zip(histogramssets,alphasets)):
        set_result = []
        all_histo_deltas_up = allset_all_histo_deltas_up[nset]
        all_histo_deltas_dn = allset_all_histo_deltas_dn[nset]
        for nh,histo in enumerate(histoset):
            alphas_times_deltas_up = np.einsum('i,j->ij',alphaset,all_histo_deltas_up[nh])
            alphas_times_deltas_dn = np.einsum('i,j->ij',alphaset,all_histo_deltas_dn[nh])
            masks = np.einsum('i,j->ij',alphaset > 0,np.ones_like(all_histo_deltas_dn[nh]))
            alpha_deltas = np.where(masks,alphas_times_deltas_up, alphas_times_deltas_dn)
            set_result.append([histo[1]+ d for d in alpha_deltas])
        all_results.append(set_result)
    return all_results
```

And does the calculation still match?

```
[18]:
```

```
result, (h,a) = compare_fns(interpolation_linear, new_interpolation_linear_step3)
print(result)
```

```
True
```

```
[19]:
```

```
%%timeit
interpolation_linear(h,a)
```

```
166 ms ± 11.6 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
```

```
[20]:
```

```
%%timeit
new_interpolation_linear_step3(h,a)
```

```
921 ms ± 133 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```

Great! Note that we've been getting a little bit slower during these steps. It will all pay off in the end when we're fully tensorized! A lot of the internal steps are overkill with the heavy Einstein summation and broadcasting at the moment, especially for how many loops in we are.

#### Step 4¶

Now in this step, we will move the Einstein summations to the outer loop, so that we're calculating them once! This is the big step, but a little bit easier because all we're doing is adding extra dimensions into the calculation. The underlying calculation won't have changed. At this point, we'll also rename `i` and `j` to `a` and `b` for `alpha` and `bin` (as in the bin in the histogram). To continue the notation as well, here's a summary of the dimensions involved:

- `s` will be for the set under consideration (e.g. the modifier)
- `a` will be for the alpha variation
- `h` will be for the histogram affected by the modifier
- `b` will be for the bin of the histogram
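With these labels, the shape bookkeeping of the hoisted outer product can be checked on dummy ones-filled tensors: contracting an `(s,a)` tensor with an `(s,h,b)` tensor via `'sa,shb->shab'` yields an `(s,h,a,b)` result (the sizes here are arbitrary):

```python
import numpy as np

nsets, nalphas, nhistos, nbins = 2, 3, 4, 5
alphasets = np.ones((nsets, nalphas))      # (s, a)
deltas = np.ones((nsets, nhistos, nbins))  # (s, h, b)

# one outer product per set: (s,a) x (s,h,b) -> (s,h,a,b)
result = np.einsum('sa,shb->shab', alphasets, deltas)
assert result.shape == (nsets, nhistos, nalphas, nbins)
```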

So we wish to move the `einsum` code from

```
for nset,(histoset, alphaset) in enumerate(zip(histogramssets,alphasets)):
    ...
    for nh,histo in enumerate(histoset):
        alphas_times_deltas_up = np.einsum('i,j->ij',alphaset,all_histo_deltas_up[nh])
        ...
```

to

```
all_alphas_times_deltas_up = np.einsum('...',alphasets,allset_all_histo_deltas_up)
for nset,(histoset, alphaset) in enumerate(zip(histogramssets,alphasets)):
    ...
    for nh,histo in enumerate(histoset):
        ...
```

So how does this new function look?

```
[21]:
```

```
# move the einsums to outer loops to get ready to get rid of all loops
def new_interpolation_linear_step4(histogramssets,alphasets):
    allset_all_histo_deltas_up = histogramssets[:,:,2] - histogramssets[:,:,1]
    allset_all_histo_deltas_dn = histogramssets[:,:,1] - histogramssets[:,:,0]
    allset_all_histo_nom = histogramssets[:,:,1]
    allsets_all_histos_alphas_times_deltas_up = np.einsum('sa,shb->shab',alphasets,allset_all_histo_deltas_up)
    allsets_all_histos_alphas_times_deltas_dn = np.einsum('sa,shb->shab',alphasets,allset_all_histo_deltas_dn)
    allsets_all_histos_masks = np.einsum('sa,s...u->s...au',alphasets > 0,np.ones_like(allset_all_histo_deltas_dn))
    allsets_all_histos_deltas = np.where(allsets_all_histos_masks,allsets_all_histos_alphas_times_deltas_up, allsets_all_histos_alphas_times_deltas_dn)
    all_results = []
    for nset,histoset in enumerate(histogramssets):
        all_histos_deltas = allsets_all_histos_deltas[nset]
        set_result = []
        for nh,histo in enumerate(histoset):
            set_result.append([d + histoset[nh,1] for d in all_histos_deltas[nh]])
        all_results.append(set_result)
    return all_results
```

And does the calculation still match?

```
[22]:
```

```
result, (h,a) = compare_fns(interpolation_linear, new_interpolation_linear_step4)
print(result)
```

```
True
```

```
[23]:
```

```
%%timeit
interpolation_linear(h,a)
```

```
160 ms ± 5 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
```

```
[24]:
```

```
%%timeit
new_interpolation_linear_step4(h,a)
```

```
119 ms ± 3.19 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
```

Great! And look at that huge speed up in time already, just from moving the multiple, heavy Einstein summation calculations up through the loops. We still have some more optimizing to do as we still have explicit loops in our code. Let's keep at it, we're almost there!

#### Step 5¶

The hard part is mostly over. We have to now think about the nominal variations. Recall that we were trying to add the nominals to the deltas in order to compute the new value. In practice, we'll return the delta variation only, but we'll show you how to get rid of this last loop. In this case, we want to figure out how to change code like

```
all_results = []
for nset,histoset in enumerate(histogramssets):
    all_histos_deltas = allsets_all_histos_deltas[nset]
    set_result = []
    for nh,histo in enumerate(histoset):
        set_result.append([d + histoset[nh,1] for d in all_histos_deltas[nh]])
    all_results.append(set_result)
```

to get rid of that most-nested loop

```
all_results = []
for nset,histoset in enumerate(histogramssets):
    # look ma, no more loops inside!
```

So how does this look?

```
[25]:
```

```
# slowly getting rid of our loops to build the right output tensor -- gotta think about nominals
def new_interpolation_linear_step5(histogramssets,alphasets):
    allset_all_histo_deltas_up = histogramssets[:,:,2] - histogramssets[:,:,1]
    allset_all_histo_deltas_dn = histogramssets[:,:,1] - histogramssets[:,:,0]
    allset_all_histo_nom = histogramssets[:,:,1]
    allsets_all_histos_alphas_times_deltas_up = np.einsum('sa,shb->shab',alphasets,allset_all_histo_deltas_up)
    allsets_all_histos_alphas_times_deltas_dn = np.einsum('sa,shb->shab',alphasets,allset_all_histo_deltas_dn)
    allsets_all_histos_masks = np.einsum('sa,s...u->s...au',alphasets > 0,np.ones_like(allset_all_histo_deltas_dn))
    allsets_all_histos_deltas = np.where(allsets_all_histos_masks,allsets_all_histos_alphas_times_deltas_up, allsets_all_histos_alphas_times_deltas_dn)
    all_results = []
    for nset,(_,alphaset) in enumerate(zip(histogramssets,alphasets)):
        all_histos_deltas = allsets_all_histos_deltas[nset]
        noms = histogramssets[nset,:,1]
        all_histos_noms_repeated = np.einsum('a,hn->han',np.ones_like(alphaset),noms)
        set_result = all_histos_deltas + all_histos_noms_repeated
        all_results.append(set_result)
    return all_results
```

And does the calculation still match?

```
[26]:
```

```
result, (h,a) = compare_fns(interpolation_linear, new_interpolation_linear_step5)
print(result)
```

```
True
```

```
[27]:
```

```
%%timeit
interpolation_linear(h,a)
```

```
160 ms ± 8.28 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
```

```
[28]:
```

```
%%timeit
new_interpolation_linear_step5(h,a)
```

```
1.57 ms ± 75.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
```

Fantastic! And look at the speed up. We're already faster than the for-loop and we're not even done yet.

#### Step 6¶

The final frontier. Also probably the best Star Wars episode. In any case, we have one more for-loop that needs to die in a slab of carbonite. This should be much easier now that you're more comfortable with tensor broadcasting and Einstein summations.

What does the function look like now?

```
[29]:
```

```
def new_interpolation_linear_step6(histogramssets,alphasets):
    allset_allhisto_deltas_up = histogramssets[:,:,2] - histogramssets[:,:,1]
    allset_allhisto_deltas_dn = histogramssets[:,:,1] - histogramssets[:,:,0]
    allset_allhisto_nom = histogramssets[:,:,1]
    # x is a dummy index
    allsets_allhistos_alphas_times_deltas_up = np.einsum('sa,shb->shab',alphasets,allset_allhisto_deltas_up)
    allsets_allhistos_alphas_times_deltas_dn = np.einsum('sa,shb->shab',alphasets,allset_allhisto_deltas_dn)
    allsets_allhistos_masks = np.einsum('sa,sxu->sxau',np.where(alphasets > 0, np.ones(alphasets.shape), np.zeros(alphasets.shape)),np.ones(allset_allhisto_deltas_dn.shape))
    allsets_allhistos_deltas = np.where(allsets_allhistos_masks,allsets_allhistos_alphas_times_deltas_up, allsets_allhistos_alphas_times_deltas_dn)
    allsets_allhistos_noms_repeated = np.einsum('sa,shb->shab',np.ones(alphasets.shape),allset_allhisto_nom)
    set_results = allsets_allhistos_deltas + allsets_allhistos_noms_repeated
    return set_results
```

And does the calculation still match?

```
[30]:
```

```
result, (h,a) = compare_fns(interpolation_linear, new_interpolation_linear_step6)
print(result)
```

```
True
```

```
[31]:
```

```
%%timeit
interpolation_linear(h,a)
```

```
156 ms ± 6.29 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
```

```
[32]:
```

```
%%timeit
new_interpolation_linear_step6(h,a)
```

```
468 µs ± 37.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
```

And we're done tensorizing it. There are some more improvements that could be made to make this interpolation calculation even more robust, but for now we're done.
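As an aside, one such simplification: the `einsum` that merely repeats the nominals along the alpha axis can equivalently be written with plain broadcasting via `np.newaxis` (a sketch of an alternative, not necessarily how `pyhf` implements it):

```python
import numpy as np

alphasets = np.ones((2, 3))                                # (s, a)
noms = np.arange(2 * 4 * 5, dtype=float).reshape(2, 4, 5)  # (s, h, b)

# einsum form used above: repeat noms once per alpha
via_einsum = np.einsum('sa,shb->shab', np.ones(alphasets.shape), noms)

# equivalent broadcast form: insert the alpha axis explicitly
via_broadcast = np.broadcast_to(noms[:, :, np.newaxis, :], (2, 4, 3, 5))

assert np.array_equal(via_einsum, via_broadcast)
```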

### Tensorizing the Non-Linear Interpolator¶

This is very, very similar to what we've done for the case of the linear interpolator. As such, we will provide the resulting functions for each step, and you can see how things perform all the way at the bottom. Enjoy and learn at your own pace!

```
[33]:
```

```
def interpolation_nonlinear(histogramssets,alphasets):
    all_results = []
    for histoset, alphaset in zip(histogramssets,alphasets):
        all_results.append([])
        set_result = all_results[-1]
        for histo in histoset:
            set_result.append([])
            histo_result = set_result[-1]
            for alpha in alphaset:
                alpha_result = []
                for down,nom,up in zip(histo[0],histo[1],histo[2]):
                    delta_up = up/nom
                    delta_down = down/nom
                    if alpha > 0:
                        delta = delta_up**alpha
                    else:
                        delta = delta_down**(-alpha)
                    v = nom*delta
                    alpha_result.append(v)
                histo_result.append(alpha_result)
    return all_results

def new_interpolation_nonlinear_step0(histogramssets,alphasets):
    all_results = []
    for histoset, alphaset in zip(histogramssets,alphasets):
        all_results.append([])
        set_result = all_results[-1]
        for histo in histoset:
            set_result.append([])
            histo_result = set_result[-1]
            for alpha in alphaset:
                alpha_result = []
                for down,nom,up in zip(histo[0],histo[1],histo[2]):
                    delta_up = up/nom
                    delta_down = down/nom
                    delta = np.where(alpha > 0, np.power(delta_up, alpha), np.power(delta_down, np.abs(alpha)))
                    v = nom*delta
                    alpha_result.append(v)
                histo_result.append(alpha_result)
    return all_results

def new_interpolation_nonlinear_step1(histogramssets,alphasets):
    all_results = []
    for histoset, alphaset in zip(histogramssets,alphasets):
        all_results.append([])
        set_result = all_results[-1]
        for histo in histoset:
            set_result.append([])
            histo_result = set_result[-1]
            for alpha in alphaset:
                alpha_result = []
                deltas_up = np.divide(histo[2], histo[1])
                deltas_down = np.divide(histo[0], histo[1])
                bases = np.where(alpha > 0, deltas_up, deltas_down)
                exponents = np.abs(alpha)
                calc_deltas = np.power(bases, exponents)
                v = histo[1] * calc_deltas
                alpha_result.append(v)
                histo_result.append(alpha_result)
    return all_results

def new_interpolation_nonlinear_step2(histogramssets,alphasets):
    all_results = []
    allset_all_histo_deltas_up = np.divide(histogramssets[:,:,2], histogramssets[:,:,1])
    allset_all_histo_deltas_dn = np.divide(histogramssets[:,:,0], histogramssets[:,:,1])
    for nset,(histoset, alphaset) in enumerate(zip(histogramssets,alphasets)):
        set_result = []
        all_histo_deltas_up = allset_all_histo_deltas_up[nset]
        all_histo_deltas_dn = allset_all_histo_deltas_dn[nset]
        for nh,histo in enumerate(histoset):
            alpha_deltas = []
            for alpha in alphaset:
                alpha_result = []
                deltas_up = all_histo_deltas_up[nh]
                deltas_down = all_histo_deltas_dn[nh]
                bases = np.where(alpha > 0, deltas_up, deltas_down)
                exponents = np.abs(alpha)
                calc_deltas = np.power(bases, exponents)
                alpha_deltas.append(calc_deltas)
            set_result.append([histo[1]*d for d in alpha_deltas])
        all_results.append(set_result)
    return all_results

def new_interpolation_nonlinear_step3(histogramssets,alphasets):
    all_results = []
    allset_all_histo_deltas_up = np.divide(histogramssets[:,:,2], histogramssets[:,:,1])
    allset_all_histo_deltas_dn = np.divide(histogramssets[:,:,0], histogramssets[:,:,1])
    for nset,(histoset, alphaset) in enumerate(zip(histogramssets,alphasets)):
        set_result = []
        all_histo_deltas_up = allset_all_histo_deltas_up[nset]
        all_histo_deltas_dn = allset_all_histo_deltas_dn[nset]
        for nh,histo in enumerate(histoset):
            # bases and exponents need to have an outer product, to essentially tile or repeat over rows/cols
            bases_up = np.einsum('a,b->ab', np.ones(alphaset.shape), all_histo_deltas_up[nh])
            bases_dn = np.einsum('a,b->ab', np.ones(alphaset.shape), all_histo_deltas_dn[nh])
            exponents = np.einsum('a,b->ab', np.abs(alphaset), np.ones(all_histo_deltas_up[nh].shape))
            masks = np.einsum('a,b->ab',alphaset > 0,np.ones(all_histo_deltas_dn[nh].shape))
            bases = np.where(masks, bases_up, bases_dn)
            alpha_deltas = np.power(bases, exponents)
            set_result.append([histo[1]*d for d in alpha_deltas])
        all_results.append(set_result)
    return all_results

def new_interpolation_nonlinear_step4(histogramssets,alphasets):
    all_results = []
    allset_all_histo_nom = histogramssets[:,:,1]
    allset_all_histo_deltas_up = np.divide(histogramssets[:,:,2], allset_all_histo_nom)
    allset_all_histo_deltas_dn = np.divide(histogramssets[:,:,0], allset_all_histo_nom)
    bases_up = np.einsum('sa,shb->shab', np.ones(alphasets.shape), allset_all_histo_deltas_up)
    bases_dn = np.einsum('sa,shb->shab', np.ones(alphasets.shape), allset_all_histo_deltas_dn)
    exponents = np.einsum('sa,shb->shab', np.abs(alphasets), np.ones(allset_all_histo_deltas_up.shape))
    masks = np.einsum('sa,shb->shab',alphasets > 0,np.ones(allset_all_histo_deltas_up.shape))
    bases = np.where(masks, bases_up, bases_dn)
    allsets_all_histos_deltas = np.power(bases, exponents)
    all_results = []
    for nset,histoset in enumerate(histogramssets):
        all_histos_deltas = allsets_all_histos_deltas[nset]
        set_result = []
        for nh,histo in enumerate(histoset):
            set_result.append([histoset[nh,1]*d for d in all_histos_deltas[nh]])
        all_results.append(set_result)
    return all_results

def new_interpolation_nonlinear_step5(histogramssets,alphasets):
    all_results = []
    allset_all_histo_nom = histogramssets[:,:,1]
    allset_all_histo_deltas_up = np.divide(histogramssets[:,:,2], allset_all_histo_nom)
    allset_all_histo_deltas_dn = np.divide(histogramssets[:,:,0], allset_all_histo_nom)
    bases_up = np.einsum('sa,shb->shab', np.ones(alphasets.shape), allset_all_histo_deltas_up)
    bases_dn = np.einsum('sa,shb->shab', np.ones(alphasets.shape), allset_all_histo_deltas_dn)
    exponents = np.einsum('sa,shb->shab', np.abs(alphasets), np.ones(allset_all_histo_deltas_up.shape))
    masks = np.einsum('sa,shb->shab',alphasets > 0,np.ones(allset_all_histo_deltas_up.shape))
    bases = np.where(masks, bases_up, bases_dn)
    allsets_all_histos_deltas = np.power(bases, exponents)
    all_results = []
    for nset,(_,alphaset) in enumerate(zip(histogramssets,alphasets)):
        all_histos_deltas = allsets_all_histos_deltas[nset]
        noms = allset_all_histo_nom[nset]
        all_histos_noms_repeated = np.einsum('a,hn->han',np.ones_like(alphaset),noms)
        set_result = all_histos_deltas * all_histos_noms_repeated
        all_results.append(set_result)
    return all_results

def new_interpolation_nonlinear_step6(histogramssets,alphasets):
    all_results = []
    allset_all_histo_nom = histogramssets[:,:,1]
    allset_all_histo_deltas_up = np.divide(histogramssets[:,:,2], allset_all_histo_nom)
    allset_all_histo_deltas_dn = np.divide(histogramssets[:,:,0], allset_all_histo_nom)
    bases_up = np.einsum('sa,shb->shab', np.ones(alphasets.shape), allset_all_histo_deltas_up)
    bases_dn = np.einsum('sa,shb->shab', np.ones(alphasets.shape), allset_all_histo_deltas_dn)
    exponents = np.einsum('sa,shb->shab', np.abs(alphasets), np.ones(allset_all_histo_deltas_up.shape))
    masks = np.einsum('sa,shb->shab',alphasets > 0,np.ones(allset_all_histo_deltas_up.shape))
    bases = np.where(masks, bases_up, bases_dn)
    allsets_all_histos_deltas = np.power(bases, exponents)
    allsets_allhistos_noms_repeated = np.einsum('sa,shb->shab', np.ones(alphasets.shape), allset_all_histo_nom)
    set_results = allsets_all_histos_deltas * allsets_allhistos_noms_repeated
    return set_results
```

```
[34]:
```

```
result, (h,a) = compare_fns(interpolation_nonlinear, new_interpolation_nonlinear_step0)
print(result)
```

```
True
```

```
[35]:
```

```
%%timeit
interpolation_nonlinear(h,a)
```

```
149 ms ± 9.45 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
```

```
[36]:
```

```
%%timeit
new_interpolation_nonlinear_step0(h,a)
```

```
527 ms ± 29.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```

```
[37]:
```

```
result, (h,a) = compare_fns(interpolation_nonlinear, new_interpolation_nonlinear_step1)
print(result)
```

```
True
```

```
[38]:
```

```
%%timeit
interpolation_nonlinear(h,a)
```

```
150 ms ± 5.21 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
```

```
[39]:
```

```
%%timeit
new_interpolation_nonlinear_step1(h,a)
```

```
456 ms ± 17.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```

```
[40]:
```

```
result, (h,a) = compare_fns(interpolation_nonlinear, new_interpolation_nonlinear_step2)
print(result)
```

```
True
```

```
[41]:
```

```
%%timeit
interpolation_nonlinear(h,a)
```

```
154 ms ± 4.49 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
```

```
[42]:
```

```
%%timeit
new_interpolation_nonlinear_step2(h,a)
```

```
412 ms ± 31 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```

```
[43]:
```

```
result, (h,a) = compare_fns(interpolation_nonlinear, new_interpolation_nonlinear_step3)
print(result)
```

```
True
```

```
[44]:
```

```
%%timeit
interpolation_nonlinear(h,a)
```

```
145 ms ± 5.15 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
```

```
[45]:
```

```
%%timeit
new_interpolation_nonlinear_step3(h,a)
```

```
1.28 s ± 74.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```

```
[46]:
```

```
result, (h,a) = compare_fns(interpolation_nonlinear, new_interpolation_nonlinear_step4)
print(result)
```

```
True
```

```
[47]:
```

```
%%timeit
interpolation_nonlinear(h,a)
```

```
147 ms ± 8.4 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
```

```
[48]:
```

```
%%timeit
new_interpolation_nonlinear_step4(h,a)
```

```
120 ms ± 3.06 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
```

```
[49]:
```

```
result, (h,a) = compare_fns(interpolation_nonlinear, new_interpolation_nonlinear_step5)
print(result)
```

```
True
```

```
[50]:
```

```
%%timeit
interpolation_nonlinear(h,a)
```

```
151 ms ± 5.29 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
```

```
[51]:
```

```
%%timeit
new_interpolation_nonlinear_step5(h,a)
```

```
2.65 ms ± 57.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
```

```
[52]:
```

```
result, (h,a) = compare_fns(interpolation_nonlinear, new_interpolation_nonlinear_step6)
print(result)
```

```
True
```

```
[53]:
```

```
%%timeit
interpolation_nonlinear(h,a)
```

```
156 ms ± 3.35 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
```

```
[54]:
```

```
%%timeit
new_interpolation_nonlinear_step6(h,a)
```

```
1.49 ms ± 16 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
```

# Examples¶

Notebooks:

## Hello World, `pyhf` style¶

**Two bin counting experiment with a background uncertainty**

```
[1]:
```

```
import pyhf
```

**Returning the observed and expected** \(\mathrm{CL}_{s}\)

```
[2]:
```

```
pdf = pyhf.simplemodels.hepdata_like(signal_data=[12.0, 11.0], bkg_data=[50.0, 52.0], bkg_uncerts=[3.0, 7.0])
CLs_obs, CLs_exp = pyhf.utils.hypotest(1.0, [51, 48] + pdf.config.auxdata, pdf, return_expected=True)
print('Observed: {}, Expected: {}'.format(CLs_obs, CLs_exp))
```

```
Observed: [0.05290116], Expected: [0.06445521]
```

**Returning the observed** \(\mathrm{CL}_{s}\), \(\mathrm{CL}_{s+b}\), **and** \(\mathrm{CL}_{b}\)

```
[3]:
```

```
CLs_obs, p_values = pyhf.utils.hypotest(1.0, [51, 48] + pdf.config.auxdata, pdf, return_tail_probs=True)
print('Observed CL_s: {}, CL_sb: {}, CL_b: {}'.format(CLs_obs, p_values[0], p_values[1]))
```

```
Observed CL_s: [0.05290116], CL_sb: [0.0236], CL_b: [0.44611493]
```

A reminder that \(\mathrm{CL}_{s} = \mathrm{CL}_{s+b} / \mathrm{CL}_{b}\):

```
[4]:
```

```
assert CLs_obs == p_values[0]/p_values[1]
```

**Returning the expected** \(\mathrm{CL}_{s}\) **band values**

```
[5]:
```

```
import numpy as np
```

```
[6]:
```

```
CLs_obs, CLs_exp_band = pyhf.utils.hypotest(1.0, [51, 48] + pdf.config.auxdata, pdf, return_expected_set=True)
print('Observed CL_s: {}\n'.format(CLs_obs))
for i, n_sigma in enumerate(np.arange(-2,3)):
    print('Expected CL_s{}: {}'.format(' ' if n_sigma==0 else '({} σ)'.format(n_sigma), CLs_exp_band[i]))
```

```
Observed CL_s: [0.05290116]
Expected CL_s(-2 σ): [0.00260641]
Expected CL_s(-1 σ): [0.01382066]
Expected CL_s : [0.06445521]
Expected CL_s(1 σ): [0.23526104]
Expected CL_s(2 σ): [0.57304182]
```

**Returning the test statistics for the observed and Asimov data**

```
[7]:
```

```
CLs_obs, test_statistics = pyhf.utils.hypotest(1.0, [51, 48] + pdf.config.auxdata, pdf, return_test_statistics=True)
print('q_mu: {}, Asimov q_mu: {}'.format(test_statistics[0], test_statistics[1]))
```

```
q_mu: [3.93824492], Asimov q_mu: [3.41886758]
```
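These two test statistics are enough to reproduce the \(\mathrm{CL}\) values printed above by hand, using the asymptotic formulae of Cowan, Cranmer, Gross, and Vitells [arXiv:1007.1727]. The following sketch assumes `scipy` is available; the numerical inputs are the `q_mu` values from the cell above:

```python
import numpy as np
from scipy.stats import norm

# Test statistics printed above (observed and Asimov)
q_mu, q_mu_A = 3.93824492, 3.41886758

# Observed tail probabilities (asymptotic approximation)
CLsb = 1 - norm.cdf(np.sqrt(q_mu))               # p-value of the s+b hypothesis
CLb = norm.cdf(np.sqrt(q_mu_A) - np.sqrt(q_mu))  # p-value of the b-only hypothesis
CLs = CLsb / CLb
print('CL_sb: {:.4f}, CL_b: {:.4f}, CL_s: {:.4f}'.format(CLsb, CLb, CLs))

# Expected CL_s band, computed from the Asimov test statistic alone
for n_sigma in range(-2, 3):
    CLs_exp = (1 - norm.cdf(np.sqrt(q_mu_A) - n_sigma)) / norm.cdf(n_sigma)
    print('Expected CL_s({:+d} sigma): {:.5f}'.format(n_sigma, CLs_exp))
```

The printed values agree with the observed \(\mathrm{CL}_{s}\), tail probabilities, and expected band shown in the earlier cells.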

```
[1]:
```

```
%pylab inline
```

```
Populating the interactive namespace from numpy and matplotlib
```

```
[2]:
```

```
import os
import pyhf
import pyhf.readxml
from ipywidgets import interact, fixed
```

## Binned HEP Statistical Analysis in Python¶

### HistFactory¶

HistFactory is a popular framework for analyzing binned event data, commonly used in High Energy Physics. At its core, it is a template for building a statistical model from individual binned distributions ("Histograms") and variations on them ("Systematics") that represent auxiliary measurements (for example, an energy scale of the detector, which affects the shape of a distribution).
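The core idea, an expected event rate per bin that is the sum over sample histograms, scaled by model parameters, can be sketched in a few lines of NumPy. The histograms and the `mu` normalization factor below are illustrative values, not taken from any particular analysis:

```python
import numpy as np

# Illustrative nominal histograms for one channel with two bins
signal = np.array([12.0, 11.0])      # signal sample
background = np.array([50.0, 52.0])  # background sample

def expected_counts(mu):
    """Expected event rate per bin: the sum over samples,
    with the signal scaled by the free parameter mu."""
    return mu * signal + background

print(expected_counts(1.0))  # nominal signal + background
print(expected_counts(0.0))  # background-only hypothesis
```

In the full template, each sample additionally carries parametrised systematic variations with associated constraint terms; this sketch shows only the unconstrained normalization piece.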

### pyhf¶

`pyhf` is a work-in-progress standalone implementation of the HistFactory p.d.f. template and an implementation of the test statistics and asymptotic formulae described in the paper by Cowan, Cranmer, Gross, and Vitells, *Asymptotic formulae for likelihood-based tests of new physics* [arXiv:1007.1727].

Models can be defined using a JSON specification, but existing models based on the XML + ROOT file scheme are readable as well.
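As a sketch of what such a JSON specification might look like, the following mirrors the two-bin counting experiment used earlier; the channel, sample, and modifier names here are illustrative:

```python
import json

# A minimal HistFactory-style specification: one channel,
# a signal sample with a free normalization factor, and a
# background sample with per-bin (shape) uncertainties.
spec = json.loads("""
{
  "channels": [
    { "name": "singlechannel",
      "samples": [
        { "name": "signal",
          "data": [12.0, 11.0],
          "modifiers": [ {"name": "mu", "type": "normfactor", "data": null} ]
        },
        { "name": "background",
          "data": [50.0, 52.0],
          "modifiers": [ {"name": "uncorr_bkguncrt", "type": "shapesys", "data": [3.0, 7.0]} ]
        }
      ]
    }
  ]
}
""")

print(spec['channels'][0]['name'])
```

Being plain JSON, such a specification is straightforward to archive, diff, and exchange, which is the motivation for the declarative format described in the introduction.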

### The Demo¶

The input data for the statistical analysis was generated using the containerized workflow engine yadage (see the demo from KubeCon 2018 [youtube]). Similar to Binder, this utilizes modern container technology for reproducible science. Below you see the execution graph leading up to the model input data at the bottom.

```
[3]:
```

```
import base64
from IPython.core.display import display, HTML
anim = base64.b64encode(open('workflow.gif','rb').read()).decode('ascii')
HTML('<img src="data:image/gif;base64,{}">'.format(anim))
```

```
[3]:
```