perfume: Interactive performance benchmarking in Jupyter¶
perfume¶
Interactive performance benchmarking in Jupyter
- Free software: BSD license
- Documentation: https://perfume.readthedocs.io.
Overview¶
perfume is a performance benchmarking tool that provides quick feedback on the systems under test.
The primary goals are:
- Prioritize analysis of distributions of latency, not averages.
- Support both immediate feedback and robust benchmarking with many samples, through a UI that updates as we collect more information.
- Provide raw data back to the user, for flexible custom analysis.
- Provide helpful post-processing analysis and charting tools.
Features¶
- Live-updating histogram chart and descriptive statistics during a benchmark run.
- Jupyter notebook integration.
- Tunable benchmarking overhead.
- Comparative analysis of multiple functions under test.
- Powerful post-processing analysis tools.
Installing¶
pip install perfume-bench
Credits¶
This package was created with Cookiecutter and the audreyr/cookiecutter-pypackage project template.
Installation¶
Stable release¶
To install perfume, run this command in your terminal:
$ pip install perfume-bench
This is the preferred method to install perfume, as it will always install the most recent stable release.
If you don’t have pip installed, this Python installation guide can walk you through the process.
From sources¶
The sources for perfume can be downloaded from the Github repo.
You can either clone the public repository:
$ git clone https://github.com/leifwalsh/perfume
Or download the tarball:
$ curl -OL https://github.com/leifwalsh/perfume/tarball/master
Once you have a copy of the source, you can install it with:
$ python setup.py install
Usage¶
To use perfume in a project:
import perfume
# In a Jupyter notebook, you'll want these to get the plots to
# display inline
import bokeh.io
bokeh.io.output_notebook()
Collecting samples¶
The entry point to perfume is perfume.bench(), which takes a list of Python functions and benchmarks them together. While it runs, it displays, and keeps updating:
- A plot showing histograms of sampled latencies for each function, with a kernel density estimate and the 25th, 50th, and 75th percentiles.
- A table of pandas.DataFrame.describe() output for the sampled timings.
- Two pairwise Kolmogorov-Smirnov test tables, one on the raw samples and the other more sensitive to outliers (more on how later).

Analyzing results¶
When you run perfume.bench(), it returns a DataFrame of the samples, represented as wall-clock begin and end times. There are several built-in ways to interpret these, collected in perfume.analyze; a few examples:
- perfume.analyze.timings() interprets the begin/end times in the samples DataFrame and computes the differences between end and begin, giving you a new DataFrame containing elapsed time values.
- perfume.analyze.isolate() adjusts each function’s begin/end times to be as if that function had been benchmarked in isolation, so each begin matches the previous end. This can be interpreted as “time to simulated completion” of a given fixed-size workload.
- perfume.analyze.ks_test() runs the Kolmogorov-Smirnov test across the benchmarked functions (given output from perfume.analyze.timings()), giving pairwise results. The 2-sample K-S test measures how different two distributions are (briefly, it is the largest vertical distance between their ECDFs). For a function you think you’ve optimized, you can run the old and new versions and use the K-S test to get a sense of how likely it is that you’ve made a consistent improvement.
- perfume.analyze.cumulative_quantiles_plot() creates a plot over time of the median, the 25th and 75th percentiles, and the min and max, for all samples collected up to that point in simulated time. Each function is charted with its timings in isolation (see perfume.analyze.isolate()), so faster functions cover less of the x-axis.
See perfume.analyze for the full set of analysis tools.
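Continuing the baseline/candidate sketch from above, you could then ask whether candidate is consistently faster than baseline:

import perfume.analyze

# Pairwise K-S Z values; a large Z suggests the two latency
# distributions genuinely differ.
perfume.analyze.ks_test(perfume.analyze.timings(samples))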
Example notebook¶
In [1]:
import perfume
import perfume.analyze
import pandas as pd
import bokeh.io
bokeh.io.output_notebook()
Setup¶
To start, set up some functions to benchmark.
In [2]:
import time

import numpy as np

def test_fn_1():
    good = np.random.poisson(20)
    bad = np.random.poisson(100)
    msec = np.random.choice([good, bad], p=[.99, .01])
    time.sleep(msec / 3000.)

def test_fn_1_no_outliers():
    time.sleep(np.random.poisson(20) / 3000.)

def test_fn_2():
    good = np.random.poisson(5)
    bad = np.random.poisson(150)
    msec = np.random.choice([good, bad], p=[.95, .05])
    time.sleep(msec / 3000.)

def test_fn_3():
    msec = max(1, np.random.normal(100, 10))
    time.sleep(msec / 3000.)

numbers = np.arange(0, 1, 1. / (3 * 5000000))

def test_fn_4():
    return np.sum(numbers)

# Create a variable named "samples", in this cell. This way,
# if we change these functions, we'll reset the samples so we
# don't use old data with changed implementations.
samples = None
Benchmark¶
Run the benchmark for a while by executing this cell. Since we capture the output data in samples, and pass it back in as an argument, you can interrupt the cell, take a look at the output so far, and then execute this cell again to resume the benchmark.
In [3]:
samples = perfume.bench(test_fn_1, test_fn_2, test_fn_3, test_fn_4,
                        samples=samples)
Descriptive statistics of the timings so far (milliseconds):

function | test_fn_1 | test_fn_2 | test_fn_3 | test_fn_4
---|---|---|---|---
count | 158 | 158 | 158 | 158
mean | 7.1 | 3.42 | 33.5 | 13.5
std | 2.38 | 8.59 | 3.04 | 2.19
min | 3.28 | 0.548 | 25.7 | 8.19
25% | 5.91 | 1.34 | 31.4 | 12.9
50% | 6.9 | 1.84 | 33.6 | 14.2
75% | 7.97 | 2.35 | 35.7 | 14.9
max | 30 | 56.7 | 41.3 | 16.9
K-S test Z (raw samples) | test_fn_2 | test_fn_3 | test_fn_4
---|---|---|---
test_fn_1 | 8.49 | 8.83 | 7.65
test_fn_2 | nan | 8.61 | 8.55
test_fn_3 | nan | nan | 8.89
K-S test Z (bucket-resampled) | test_fn_2 | test_fn_3 | test_fn_4
---|---|---|---
test_fn_1 | 16 | 22 | 22
test_fn_2 | nan | 22 | 22
test_fn_3 | nan | nan | 22
Analyzing the samples¶
Let’s look at the format of the output: each function execution gets its begin and end times recorded:
In [4]:
samples.head()
Out[4]:
function | test_fn_1 | test_fn_1 | test_fn_2 | test_fn_2 | test_fn_3 | test_fn_3 | test_fn_4 | test_fn_4
---|---|---|---|---|---|---|---|---|
timing | begin | end | begin | end | begin | end | begin | end |
0 | 6.668990e+07 | 6.668991e+07 | 6.668991e+07 | 6.668991e+07 | 6.668991e+07 | 6.668994e+07 | 6.668994e+07 | 6.668995e+07 |
1 | 6.668995e+07 | 6.668995e+07 | 6.668995e+07 | 6.668995e+07 | 6.668995e+07 | 6.668998e+07 | 6.668998e+07 | 6.668999e+07 |
2 | 6.668999e+07 | 6.669000e+07 | 6.669000e+07 | 6.669000e+07 | 6.669000e+07 | 6.669004e+07 | 6.669004e+07 | 6.669004e+07 |
3 | 6.669004e+07 | 6.669005e+07 | 6.669005e+07 | 6.669006e+07 | 6.669006e+07 | 6.669009e+07 | 6.669009e+07 | 6.669010e+07 |
4 | 6.669010e+07 | 6.669011e+07 | 6.669011e+07 | 6.669011e+07 | 6.669011e+07 | 6.669014e+07 | 6.669014e+07 | 6.669015e+07 |
One thing we can do is plot each function’s distribution as it develops over simulated time:
In [5]:
perfume.analyze.cumulative_quantiles_plot(samples)
We can run a K-S test and see whether our functions are significantly different:
In [6]:
perfume.analyze.ks_test(perfume.analyze.timings(samples))
Out[6]:
K-S test Z | test_fn_2 | test_fn_3 | test_fn_4
---|---|---|---
test_fn_1 | 8.494414 | 8.831940 | 7.650598
test_fn_2 | NaN | 8.606922 | 8.550668
test_fn_3 | NaN | NaN | 8.888194
We can convert the samples to elapsed timings instead of begin/end time points, resample them in buckets so that outliers show a stronger presence, or isolate them as if each function had run by itself:
In [7]:
timings = perfume.analyze.timings(samples)
bt = perfume.analyze.bucket_resample_timings(samples)
isolated = perfume.analyze.isolate(samples)
isolated.head()
Out[7]:
function | test_fn_1 | test_fn_1 | test_fn_2 | test_fn_2 | test_fn_3 | test_fn_3 | test_fn_4 | test_fn_4
---|---|---|---|---|---|---|---|---|
timing | begin | end | begin | end | begin | end | begin | end |
0 | 0.000000 | 7.879083 | 0.000000 | 1.532343 | 0.000000 | 29.188473 | 0.000000 | 9.990919 |
1 | 7.879083 | 13.432623 | 1.532343 | 2.720641 | 29.188473 | 59.487223 | 9.990919 | 19.318506 |
2 | 13.432623 | 21.638016 | 2.720641 | 3.269045 | 59.487223 | 93.088024 | 19.318506 | 28.697242 |
3 | 21.638016 | 29.194047 | 3.269045 | 6.157080 | 93.088024 | 128.720489 | 28.697242 | 39.406731 |
4 | 29.194047 | 34.441334 | 6.157080 | 7.378798 | 128.720489 | 160.600259 | 39.406731 | 48.569794 |
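Since isolate() packs each function’s samples end to end, the last end value per function can be read as its time to simulated completion for this fixed-size workload. A small sketch, assuming the column levels are named function and timing as shown above:

# Final simulated end time per function, in milliseconds.
isolated.xs('end', axis=1, level='timing').iloc[-1]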
With these DataFrames and other charting libraries, you can do whatever you want with the data:
In [8]:
from bokeh import palettes
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
fig, ax = plt.subplots(figsize=(16, 9))
for col, color in zip(timings.columns, palettes.Set1[len(timings.columns)]):
    sns.distplot(timings[col], label=col, color=color, ax=ax,
                 # hist_kws=dict(cumulative=True),
                 # kde_kws=dict(cumulative=True)
                 )
ax.set_xlabel('millis')
ax.legend()
timings.describe()
Out[8]:
function | test_fn_1 | test_fn_2 | test_fn_3 | test_fn_4 |
---|---|---|---|---|
count | 158.000000 | 158.000000 | 158.000000 | 158.000000 |
mean | 7.099615 | 3.423204 | 33.526883 | 13.455989 |
std | 2.382655 | 8.588543 | 3.039051 | 2.193190 |
min | 3.280499 | 0.548404 | 25.672770 | 8.190510 |
25% | 5.908776 | 1.335742 | 31.432720 | 12.909524 |
50% | 6.904165 | 1.840656 | 33.613588 | 14.226354 |
75% | 7.966824 | 2.353760 | 35.680228 | 14.863801 |
max | 29.985672 | 56.681542 | 41.308623 | 16.934982 |

In [9]:
import matplotlib.pyplot as plt
timings['test_fn_1'].hist(cumulative=True, normed=1, alpha=0.3)
timings['test_fn_2'].hist(cumulative=True, normed=1, alpha=0.3)
Out[9]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f94c9fa8f28>

In [10]:
import matplotlib.pyplot as plt
bt['test_fn_1'].hist(cumulative=True, normed=1, alpha=0.3)
bt['test_fn_2'].hist(cumulative=True, normed=1, alpha=0.3)
Out[10]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f94c9d21a58>

In [11]:
sns.pairplot(timings)  # optionally: diag_kws={'cumulative': True}
Out[11]:
<seaborn.axisgrid.PairGrid at 0x7f94c9ad46a0>

In [12]:
import scipy.stats
bt = perfume.analyze.bucket_resample_timings(samples)
(scipy.stats.ks_2samp(timings['test_fn_1'], timings['test_fn_2']),
scipy.stats.ks_2samp(bt['test_fn_1'], bt['test_fn_2']))
Out[12]:
(Ks_2sampResult(statistic=0.95569620253164556, pvalue=5.5872505324246181e-65),
Ks_2sampResult(statistic=0.71199999999999997, pvalue=4.691029271698989e-223))
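perfume.analyze.bucket_resample_timings() is only lightly documented; judging from its parameters (sample_size=10, agg=mean, sample_count=1000) and the stronger outlier presence it produced above, a plausible reading is that it repeatedly draws small buckets of timings and aggregates each draw. A rough sketch of that idea, not the library’s actual implementation:

import numpy as np

def bucket_resample(values, sample_size=10, agg=np.mean, sample_count=1000):
    # Draw sample_count buckets of sample_size observations (with
    # replacement) and aggregate each one.  An outlier lands in many
    # buckets, shifting many aggregated values, which is why the
    # resampled K-S test reacts more strongly to outliers.
    values = np.asarray(values)
    buckets = np.random.choice(values, size=(sample_count, sample_size))
    return np.array([agg(bucket) for bucket in buckets])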
perfume¶
perfume package¶
Submodules¶
perfume.analyze module¶
perfume.analyze contains transformation and analysis tools. These functions mostly take as input the samples collected by perfume.bench().
- perfume.analyze.bucket_resample_timings(samples, sample_size=10, agg=<function mean>, sample_count=1000)
- perfume.analyze.cumulative_quantiles(samples, rng=None)
  Computes “cumulative quantiles” for each function. That is, for each time, what are the extremes, median, and 25th/75th percentiles for all observations up until that point.
- perfume.analyze.cumulative_quantiles_plot(samples, plot_width=960, plot_height=480, show_samples=True)
  Plots the cumulative quantiles along with a scatter plot of observations.
- perfume.analyze.isolate(samples)
  For each function, isolates its begin and end times. Within each function’s begins and ends, each begin will be equal to the previous end. This gives a sequence of begins and ends as if each function were run in isolation with no benchmarking overhead.
- perfume.analyze.ks_test(t)
  Runs the Kolmogorov-Smirnov test across functions and returns a DataFrame containing all pairwise K-S test results.
  The standard K-S test computes \(D\), the maximum difference between the empirical CDFs. The value returned here is \(Z\), defined as
  \[Z = \frac{D}{\sqrt{(n + m) / (nm)}}\]
  where \(n\) and \(m\) are the respective sample sizes. \(Z\) is typically interpreted using a lookup table: for confidence level \(\alpha\), we want to see a \(Z\) greater than \(c(\alpha)\).

  \(\alpha\) | 0.10 | 0.05 | 0.025 | 0.01 | 0.005 | 0.001
  ---|---|---|---|---|---|---
  \(c(\alpha)\) | 1.22 | 1.36 | 1.48 | 1.63 | 1.73 | 1.95
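As a quick sanity check, the Z values in the example notebook follow from this formula. scipy reported \(D \approx 0.9557\) for test_fn_1 vs test_fn_2, with \(n = m = 158\) samples each:

import math

D, n, m = 0.9557, 158, 158
Z = D / math.sqrt((n + m) / (n * m))
print(Z)  # ≈ 8.494, matching the ks_test table above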
perfume.perfume module¶
Main module.
- perfume.perfume.bench(*fns, samples=None, efficiency=0.9)
  Benchmarks functions, displaying results in a Jupyter notebook.
  Runs fns repeatedly, collecting timing information, until KeyboardInterrupt is raised, at which point benchmarking stops and the results so far are returned.
  Parameters:
  - fns (list of callable) – A list of functions to benchmark and compare.
  - samples (pandas.DataFrame) – Optionally, pass the results of a previous call to bench() to continue from its already collected data.
  - efficiency (float) – Number between 0 and 1: the target fraction of time spent running the functions under test (so up to \(1 - efficiency\) of the time is spent analyzing and rendering plots).
  Returns: A dataframe containing the results so far. The row index is just an autoincrement integer, and the column index is a MultiIndex where the first level is the function name and the second level is begin or end.
  Return type: pandas.DataFrame
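For example, to trade benchmarking throughput for more frequent display updates, you might lower efficiency (the value here is illustrative):

samples = perfume.bench(test_fn_1, test_fn_2, efficiency=0.75)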
Module contents¶
Top-level package for perfume.
- perfume.bench(*fns, samples=None, efficiency=0.9)
  Re-exported from perfume.perfume.bench(); see above.
Contributing¶
Contributions are welcome, and they are greatly appreciated! Every little bit helps, and credit will always be given.
You can contribute in many ways:
Types of Contributions¶
Report Bugs¶
Report bugs at https://github.com/leifwalsh/perfume/issues.
If you are reporting a bug, please include:
- Your operating system name and version.
- Any details about your local setup that might be helpful in troubleshooting.
- Detailed steps to reproduce the bug.
Fix Bugs¶
Look through the GitHub issues for bugs. Anything tagged with “bug” and “help wanted” is open to whoever wants to implement it.
Implement Features¶
Look through the GitHub issues for features. Anything tagged with “enhancement” and “help wanted” is open to whoever wants to implement it.
Write Documentation¶
perfume could always use more documentation, whether as part of the official perfume docs, in docstrings, or even on the web in blog posts, articles, and such.
Submit Feedback¶
The best way to send feedback is to file an issue at https://github.com/leifwalsh/perfume/issues.
If you are proposing a feature:
- Explain in detail how it would work.
- Keep the scope as narrow as possible, to make it easier to implement.
- Remember that this is a volunteer-driven project, and that contributions are welcome :)
Get Started!¶
Ready to contribute? Here’s how to set up perfume for local development.
1. Fork the perfume repo on GitHub.
2. Clone your fork locally:
   $ git clone git@github.com:your_name_here/perfume.git
3. Install your local copy into a virtualenv. Assuming you have virtualenvwrapper installed, this is how you set up your fork for local development:
   $ mkvirtualenv perfume
   $ cd perfume/
   $ python setup.py develop
4. Create a branch for local development:
   $ git checkout -b name-of-your-bugfix-or-feature
   Now you can make your changes locally.
5. When you’re done making changes, check that your changes pass flake8 and the tests, including testing other Python versions with tox:
   $ flake8 perfume tests
   $ python setup.py test or py.test
   $ tox
   To get flake8 and tox, just pip install them into your virtualenv.
6. Commit your changes and push your branch to GitHub:
   $ git add .
   $ git commit -m "Your detailed description of your changes."
   $ git push origin name-of-your-bugfix-or-feature
7. Submit a pull request through the GitHub website.
Pull Request Guidelines¶
Before you submit a pull request, check that it meets these guidelines:
- The pull request should include tests.
- If the pull request adds functionality, the docs should be updated. Put your new functionality into a function with a docstring, and add the feature to the list in README.rst.
- The pull request should work for Python 2.6, 2.7, 3.3, 3.4 and 3.5, and for PyPy. Check https://travis-ci.org/leifwalsh/perfume/pull_requests and make sure that the tests pass for all supported Python versions.
Credits¶
Development Lead¶
- Leif Walsh <leif.walsh@gmail.com>
Contributors¶
None yet. Why not be the first?
History¶
0.1.3 (2017-09-10)¶
- Actually fix when only benchmarking one function (no K-S test) (oops).
0.1.2 (2017-09-10)¶
- Fix when only benchmarking one function (no K-S test).
0.1.1 (2017-08-27)¶
- Add dependency on matplotlib.
0.1.0 (2017-08-27)¶
- First release on PyPI.
- Interactive histogram while benchmarking with bokeh.
- Interactive descriptive stats and K-S test.
- Cumulative distribution plots.
- Bucketed resampling.