perfume: Interactive performance benchmarking in Jupyter

Contents:

perfume

Interactive performance benchmarking in Jupyter

Overview

perfume is a performance benchmarking tool that provides quick feedback on the systems under test.

The primary goals are:

  • Prioritize analysis of distributions of latency, not averages.

  • Support both immediate feedback and robust benchmarking with many samples, through a UI that updates as we collect more information.

  • Provide raw data back to the user, for flexible custom analysis.

  • Provide helpful post-processing analysis and charting tools.

Features

  • Live-updating histogram chart and descriptive statistics during a benchmark run.

  • Jupyter notebook integration.

  • Tunable benchmarking overhead.

  • Comparative analysis of multiple functions under test.

  • Powerful post-processing analysis tools.

Demo

You can check out an example notebook using perfume.

(Demo: an animated view of a live benchmark run, and a cumulative quantiles plot.)

Installing

pip install perfume-bench

Credits

This package was created with Cookiecutter and the audreyr/cookiecutter-pypackage project template.

Installation

Stable release

To install perfume, run this command in your terminal:

$ pip install perfume-bench

This is the preferred method to install perfume, as it will always install the most recent stable release.

If you don’t have pip installed, this Python installation guide can guide you through the process.

From sources

The sources for perfume can be downloaded from the Github repo.

You can either clone the public repository:

$ git clone https://github.com/leifwalsh/perfume

Or download the tarball:

$ curl -OL https://github.com/leifwalsh/perfume/tarball/master

Once you have a copy of the source, you can install it with:

$ python setup.py install

Usage

To use perfume in a project:

import perfume
# In a Jupyter notebook, you'll want these to get the plots to
# display inline
import bokeh.io
bokeh.io.output_notebook()

Collecting samples

The entry point to perfume is perfume.bench(), which takes Python functions as arguments and benchmarks them together. While it runs, it displays and continuously updates:

  • A plot showing histograms of sampled latencies for each function, with a kernel density estimator and 25th, 50th, and 75th percentiles.

  • A table of pandas.DataFrame.describe() output for the sampled timings.

  • Two pairwise Kolmogorov-Smirnov test tables: one on the raw samples, and one on bucketed resamples, which is more sensitive to outliers (more on how below).

(Animated view of a live benchmark run.)

Analyzing results

When you run perfume.bench(), it returns a DataFrame of the samples, represented as wall-clock begin and end times. There are several built-in ways to interpret these, collected in perfume.analyze. A few examples:

  • perfume.analyze.timings() interprets the begin/end times in the samples DataFrame and computes the differences between end and begin, giving you a new DataFrame containing elapsed time values.

  • perfume.analyze.isolate() adjusts each function’s begin/end times to be as if that function was benchmarked in isolation, so each begin matches the previous end. This can be interpreted as “time to simulated completion” of a given fixed-size workload.

  • perfume.analyze.ks_test() runs the Kolmogorov-Smirnov test across the benchmarked functions (given output from perfume.analyze.timings()), giving pairwise results. The 2-sample K-S test is a measure of how different two distributions are (briefly, it’s the largest vertical distance between their ECDFs). For a function you think you’ve optimized, you can run the old and new versions and use the K-S test to get a sense for how likely it is that you’ve made a consistent improvement.

  • perfume.analyze.cumulative_quantiles_plot() creates a plot over time of the median, upper and lower quantiles, and min and max, for all samples collected up until that point in simulation time. Each function is charted with its timings in isolation (see perfume.analyze.isolate()), so faster functions cover less of the x-axis. For example:

    (Figure: cumulative quantiles plot.)

See perfume.analyze for the full set of analysis tools.
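For intuition, the timings() transformation can be reproduced with plain pandas on a toy frame in the same layout bench() returns. The frame below is hand-made for illustration, not real benchmark output:

```python
import pandas as pd

# Hypothetical samples frame in perfume's layout: the columns are a
# MultiIndex of (function name, 'begin'/'end'), times in milliseconds.
cols = pd.MultiIndex.from_product(
    [["fn_a", "fn_b"], ["begin", "end"]],
    names=["function", "timing"])
samples = pd.DataFrame(
    [[0.0, 1.0, 1.0, 3.0],
     [3.0, 4.5, 4.5, 6.0],
     [6.0, 7.0, 7.0, 9.5]],
    columns=cols)

# Equivalent of perfume.analyze.timings(): end minus begin, per row.
timings = (samples.xs("end", axis=1, level="timing")
           - samples.xs("begin", axis=1, level="timing"))
# fn_a elapsed: 1.0, 1.5, 1.0 ; fn_b elapsed: 2.0, 1.5, 2.5
```

The result is one elapsed-time column per function, which is the shape the downstream analysis tools expect.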

Example notebook

[1]:
import perfume
import perfume.analyze
import pandas as pd
import bokeh.io
bokeh.io.output_notebook()

Setup

To start, set up some functions to benchmark.

[2]:
import time
import numpy as np

def test_fn_1():
    good = np.random.poisson(20)
    bad = np.random.poisson(100)
    msec = np.random.choice([good, bad], p=[.99, .01])
    time.sleep(msec / 3000.)

def test_fn_1_no_outliers():
    time.sleep(np.random.poisson(20) / 3000.)

def test_fn_2():
    good = np.random.poisson(5)
    bad = np.random.poisson(150)
    msec = np.random.choice([good, bad], p=[.95, .05])
    time.sleep(msec / 3000.)

def test_fn_3():
    msec = max(1, np.random.normal(100, 10))
    time.sleep(msec / 3000.)

numbers = np.arange(0, 1, 1. / (3 * 5000000))

def test_fn_4():
    return np.sum(numbers)

# Create a variable named "samples", in this cell.  This way,
# if we change these functions, we'll reset the samples so we
# don't use old data with changed implementations.
samples = None

Benchmark

Run the benchmark for a while by executing this cell. Since we capture the output data in samples, and pass it back in as an argument, you can interrupt the cell, take a look at the output so far, and then execute this cell again to resume the benchmark.

[3]:
samples = perfume.bench(test_fn_1, test_fn_2, test_fn_3, test_fn_4,
                        samples=samples)
Descriptive Timing Statistics

function   test_fn_1  test_fn_2  test_fn_3  test_fn_4
count      158        158        158        158
mean       7.1        3.42       33.5       13.5
std        2.38       8.59       3.04       2.19
min        3.28       0.548      25.7       8.19
25%        5.91       1.34       31.4       12.9
50%        6.9        1.84       33.6       14.2
75%        7.97       2.35       35.7       14.9
max        30         56.7       41.3       16.9

K-S test Z

           test_fn_2  test_fn_3  test_fn_4
test_fn_1  8.49       8.83       7.65
test_fn_2  nan        8.61       8.55
test_fn_3  nan        nan        8.89

Bucketed K-S test Z

           test_fn_2  test_fn_3  test_fn_4
test_fn_1  16         22         22
test_fn_2  nan        22         22
test_fn_3  nan        nan        22

Analyzing the samples

Let’s look at the format of the output: each function execution gets its begin and end time recorded.

[4]:
samples.head()
[4]:
function test_fn_1 test_fn_2 test_fn_3 test_fn_4
timing begin end begin end begin end begin end
0 6.668990e+07 6.668991e+07 6.668991e+07 6.668991e+07 6.668991e+07 6.668994e+07 6.668994e+07 6.668995e+07
1 6.668995e+07 6.668995e+07 6.668995e+07 6.668995e+07 6.668995e+07 6.668998e+07 6.668998e+07 6.668999e+07
2 6.668999e+07 6.669000e+07 6.669000e+07 6.669000e+07 6.669000e+07 6.669004e+07 6.669004e+07 6.669004e+07
3 6.669004e+07 6.669005e+07 6.669005e+07 6.669006e+07 6.669006e+07 6.669009e+07 6.669009e+07 6.669010e+07
4 6.669010e+07 6.669011e+07 6.669011e+07 6.669011e+07 6.669011e+07 6.669014e+07 6.669014e+07 6.669015e+07

One thing we can do is plot each function’s distribution as it develops over simulated time:

[5]:
perfume.analyze.cumulative_quantiles_plot(samples)

We can run a K-S test and see whether our functions are significantly different:

[6]:
perfume.analyze.ks_test(perfume.analyze.timings(samples))
[6]:
test_fn_2 test_fn_3 test_fn_4
K-S test Z
test_fn_1 8.494414 8.831940 7.650598
test_fn_2 NaN 8.606922 8.550668
test_fn_3 NaN NaN 8.888194

We can convert the samples to elapsed timings instead of begin/end time points, get bucket-resampled timings in which outliers show a stronger presence, or isolate the samples as if each function had run by itself:

[7]:
timings = perfume.analyze.timings(samples)
bt = perfume.analyze.bucket_resample_timings(samples)
isolated = perfume.analyze.isolate(samples)
isolated.head()
[7]:
function test_fn_1 test_fn_2 test_fn_3 test_fn_4
timing begin end begin end begin end begin end
0 0.000000 7.879083 0.000000 1.532343 0.000000 29.188473 0.000000 9.990919
1 7.879083 13.432623 1.532343 2.720641 29.188473 59.487223 9.990919 19.318506
2 13.432623 21.638016 2.720641 3.269045 59.487223 93.088024 19.318506 28.697242
3 21.638016 29.194047 3.269045 6.157080 93.088024 128.720489 28.697242 39.406731
4 29.194047 34.441334 6.157080 7.378798 128.720489 160.600259 39.406731 48.569794

With these, and other charting libraries, you can do whatever you want with the data:

[8]:
from bokeh import palettes
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

fig, ax = plt.subplots(figsize=(16, 9))
for col, color in zip(timings.columns, palettes.Set1[len(timings.columns)]):
    sns.distplot(timings[col], label=col, color=color, ax=ax,
#                  hist_kws=dict(cumulative=True),
#                  kde_kws=dict(cumulative=True)
                )
ax.set_xlabel('millis')
ax.legend()
timings.describe()
[8]:
function test_fn_1 test_fn_2 test_fn_3 test_fn_4
count 158.000000 158.000000 158.000000 158.000000
mean 7.099615 3.423204 33.526883 13.455989
std 2.382655 8.588543 3.039051 2.193190
min 3.280499 0.548404 25.672770 8.190510
25% 5.908776 1.335742 31.432720 12.909524
50% 6.904165 1.840656 33.613588 14.226354
75% 7.966824 2.353760 35.680228 14.863801
max 29.985672 56.681542 41.308623 16.934982
(Figure: overlaid timing distributions for each function.)
[9]:
import matplotlib.pyplot as plt
timings['test_fn_1'].hist(cumulative=True, density=True, alpha=0.3)
timings['test_fn_2'].hist(cumulative=True, density=True, alpha=0.3)
[9]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f94c9fa8f28>
(Figure: cumulative histograms of test_fn_1 and test_fn_2 timings.)
[10]:
import matplotlib.pyplot as plt
bt['test_fn_1'].hist(cumulative=True, density=True, alpha=0.3)
bt['test_fn_2'].hist(cumulative=True, density=True, alpha=0.3)
[10]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f94c9d21a58>
(Figure: cumulative histograms of bucket-resampled timings for test_fn_1 and test_fn_2.)
[11]:
sns.pairplot(timings#, diag_kws={'cumulative': True}
            )
[11]:
<seaborn.axisgrid.PairGrid at 0x7f94c9ad46a0>
(Figure: pair plot of the timings.)
[12]:
import scipy.stats
bt = perfume.analyze.bucket_resample_timings(samples)
(scipy.stats.ks_2samp(timings['test_fn_1'], timings['test_fn_2']),
 scipy.stats.ks_2samp(bt['test_fn_1'], bt['test_fn_2']))
[12]:
(Ks_2sampResult(statistic=0.95569620253164556, pvalue=5.5872505324246181e-65),
 Ks_2sampResult(statistic=0.71199999999999997, pvalue=4.691029271698989e-223))

perfume

perfume package

Submodules

perfume.analyze module

perfume.analyze contains transformation and analysis tools.

These functions mostly take as input the samples collected by perfume.bench().

perfume.analyze.bucket_resample_timings(samples, sample_size=10, agg=<function mean>, sample_count=1000)[source]
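bucket_resample_timings() carries no docstring here; judging from its signature and its use in the example notebook, bucketed resampling can be sketched roughly as follows. This is an illustrative NumPy sketch, not perfume's actual implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
timings = rng.exponential(1.0, size=500)   # toy elapsed-time samples

# Draw many small buckets with replacement and aggregate each one
# (here with the mean).  A rare outlier landing anywhere in a bucket
# shifts that whole bucket's aggregate, so outliers show a stronger
# presence in the resampled distribution than in raw samples.
sample_size, sample_count = 10, 1000
buckets = rng.choice(timings, size=(sample_count, sample_size),
                     replace=True)
resampled = buckets.mean(axis=1)           # one aggregate per bucket
```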
perfume.analyze.cumulative_quantiles(samples, rng=None)[source]

Computes “cumulative quantiles” for each function.

That is, for each time, what are the extremes, median, and 25th/75th percentiles for all observations up until that point.
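The "up until that point" computation is an expanding-window statistic; a minimal pandas illustration on toy data (not perfume's implementation):

```python
import pandas as pd

# Toy elapsed timings, in arrival order.
t = pd.Series([1.0, 3.0, 2.0, 5.0, 4.0])

# At each observation: the median of everything seen so far.
cumulative_median = t.expanding().median()
# -> 1.0, 2.0, 2.0, 2.5, 3.0

# The 25th/75th percentiles and the extremes work the same way.
q25 = t.expanding().quantile(0.25)
q75 = t.expanding().quantile(0.75)
```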

perfume.analyze.cumulative_quantiles_plot(samples, plot_width=960, plot_height=480, show_samples=True)[source]

Plots the cumulative quantiles along with a scatter plot of observations.

perfume.analyze.isolate(samples)[source]

For each function, isolates its begin and end times.

Within each function’s begins and ends, each begin will be equal to the previous end. This gives a sequence of begins and ends as if each function were run in isolation with no benchmarking overhead.
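The adjustment described above amounts to a cumulative sum of per-call durations; a pandas sketch on toy begin/end times (not perfume's actual code):

```python
import pandas as pd

# Toy wall-clock begin/end times for one function, with gaps between
# runs (time spent benchmarking other functions, rendering, etc.).
begin = pd.Series([0.0, 5.0, 9.0])
end = pd.Series([1.0, 6.5, 10.0])

durations = end - begin                        # 1.0, 1.5, 1.0
iso_end = durations.cumsum()                   # 1.0, 2.5, 3.5
iso_begin = iso_end.shift(1, fill_value=0.0)   # 0.0, 1.0, 2.5
# Each begin now equals the previous end: the gaps are gone, as if
# the function had run back-to-back in isolation.
```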

perfume.analyze.ks_test(t)[source]

Runs the Kolmogorov-Smirnov test across functions.

Returns a DataFrame containing all pairwise K-S test results.

The standard K-S test computes \(D\), which is the maximum difference between the empirical CDFs.

The result value we return is the \(Z\) value, defined as

\[Z = D / \sqrt{(n + m) / (n m)}\]

where \(n\) and \(m\) are the respective sample sizes.

\(Z\) is typically interpreted using a lookup table, i.e. for confidence level \(\alpha\), we want to see a \(Z\) greater than \(c(\alpha)\):

\(\alpha\)      0.10   0.05   0.025   0.01   0.005   0.001
\(c(\alpha)\)   1.22   1.36   1.48    1.63   1.73    1.95
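For reference, \(Z\) can be computed by hand from scipy's two-sample K-S statistic. The latencies below are synthetic:

```python
import numpy as np
import scipy.stats

rng = np.random.default_rng(0)
old = rng.normal(10.0, 1.0, size=200)   # synthetic "before" latencies (ms)
new = rng.normal(9.5, 1.0, size=200)    # synthetic "after" latencies (ms)

result = scipy.stats.ks_2samp(old, new)
d = result.statistic                    # max vertical ECDF distance
n, m = len(old), len(new)
z = d / np.sqrt((n + m) / (n * m))      # the Z value defined above
# Compare z against c(alpha) from the table to judge significance.
```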

perfume.analyze.timings(samples)[source]

Converts samples to sample times per observation.

perfume.analyze.timings_in_context(samples)[source]

Returns a sparse dataframe with a time index, with timings.

Each cell contains the timing observed, at the time when it was observed. Therefore, each row will have NaNs except for the function whose sample completed at that time.

perfume.colors module

perfume.colors.colors(num_colors)[source]

perfume.perfume module

Main module.

class perfume.perfume.Display(names, initial_size, width=900, height=480)[source]

Bases: object

elapsed_rendering_ratio()[source]
initialize_plot(title)[source]
update(samples)[source]
class perfume.perfume.Timer[source]

Bases: object

property begin
elapsed_seconds()[source]
property end
classmethod time(fn, *args, **kwargs)[source]
perfume.perfume.bench(*fns, samples=None, efficiency=0.9)[source]

Benchmarks functions, displaying results in a Jupyter notebook.

Runs fns repeatedly, collecting timing information, until KeyboardInterrupt is raised, at which point benchmarking stops and the results so far are returned.

Parameters
  • fns (list of callable) – A list of functions to benchmark and compare

  • samples (pandas.DataFrame) – Optionally, pass the results of a previous call to bench() to continue from its already collected data.

  • efficiency (float) – Number between 0 and 1: the target fraction of time spent actually running the functions under test (so up to \(1 - efficiency\) of the time is spent analyzing and rendering plots).

Returns

A dataframe containing the results so far. The row index is just an autoincrement integer, and the column index is a MultiIndex where the first level is function name and the second level is begin or end.

Return type

pandas.DataFrame

Module contents

Top-level package for perfume.

perfume.bench(*fns, samples=None, efficiency=0.9)[source]

Re-exported from perfume.perfume; see perfume.perfume.bench() above for full documentation.

Contributing

Contributions are welcome, and they are greatly appreciated! Every little bit helps, and credit will always be given.

You can contribute in many ways:

Types of Contributions

Report Bugs

Report bugs at https://github.com/leifwalsh/perfume/issues.

If you are reporting a bug, please include:

  • Your operating system name and version.

  • Any details about your local setup that might be helpful in troubleshooting.

  • Detailed steps to reproduce the bug.

Fix Bugs

Look through the GitHub issues for bugs. Anything tagged with “bug” and “help wanted” is open to whoever wants to implement it.

Implement Features

Look through the GitHub issues for features. Anything tagged with “enhancement” and “help wanted” is open to whoever wants to implement it.

Write Documentation

perfume could always use more documentation, whether as part of the official perfume docs, in docstrings, or even on the web in blog posts, articles, and such.

Submit Feedback

The best way to send feedback is to file an issue at https://github.com/leifwalsh/perfume/issues.

If you are proposing a feature:

  • Explain in detail how it would work.

  • Keep the scope as narrow as possible, to make it easier to implement.

  • Remember that this is a volunteer-driven project, and that contributions are welcome :)

Get Started!

Ready to contribute? Here’s how to set up perfume for local development.

  1. Fork the perfume repo on GitHub.

  2. Clone your fork locally:

    $ git clone git@github.com:your_name_here/perfume.git
    
  3. Install your local copy into a virtualenv. Assuming you have virtualenvwrapper installed, this is how you set up your fork for local development:

    $ mkvirtualenv perfume
    $ cd perfume/
    $ python setup.py develop
    
  4. Create a branch for local development:

    $ git checkout -b name-of-your-bugfix-or-feature
    

    Now you can make your changes locally.

  5. When you’re done making changes, check that your changes pass flake8 and the tests, including testing other Python versions with tox:

    $ flake8 perfume tests
    $ python setup.py test or py.test
    $ tox
    

    To get flake8 and tox, just pip install them into your virtualenv.

  6. Commit your changes and push your branch to GitHub:

    $ git add .
    $ git commit -m "Your detailed description of your changes."
    $ git push origin name-of-your-bugfix-or-feature
    
  7. Submit a pull request through the GitHub website.

Pull Request Guidelines

Before you submit a pull request, check that it meets these guidelines:

  1. The pull request should include tests.

  2. If the pull request adds functionality, the docs should be updated. Put your new functionality into a function with a docstring, and add the feature to the list in README.rst.

  3. The pull request should work for Python 2.6, 2.7, 3.3, 3.4 and 3.5, and for PyPy. Check https://travis-ci.org/leifwalsh/perfume/pull_requests and make sure that the tests pass for all supported Python versions.

Tips

To run a subset of tests:

$ python -m unittest tests.test_perfume

Credits

Development Lead

Contributors

None yet. Why not be the first?

History

0.1.3 (2017-09-10)

  • Actually fix when only benchmarking one function (no K-S test) (oops).

0.1.2 (2017-09-10)

  • Fix when only benchmarking one function (no K-S test).

0.1.1 (2017-08-27)

  • Add dependency on matplotlib.

0.1.0 (2017-08-27)

  • First release on PyPI.

  • Interactive histogram while benchmarking with bokeh.

  • Interactive descriptive stats and K-S test.

  • Cumulative distribution plots.

  • Bucketed resampling.
