Welcome to Habakkuk’s documentation!

Introduction

Habakkuk is a tool intended for analysis and performance prediction of code fragments (kernels) written in Fortran. Habakkuk is written in Python and makes use of fparser (https://github.com/stfc/fparser), a Fortran parser.

Getting going

Download

Habakkuk is available from the Python Package Index (PyPI). The project itself is hosted on GitHub (https://github.com/arporter/habakkuk).

Dependencies

Habakkuk is written in Python and so needs Python (either 2.7 or 3.6+) to be installed on the target machine. It also requires the (Python) fparser and six packages. In order to run the test suite you will require py.test.

Installation

Using pip

The recommended way of installing Habakkuk is to use pip. This will obtain the package from PyPI, together with any required dependencies:

$ pip install Habakkuk

By default, pip attempts to perform a system-wide installation, which requires root privileges. Alternatively, a user-local installation may be requested by specifying the --user flag:

$ pip install --user Habakkuk

This will install the package(s) under ${HOME}/.local. Depending on your Linux distribution, you may need to add ${HOME}/.local/bin to your PATH and ${HOME}/.local/lib/pythonX.Y/site-packages/ to your PYTHONPATH. (X.Y is the version of Python your system is running, e.g. 2.7.)

From tarball

If pip is not available then tarballs of each of the releases of Habakkuk are available on GitHub (https://github.com/arporter/habakkuk/releases). Once the tarball has been downloaded and unpacked, change to the resulting habakkuk directory and run:

$ python setup.py install

If you do not have root access then, as with using pip (above), you can specify the prefix for the install path like so:

$ python setup.py install --prefix ${HOME}/.local

Running

Habakkuk is run from the command line. The -h/--help flag will produce a list of the various available options:

$ habakkuk -h

Usage: habakkuk [options] <Fortran file(s)>

Options:
  -h, --help            show this help message and exit
  --no-prune            Do not attempt to prune duplicate operations from the
                        graph
  --no-fma              Do not attempt to generate fused multiply-add
                        operations
  --rm-scalar-tmps      Remove scalar temporaries from the DAG
  --show-weights        Display node weights in the DAG
  --unroll=UNROLL_FACTOR
                        No. of times to unroll a loop. (Applied to every loop
                        that is encountered.)

  Fortran code options:
    Specify information about Fortran codes.

    --mode=MODE         Specify Fortran code mode. Default: auto.

Habakkuk analyses Fortran source files, provided as arguments on the command line, e.g.:

$ habakkuk my_fortran_file1.f90 my_fortran_file2.F90

If all is well, you should see output similar to the following:

$ habakkuk tra_adv.F90
Habakkuk processing file 'tra_adv.F90'
Wrote DAG to tra_adv_loop1.gv
Stats for DAG tra_adv_loop1:
   0 addition operators.
   0 subtraction operators.
   0 multiplication operators.
   1 division operators.
   1 FLOPs in total.
   8 array references.
   8 distinct cache-line references.
   Naive FLOPs/byte = 0.016
   Whole DAG in serial:
     Sum of cost of all nodes = 8 (cycles)
     1 FLOPs in 8 cycles => 0.1250*CLOCK_SPEED FLOPS
     Associated mem bandwidth = 8.00*CLOCK_SPEED bytes/s
   Everything in parallel to Critical path:
     Critical path contains 4 nodes, 1 FLOPs and is 8 cycles long
     FLOPS (ignoring memory accesses) = 0.1250*CLOCK_SPEED
     Associated mem bandwidth = 8.00*CLOCK_SPEED bytes/s
 Schedule contains 1 steps:
             Execution Port
       0    1    2    3    4    5
 0   /    None None None None None (cost = 8)
   Estimate using computed schedule:
     Cost of schedule as a whole = 8 cycles
     FLOPS from schedule (ignoring memory accesses) = 0.1250*CLOCK_SPEED
     Associated mem bandwidth = 8.00*CLOCK_SPEED bytes/s
   Estimate using perfect schedule:
     Cost if all ops on different execution ports are perfectly overlapped = 8 cycles
   e.g. at 3.85 GHz, these different estimates give (GFLOPS):
   No ILP  |  Computed Schedule  |  Perfect Schedule | Critical path
    0.48   |          0.48       |        0.48       |    0.48
  with associated BW of 30.80,30.80,30.80,30.80 GB/s

Testing

The Habakkuk source contains a test suite written to use py.test. In order to run it you will need to obtain the Habakkuk source, either by downloading a tarball of one of the releases (https://github.com/arporter/habakkuk/releases) or by cloning the git repository. Assuming you have Habakkuk and py.test installed you can then do:

$ cd habakkuk/src/habakkuk/tests
$ py.test

Using Habakkuk

Performance Predictions

Note

To be written.

DAG Output

Note

To be written.

CPU Configuration

In order to predict performance on any given CPU, Habakkuk must be configured with certain parameters that describe that CPU. Currently, only parameters for the Intel Ivy Bridge micro-architecture are supplied. In this section we describe the various parameters that Habakkuk uses.

Instruction Cost

The number of cycles that a given arithmetic operation requires is fundamental to constructing a performance estimate of a kernel. For floating-point operations on Intel architectures, Agner Fog provides comprehensive data. However, since Habakkuk processes high-level Fortran code, it must also account for Fortran intrinsic operations such as SQRT and COS. The cost of these operations has been estimated by running simple micro-benchmarks (dl_microbench) on the target CPU. All of this data is provided to Habakkuk in a dictionary, OPERATORS. The keys in this dictionary are the Fortran symbols for the various arithmetic operations (e.g. “*” for multiplication) and the names of Fortran intrinsics in uppercase (e.g. “SIN”). Each entry in the dictionary is itself a dictionary with keys “cost” and “flops”. The entries for these keys are integers giving the number of cycles and number of floating-point operations, respectively, associated with the operation.
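
The layout of OPERATORS described above can be sketched as follows. This is only an illustration of the structure: every cycle count here is a placeholder rather than a value shipped with Habakkuk, and the helper total_cost_and_flops is hypothetical.

```python
# Illustrative sketch of the OPERATORS layout. All cycle counts are
# placeholders, NOT the values supplied with Habakkuk.
OPERATORS = {
    "+": {"cost": 1, "flops": 1},
    "-": {"cost": 1, "flops": 1},
    "*": {"cost": 1, "flops": 1},
    "/": {"cost": 8, "flops": 1},
    "SIN": {"cost": 40, "flops": 1},  # intrinsic; cost is a made-up figure
}


def total_cost_and_flops(ops):
    """Hypothetical helper: sum the cycles and FLOPs of a list of operations.

    This ignores any overlapping of operations, so it corresponds to
    executing everything in serial.
    """
    cycles = sum(OPERATORS[op]["cost"] for op in ops)
    flops = sum(OPERATORS[op]["flops"] for op in ops)
    return cycles, flops


cycles, flops = total_cost_and_flops(["*", "+", "/"])
print("%d FLOPs in %d cycles" % (flops, cycles))
```

Keeping "cost" and "flops" as separate keys allows an operation such as an intrinsic to take many cycles while still being counted as a fixed number of floating-point operations.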

Mapping of Instructions to Execution Ports

In the Intel Ivy Bridge micro-architecture, instructions are despatched to different “execution ports”, depending on their type. For example, floating-point multiplication and division are handled by port 0 while addition and subtraction are sent to port 1. Provided there are no dependencies between them, operations that are mapped to different ports may be executed in parallel.

The number of different execution ports is specified in NUM_EXECUTION_PORTS and the mapping of instructions to them is supplied to Habakkuk as a dictionary, CPU_EXECUTION_PORTS. The keys in this dictionary are the same as those in OPERATORS and the associated entries are simply the integer index of the corresponding execution port.
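
A minimal sketch of this mapping is given below. The port assignments for the four arithmetic operators follow the description above, and the value of six ports is inferred from the sample schedule in the Running section, which lists ports 0 to 5. Entries for intrinsics are omitted, and the helper group_by_port is hypothetical.

```python
# Sketch of the port configuration. Six ports is an assumption based on
# the sample schedule output (ports 0 to 5).
NUM_EXECUTION_PORTS = 6

# Multiplication and division go to port 0, addition and subtraction
# to port 1, as described in the text. Intrinsics are omitted here.
CPU_EXECUTION_PORTS = {"*": 0, "/": 0, "+": 1, "-": 1}


def group_by_port(ops):
    """Hypothetical helper: bin a list of operations by execution port."""
    ports = [[] for _ in range(NUM_EXECUTION_PORTS)]
    for op in ops:
        ports[CPU_EXECUTION_PORTS[op]].append(op)
    return ports


ports = group_by_port(["*", "+", "/", "-"])
for index, port_ops in enumerate(ports):
    print("port %d: %s" % (index, port_ops))
```

Operations that land in different bins are candidates for parallel execution, provided there are no dependencies between them.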

Instruction Overlapping

In addition to the instruction-level parallelism offered by having instructions despatched to different execution ports, the differing cycle counts of the various floating-point operations provide further scope for overlapping them. In particular, floating-point division on Ivy Bridge takes eight times longer than e.g. multiplication and addition/subtraction. Even though multiplication and division operations are mapped to the same execution port, the results of micro-benchmarks show that the hardware is able to perform several multiplications while a single division is in progress. Similarly, multiple addition/subtraction operations may be performed on port 1 while a single division is in progress on port 0.

Whether or not this overlapping is supported by the CPU is configured by setting SUPPORTS_DIV_MUL_OVERLAP to True or False, as appropriate. If overlapping is supported then further data is required on the degree of overlapping permitted and the cost in cycles.

MAX_DIV_OVERLAP is a dictionary with keys “*” and “+”. The entries are the maximum number of each of those operations that may be overlapped with a single division.

div_overlap_mul_cost(overlaps) is a routine that returns the cost of a division (in cycles) as a function of the number of other operations that it is overlapped with. overlaps is a dictionary holding the number of each type of operation (“*” and “+”) that has been overlapped with the division. (Subtraction operations are counted as addition operations here.)
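
The three overlap settings might fit together as in the sketch below. The overlap limits and the linear cost model are invented purely for illustration. The documentation only states that the routine returns a cost in cycles for a given set of overlaps, so the real function may look quite different.

```python
# Sketch only: the limits and the cost model below are assumptions,
# not the values or formula used by Habakkuk.
SUPPORTS_DIV_MUL_OVERLAP = True

# Maximum number of each operation that may overlap with one division
# (placeholder values).
MAX_DIV_OVERLAP = {"*": 4, "+": 6}

BASE_DIV_COST = 8  # placeholder cost of an un-overlapped division


def div_overlap_mul_cost(overlaps):
    """Return the cost (in cycles) of a division given its overlaps.

    `overlaps` holds the number of "*" and "+" operations overlapped
    with the division (subtractions are counted under "+").
    Hypothetical model: each overlapped multiplication shares port 0
    with the division and so adds one cycle; additions are free.
    """
    for op, count in overlaps.items():
        if count > MAX_DIV_OVERLAP[op]:
            raise ValueError("too many '%s' operations overlapped" % op)
    return BASE_DIV_COST + overlaps.get("*", 0)


print(div_overlap_mul_cost({"*": 0, "+": 0}))
print(div_overlap_mul_cost({"*": 2, "+": 3}))
```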

Support for Fused Multiply-Add Operations

Some micro-architectures (but not Intel Ivy Bridge) have support for fused multiply-add instructions, i.e. a*x + b can be performed in a single operation. SUPPORTS_FMA must be set to True or False, as appropriate.

Cache-line Size

The number of bytes contained in a single cache line must be supplied in CACHE_LINE_BYTES. This is used when estimating memory-bandwidth requirements.

Clock Speed

A representative clock-speed for the CPU being modelled is supplied in EXAMPLE_CLOCK_GHZ. This is used to generate concrete performance figures. Note that on CPU cores with frequency-stepping and turbo boost enabled, there is no single clock speed!
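
As a worked example of how the clock speed turns per-cycle figures into concrete ones, the sketch below reproduces the arithmetic behind the sample output in the Running section (1 FLOP in 8 cycles and 8.00 bytes per cycle at 3.85 GHz). The function names are hypothetical and the clock value is taken from that sample output.

```python
# Clock speed taken from the sample output ("e.g. at 3.85 GHz").
EXAMPLE_CLOCK_GHZ = 3.85


def gflops(flops, cycles, clock_ghz=EXAMPLE_CLOCK_GHZ):
    """Hypothetical helper: convert FLOPs-per-cycle into GFLOPS."""
    # float() keeps the division correct under Python 2.7 as well.
    return float(flops) / cycles * clock_ghz


def bandwidth_gb_per_s(bytes_per_cycle, clock_ghz=EXAMPLE_CLOCK_GHZ):
    """Hypothetical helper: convert bytes-per-cycle into GB/s."""
    return bytes_per_cycle * clock_ghz


print("%.2f GFLOPS" % gflops(1, 8))              # 0.48 GFLOPS
print("%.2f GB/s" % bandwidth_gb_per_s(8.0))     # 30.80 GB/s
```

Because the result scales linearly with the clock speed, the figures are only as representative as the value chosen for EXAMPLE_CLOCK_GHZ.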
