Welcome to BONSAI’s documentation!¶
BONSAI is Python based infrastructure for the autotuning of GPGPU computational kernels. BONSAI is based on LANAI, a Python based way of specifying the search space of parameters upon which to tune the kernel. These ‘iterators’ are used to create the valid parameter configurations. BONSAI will then take your templated C/C++ code and benchtest it with those configurations.
Contents¶
Installation¶
Installation is straight forward, unzip the tarball from http://icl.cs.utk.edu/bonsai/software/index.html and be sure that Python 2.7 is selected as your Python interpreter.
Tutorial¶
This is a tutorial on how to get started with BONSAI. First, BONSAI requires:
- Python 2.7+ (Python 3 is not supported)
- GCC 4.4 (or similar) or later + make
There are three primary steps in an auto-tuning workflow,
- Generating and pruning the Search Space
- Tuning and selecting the proper kernel configuration
- Analyzing the generated profiles to inform the next tuning run
Variables in the Makefile are defined in make.inc for easy customization.
There are two top level make targets, make space and make tune. The first generates the search space, and the second performs a tuning sweep over the search space with the provided kernel.
Working examples are provided in the examples folder, one CPU and one GPU, of varying complexity. The CPU example is simple with one iterator for tutorial purposes. The GPU example requires CUDA and the CUDA toolkit be installed. The GPU example is rather complex, showing how BONSAI can be incorporated into complex kernels.
Space Generation and Pruning¶
The first step in the autotuning workflow is to generate the search space for the parameters. (make space) This is done using a LANAI iterator file. For this example, we will use the following (admittedly toy) iterator.
from lanai import *
max_sz = 32
# -----------------
dim_m = range(1,max_sz)
dim_n = range(1,max_sz)
@iterator
def blk_m(dim_m):
return range(dim_m, max_sz, dim_m)
@iterator
def blk_n(dim_n):
return range(dim_n, max_sz, dim_n)
#------------------
@condition
def only_even(dim_m, dim_n):
return( (dim_m % 2) == 1 or (dim_n % 2) == 1)
dim_m and dim_n can be viewed as the total size of a matrix panel, and blk_m / blk_n the size of the blocks within the panel. In LANAI terms, m and n are called ‘expression’ or ‘global’ iterators; blk_m and blk_n are called ‘deferred’ or ‘local’ iterators since they depend on a global value. We also have one condition for these iterators, in that we only want to consider even numbers for the sizes. An important note, conditions are written as what should fail, not what should be true. (see LANAI for more)
After specifying an iterator file (ITER in make.inc), BONSAI will take it and generate a C source file (by default sharing the same name as the iterator file) that defines the search space. BONSAI then will then compile the generator file using the specified options, and output to the specified CSV file.
Integrating with the Runtime¶
The runtime is expecting two routines, the kernel to be tested and a driver for the kernel that sets the parameters. The routines can be in the same file. As an example, look at the dgemm C code below (this is from examples/gemm_cpu).
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
#include <unistd.h>
#include <time.h>
#ifndef FLOAT_TYPE
#define FLOAT_TYPE double
#endif
int main(){
// Here, BLOCK_SIZE is an iterator that is defined within
// the LANAI iterator file
struct timeval start;
struct timeval end;
unsigned long diff;
srand((unsigned) time(NULL));
int i, j, k, i_b, j_b, k_b;
int n = 1024;
int blk = BLOCK_SIZE;
FLOAT_TYPE* A = (FLOAT_TYPE*) malloc(sizeof(FLOAT_TYPE)*n*n);
FLOAT_TYPE* B = (FLOAT_TYPE*) malloc(sizeof(FLOAT_TYPE)*n*n);
FLOAT_TYPE* C = (FLOAT_TYPE*) malloc(sizeof(FLOAT_TYPE)*n*n);
for (i=0; i < n*n; i++)
A[i] = ((FLOAT_TYPE)rand() / RAND_MAX - 0.5)*2;
for (i=0; i < n*n; i++)
B[i] = ((FLOAT_TYPE)rand() / RAND_MAX - 0.5)*2;
for (i=0; i < n*n; i++)
C[i] = ((FLOAT_TYPE)rand() / RAND_MAX - 0.5)*2;
gettimeofday(&start, NULL);
for (i_b=0; i_b < n/blk; i_b++)
for (j_b=0; j_b < n/blk; j_b++)
for (k_b=0; k_b < n/blk; k_b++)
for (i=i_b*blk; i < (i_b+1)*blk; i++)
for (j=j_b*blk; j < (j_b+1)*blk; j++)
for (k=k_b*blk; k < (k_b+1)*blk; k++)
C[i + j*n] += A[i + k*n] * B[k + j*n];
gettimeofday(&end, NULL);
diff = 1000000 * (end.tv_sec - start.tv_sec) + (end.tv_usec - start.tv_usec);
printf("Time: %ld us\n", diff);
free(A);
free(B);
free(C);
return 0;
}
With the corresponding iterator file
from lanai import *
block_size = range(2, 66)
@condition
def only_even(block_size):
return block_size % 2 == 1
Note how we have a correspondence between the iterator block_size and the assignment int blk = BLOCK_SIZE. The runtime is expecting the iterator names to be used like #define macros in the kernel/driver files. The runtime will construct the correct compiler statement for each configuration using the -D flag to define the macro values.
LANAI¶
LANguage for Autotuning Infrastructure¶
LANAI is a python based method for specifing the search space and pruning constraints for autotuning programs. This is the basis for the entire BONSAI infrastructure.
General Structure¶
The general structure for a LANAI specification file is
#imports
#iterator constants
#iterator specifications
#constraint constants
#constraint specifications
LANAI allows for outside imports in its iterator files for abstraction purposes. For example, say a file that contains various CUDA constants for different compute capabilities and a file with constants for a particular card. The top of the iterator file would then be
from lanai import *
from cuda_constants import *
from Telsa_K40c import *
Following that may come any other constants that we may wish to specify.
Iterators¶
LANAI supports 2 basic iterator types (expression and deferred) each corresponding to a specific scope (global and local).
Global/Expression Iterators¶
Global/expression iterators take the form of:
dim_x = range(1, max_dim_x+1)
using the standard Python range
syntax. The 3 argument version of
range
is also supported. These are the most basic form of iterator
expressions.
Local/Deferred Iterators¶
Local/deferred iterators are allowed to take other iterators and/or
other constants as arguments. Local iterators must also be decorated
with @iterator
before the iterator definition. This takes the basic
form of
@iterator
def blk_x(dim_x):
return range(dim_x, max_dim_x+1, dim_x)
Deferred iterators are also allowed to contain more complex flow control statements as well. For example
@iterator
def vec_mul(dim_vec):
if dim_vec == 1:
return range(0, 1)
else:
return range(0, 2)
dim_vec
is also an iterator, so vec_mul
can take on different
values depending on what value dim_vec
is currently holding. It is
probably worth mentioning that the iterator definition of dim_vec
must come before any other definition where it is used as an argument.
Conditions¶
LANAI condition statements prune the search space generated by the
iterators. Each condition statement must be decorated with a
@condition
before the definition as so:
@condition
def over_max_threads(threads_per_block):
return threads_per_block > max_threads_per_block
Conditions are always written as a boolean statement that returns ‘true’ when the condition fails. Not following this can lead to unwanted behavior. As an example, look at the following two iterators and pruning condition, where we want both R1 and R2 to be greater than 0.
R1 = range(3)
R2 = range(3)
@condition
def greater_than_zero():
return R1 > 0 and R2 > 0
Without the condition, we get the search space:
R1, R2
0 , 0
0 , 1
0 , 2
1 , 0
1 , 1
1 , 2
2 , 0
2 , 1
2 , 2
Adding the condition prunes the search space to:
R1, R2
0 , 0
0 , 1
0 , 2
1 , 0
2 , 0
This seems to be a bug. The problem lies with how the condition statement was defined. We specified the condition as a statement that we want to always hold, not the condition that we want to fail. To fix this, take the logical complement of the condition:
@condition
def greater_than_zero():
return R1 <= 0 or R2 <= 0
This prunes the search to the space we expected:
R1, R2
1 , 1
1 , 2
2 , 1
2 , 2
Oddities¶
To declare a consant value, it still must be written as an iterator with size one, e.g.
tex_a = range(1,2) #
Release Notes¶
Release 17.10¶
This is a pre-alpha release. The features pertaining to the generation and pruning of the search space are useable. The parallel tuning runtime is still in early stages of development. A simple serial runtime has been included as a prototype, to demonstrate the method of integrating a kernel with the framework.