Summary

Yapocis is a Python package that makes it easier to call high-performance OpenCL kernels from high-level Python code. It depends on, and was inspired by, PyOpenCL, which does the hard work of making OpenCL callable at all. Yapocis is intended to make calling it much less painful.

The code is currently developed and tested with Python 3.7 on macOS Big Sur. It was originally developed with Python 2.7 on Snow Leopard, and only modest changes were needed to jump forward about ten years. In addition, the code has been cleaned up, refactored, and extended with kernels and interfaces for on-device graphics using OpenCL.

Quick look

Here’s a quick comparison of three ways to get an embarrassingly parallel task done from Python:

First the easy way, with numpy:

#  Copyright Sean D. True & SwapWizard dba (c) 2011-2021.

import numpy
import numpy.linalg as la
import time

a = numpy.random.rand(50000)
b = numpy.random.rand(50000)

t = time.time()
a_plus_b = a+b
print("Elapsed:", time.time() - t)

print((la.norm(a_plus_b - (a+b)), la.norm(a_plus_b)))

Second, the hard and fast way, with raw pyopencl:

#  Copyright Sean D. True & SwapWizard dba (c) 2011-2021.

import pyopencl as cl
import numpy
import numpy.linalg as la
import time

# Make sure we allocate arrays of the type expected
a = numpy.random.rand(50000).astype(numpy.float32)
b = numpy.random.rand(50000).astype(numpy.float32)

# Talk to the compiler and linker, setup the runtime
ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)

t = time.time()

# Allocate buffers on the device to hold our data
# and copy it down to the device
mf = cl.mem_flags
a_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=a)
b_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=b)
dest_buf = cl.Buffer(ctx, mf.WRITE_ONLY, b.nbytes)

# Compile the kernel
prg = cl.Program(ctx, """
    __kernel void sum(__global const float *a,
    __global const float *b, __global float *c)
    {
      int gid = get_global_id(0);
      c[gid] = a[gid] + b[gid];
    }
    """).build()

# Run the kernel
prg.sum(queue, a.shape, None, a_buf, b_buf, dest_buf)

# And more buffer management to get the data back
a_plus_b = numpy.empty_like(a)
cl.enqueue_copy(queue, a_plus_b, dest_buf)

# But the answer comes back fast, and good.
print((la.norm(a_plus_b - (a+b)), la.norm(a_plus_b)))
print("Elapsed:", time.time() - t)

Finally, the easy and fast way, with yapocis:

#  Copyright Sean D. True & SwapWizard dba (c) 2011-2021.

import yapocis.rpc.programs
from yapocis.rpc import engines, interfaces
import numpy
import numpy.linalg as la
import time

# We don't _need_ to specify the dtype: that's done in the interface
a = numpy.random.rand(50000)
b = numpy.random.rand(50000)

t = time.time()
# We don't need to manage a compiler and linker, that can be done for us
demo = yapocis.rpc.programs.load_program("demo")

# We don't need to manage buffers, that can be done for us
a_plus_b = demo.sum(a,b)
print((la.norm(a_plus_b - (a+b)), la.norm(a_plus_b)))
print("Elapsed:", time.time() - t)

# And there are interesting stats available
print("Stats", demo)