Vector Add Example

Felix Weiglhofer edited this page Apr 27, 2023 · 2 revisions

The typical 'Hello World' example for GPU programming is adding two vectors element-wise. This guide will provide a walkthrough of how to do this with xpu.

Declaring kernels

In your header file, kernels are declared as callable objects:

// VectorAdd.h
#include <xpu/device.h>

struct DeviceLib {}; // Dummy type to match kernels to a library.

struct VectorAdd : xpu::kernel<DeviceLib> {
    using context = xpu::kernel_context<xpu::no_smem>; // optional shorthand
    XPU_D void operator()(context &, xpu::buffer<const float>, xpu::buffer<const float>, xpu::buffer<float>, size_t);
};

There's already a lot going on here, so let's break it down. First we include the xpu/device.h header, which contains all declarations needed to implement kernels as well as the entire API for device-side functions. Next we declare the dummy type DeviceLib, which is required to match kernels to a device library.

The actual kernel is declared as a callable object that inherits from xpu::kernel. The first argument of every kernel is the kernel_context, which contains information about the current thread and block and additionally provides access to constant and shared memory.

Lastly, the XPU_D macro marks device-side functions and kernels. It replaces the __device__ keyword from CUDA and HIP.

Implementing kernels

In your source file, kernels are implemented as regular C++ functions:

// VectorAdd.cpp
#include "VectorAdd.h"

XPU_IMAGE(DeviceLib); // Define the device library. This call must happen in exactly one source file.

XPU_EXPORT(VectorAdd); // Export the kernel.
XPU_D void VectorAdd::operator()(context &ctx,
        xpu::buffer<const float> a, xpu::buffer<const float> b, xpu::buffer<float> c, size_t n)
{
    size_t i = ctx.block_idx_x() * ctx.block_dim_x() + ctx.thread_idx_x(); // Get the global thread index.
    if (i >= n) return; // Check if we are out of bounds.
    c[i] = a[i] + b[i];
}

The first thing we need to do is define the device library by calling XPU_IMAGE. This call must be done exactly once per device library.

To make the kernel available to the host, we need to export it by calling XPU_EXPORT. Next we implement the kernel. Here we first compute the global thread index and check if we're out of bounds. If not, we add the corresponding elements of a and b and store the result in c.

Calling kernels on the host

#include "VectorAdd.h"

#include <xpu/host.h>
#include <cassert> // for assert

int main()
{
    xpu::initialize(); // Initialize xpu.

    // Create buffers on the host and device.
    const size_t NElems = 1000;
    xpu::buffer<float> a(NElems, xpu::shared_buffer);
    xpu::buffer<float> b(NElems, xpu::shared_buffer);
    xpu::buffer<float> c(NElems, xpu::shared_buffer);

    // Fill buffers with data.
    for (size_t i = 0; i < NElems; ++i)
        a[i] = b[i] = i;

    // Run the kernel.
    xpu::queue q;
    q.launch<VectorAdd>(xpu::n_threads(NElems), a, b, c, NElems);
    q.wait();

    // Check the result.
    for (size_t i = 0; i < NElems; ++i)
        assert(c[i] == 2 * i);
}

First we include VectorAdd.h for the kernel declaration and xpu/host.h for the host-side API. Then we call xpu::initialize to initialize xpu. This must happen before any other xpu function is called.

Next we create buffers to allocate device memory and fill it with data. In this case we use a shared buffer (also called managed memory in CUDA) that can be accessed by the host and device. The GPU driver will automatically copy data to and from the device as needed. For an example of how to copy your data by hand, see the manual data transfer section.

To run the kernel, we first create a queue and call queue::launch, passing the kernel as a template argument. The next argument specifies the launch size: xpu::n_threads takes the total number of threads to run, while xpu::n_blocks can be used instead to specify the number of blocks. The remaining arguments are forwarded to the kernel. Here we pass the buffers we created earlier, plus the number of elements, which is needed because the buffers don't know their size.

Finally we check the result. Since a[i] == b[i] == i, every element of c should equal 2 * i.

Compiling and running the example

TODO

Note: Host code can call any kernel it's linked against. In this example, for instance, we could have compiled the kernel into a separate library and linked our executable against it.

Manual data transfer