Now that the last post covered how to build the example for emulation and hardware deployment, we are ready to go into a bit more detail. This post walks through a simple baseline implementation of a two-layer fully connected network running inference on FPGAs.

Contents

Part 1: How to build FPGA applications on AWS
Part 2: A baseline implementation for fully connected networks (this post)

Code walkthrough

The model we’re trying to run is as simple as it gets: two fully connected layers, the first one ReLU6-activated and the second (and final) layer with a softmax activation.

class FCNN:
    def __init__(self, input_size: int, num_classes: int):
        self.layer1 = Dense(input_size, 64)
        self.layer2 = Dense(64, num_classes)

    def __call__(self, x: Tensor) -> Tensor:
        y = x
        y = self.layer1(y).relu6()
        return self.layer2(y).logsoftmax()

This means we’ll need to implement the following three kernels:

  • matrix-matrix multiplication
  • bias-addition + ReLU6 non-linearity
  • bias-addition + softmax non-linearity

The reason for separating the matrix-multiplication and bias-addition operations into two kernels is that the former is likely going to be the biggest performance bottleneck. Keeping them separate will therefore make it easier to experiment with different matrix-multiplication implementations later on.
Fusing the bias-addition and non-linearity into a single kernel, on the other hand, will likely be beneficial for performance. Both operations are relatively cheap, so the overhead of launching a separate kernel for each would be comparatively noticeable. Both also have a low computation-to-memory-transfer ratio (the bias + ReLU6 step, for example, performs only an addition and a clamp per element read from memory) and therefore benefit from executing bias-addition and non-linearity in a single pass over the data.

The structure of the model is also duplicated for the C++ implementation in the net.hpp file, which we will cover later.
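
Roughly, the C++ counterpart has the following shape. This is only a sketch pieced together from the code shown later in this post; the constructor and member details in net.hpp may differ:

class FCNN
{
public:
    // Loads weight1, bias1, weight2 and bias2 from .npy files in weight_dir.
    explicit FCNN(const std::string &weight_dir);

    // Forward pass; the implementation is walked through further below.
    Matrix operator()(Matrix &input);

private:
    Matrix weight1, bias1, weight2, bias2;
};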

Kernel implementations

The first and simplest kernel we’ll write is the bias + ReLU6 kernel. Since HLS uses C++ and we are not adding any of the optional, FPGA-specific pragmas yet, this first baseline implementation looks like standard C++ code:

extern "C" void bias_relu6_kernel(float *const activation, const float *const bias, const uint batch_size, const uint dim)
{
   for (uint b = 0; b < batch_size; b++)
   {
      for (uint d = 0; d < dim; d++)
      {
         const uint ia = dim * b + d;
         activation[ia] = relu6(activation[ia] + bias[d]);
      }
   }
}
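
The relu6 helper used above is not shown in this post; it simply clamps the biased activation to the range [0, 6]. A minimal version could look like this (the actual helper in the repository may differ):

// Minimal sketch of the relu6 helper: clamp the value to [0, 6].
static inline float relu6(const float x)
{
   return x < 0.0f ? 0.0f : (x > 6.0f ? 6.0f : x);
}

The bias + softmax kernel is very similar in structure. As a rough sketch only, assuming the same calling convention, the same uint type, and a numerically stable softmax over the class dimension (the actual kernel in the repository may differ), it could look like this:

extern "C" void bias_softmax_kernel(float *const activation, const float *const bias, const uint batch_size, const uint dim)
{
   for (uint b = 0; b < batch_size; b++)
   {
      // Add the bias and track the row maximum for numerical stability.
      float max_val = -FLT_MAX;
      for (uint d = 0; d < dim; d++)
      {
         const uint ia = dim * b + d;
         activation[ia] += bias[d];
         if (activation[ia] > max_val)
         {
            max_val = activation[ia];
         }
      }
      // Exponentiate (shifted by the row maximum) and accumulate the normalizer.
      float sum = 0.0f;
      for (uint d = 0; d < dim; d++)
      {
         const uint ia = dim * b + d;
         activation[ia] = std::exp(activation[ia] - max_val);
         sum += activation[ia];
      }
      // Normalize in place so each row sums to one.
      for (uint d = 0; d < dim; d++)
      {
         activation[dim * b + d] /= sum;
      }
   }
}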

As with the ReLU6 variant, the result of the bias + softmax kernel is computed in place, using activation both as an input and an output. This is not possible for the matrix-multiplication kernel, since the shapes of the inputs and the output can differ. Hence, we need to allocate sufficient memory outside of the kernel and pass in a pointer via the additional out argument. The naive implementation using three for-loops is certainly not optimal and will need to be revisited later:

extern "C" void matmul_kernel(const float *const matrixA, const float *const matrixB, const uint rowsA, const uint colsA, const uint colsB, float *const out)
{
   for (uint i = 0; i < rowsA; ++i)
   {
      for (uint j = 0; j < colsB; ++j)
      {
         // Nulling result here causes issues when running in hw-emu mode.
         // Looks like io isn't updated "in time"
         const uint io = colsB * i + j;
         for (uint k = 0; k < colsA; ++k)
         {
            const uint ia = colsA * i + k;
            const uint ib = colsB * k + j;
            out[io] += matrixA[ia] * matrixB[ib];
         }
      }
   }
}

The output array out should be initialized to zero outside of the kernel and all arrays are assumed to store their elements in row-major order.

Host application

So far, we’ve implemented all the code that will “run” on the FPGA device. The rest, which makes up the majority of the code, runs on the host (CPU) and is needed for memory management as well as for dispatching the kernels.

The Matrix class abstracts away the memory management as well as the host-device memory transfers. One major constraint when using Vitis is that all memory copied to or from the FPGA device needs to be page-aligned on the host, i.e. both the starting address and the allocation size have to be multiples of the page size. For this, we use a custom allocator with DEFAULT_ALIGNMENT hard-coded to the page size of 4096 bytes:

template <typename T>
T *aligned_alloc(std::size_t num, std::size_t alignment = DEFAULT_ALIGNMENT)
{
    void *ptr = nullptr;
    if (posix_memalign(&ptr, alignment, num * sizeof(T)))
    {
        throw std::bad_alloc();
    }
    return reinterpret_cast<T *>(ptr);
}
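
As a hypothetical usage fragment (the sizes are made up for illustration and are not taken from the repository):

// Allocate a page-aligned float buffer that can later be wrapped in a cl::Buffer.
const std::size_t batch_size = 32;
const std::size_t input_size = 64;
float *data = aligned_alloc<float>(batch_size * input_size);

// ... fill the buffer, create the cl::Buffer, run kernels ...

// Memory from posix_memalign is released with free(), not delete[].
free(data);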

The functions to_device and to_cpu handle the memory transfers between host and device. We also implement move semantics for the Matrix class, i.e. the move constructor and move assignment operator, to allow for copy-free return values from functions.
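
As an illustrative sketch only of what such move operations might look like (the data member name is an assumption, and the OpenCL buffer handling in the actual matrix.hpp is omitted here):

// Transfer ownership of the host buffer instead of copying it.
Matrix(Matrix &&other) noexcept
    : rows(other.rows), cols(other.cols), data(other.data)
{
    other.data = nullptr; // the moved-from object no longer owns the memory
}

Matrix &operator=(Matrix &&other) noexcept
{
    std::swap(rows, other.rows);
    std::swap(cols, other.cols);
    std::swap(data, other.data); // other's destructor frees our old buffer
    return *this;
}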
Finally, the two helper functions at the bottom of matrix.hpp abstract away the OpenCL kernel-dispatch boilerplate for us. For example, the apply_matmul function applies the matmul_kernel to two Matrix instances:

std::pair<Matrix, cl::Event> apply_matmul(Matrix &matrixA, Matrix &matrixB, cl::Kernel &kernel, std::vector<cl::Event> *wait_on = NULL, DeviceHandle &handle = HANDLE)
{
    Matrix result = Matrix::constant(matrixA.rows, matrixB.cols, 0.0, 4096);
    result.to_device(handle);
    kernel.setArg(0, matrixA.get_buffer());
    kernel.setArg(1, matrixB.get_buffer());
    kernel.setArg(2, matrixA.rows);
    kernel.setArg(3, matrixA.cols);
    kernel.setArg(4, matrixB.cols);
    kernel.setArg(5, result.get_buffer());

    cl::Event event;
    handle.q.enqueueTask(kernel, wait_on, &event);
    return std::make_pair(std::move(result), event);
}

Since OpenCL provides an asynchronous API, we do not simply invoke the kernel; instead, we enqueue the task of running it in a command queue. This task may depend on other previously enqueued tasks, which can be expressed via the (optional) wait_on argument. A handle to the newly enqueued task is returned as the second return value of type cl::Event. Only after this task has been processed does the return value result contain the computed values. To see how this is used, take a look at the forward pass of the network:

  Matrix operator()(Matrix &input)
    {
        std::vector<cl::Event> events;
        events.resize(3);
        Matrix y;
        std::tie(y, events[0]) = apply_matmul(input, weight1, MATMUL_KERNEL);

The matrix-multiplication in the first layer does not depend on any previous operation, so we do not pass the wait_on argument in this line. The resulting event is assigned to the first entry of the events vector and duplicated in the following lines to make sure every entry of events is a valid cl::Event instance:

        events[1] = events[0];
        events[2] = events[0];
        events[1] = apply_bias(y, bias1, BIAS_RELU6_KERNEL, &events);

A pointer to this vector is then passed to the bias-activation part of the first layer, since it has to wait for the preceding matrix-multiplication to finish. A similar effect could be achieved by allocating a vector of size one at this point, but then we would have to resize the vector after each additional operation. The following lines apply the second layer of the network accordingly.

        std::tie(y, events[2]) = apply_matmul(y, weight2, MATMUL_KERNEL, &events);
        apply_bias(y, bias2, BIAS_SOFTMAX_KERNEL, &events);
        return y;
    }

Now we are ready to look at the high-level implementation in main.cpp:

int main(int argc, const char *argv[])
{
    init_kernels();

    auto model = FCNN("weights/");
    auto input = Matrix::from_npy("weights/samples.npy");
    input.to_device();

The first call to init_kernels loads the OpenCL kernels from a separate binary file and stores references to them in global variables. Next, we load the model weights from separate .npy files, one for each tensor. Finally, we also load the input samples that we will feed into the model. These .npy files are prepared by the train script and stored as float32 arrays.

Next, we run the model, wait for all OpenCL events to finish, and copy the result back from the device to the host:

    auto result = model(input);

    finish_cl_queue();
    result.to_cpu();
    finish_cl_queue();
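
finish_cl_queue blocks until the command queue has been drained; the second call ensures that the device-to-host transfer enqueued by to_cpu has actually completed before we read the data. Presumably it is little more than a wrapper around the queue's finish() call, along these lines (a sketch, not the actual helper):

// Block until every enqueued kernel and memory transfer has completed.
void finish_cl_queue(DeviceHandle &handle = HANDLE)
{
    handle.q.finish();
}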

Finally, for each element in the batch, we compute the argmax of the confidence scores to get the final prediction and print it to stdout:

    // print argmax result
    for (int i = 0; i < result.rows; i++)
    {
        float maxval = -1; // softmax outputs are non-negative, so -1 is a safe lower bound
        int idx = -1;

        for (int j = 0; j < result.cols; j++)
        {
            auto val = result(i, j);
            if (maxval < val)
            {
                idx = j;
                maxval = val;
            }
            }
        }

        std::cout << idx << " ";
    }
    std::cout << std::endl;
}

In the next posts, we will look at the performance characteristics of this baseline implementation and how to both improve latency and reduce the resource requirements.

Cover Image by Pedant01 CC BY-SA 4.0