From Scratch: Neural Network Inference on FPGAs – Part 2
A baseline implementation for fully connected networks
Now that the last post has covered how to build the example for emulation and hardware deployment, we are ready to go into a bit more detail. This post walks through a simple baseline implementation of a two-layer fully connected network running inference on an FPGA.
Contents
Part 1: How to build FPGA applications on AWS
Part 2: A baseline implementation for fully connected networks (this post)
Code walkthrough
The model we’re trying to run is as simple as it gets: two fully connected layers, the first one ReLU6-activated and the second (and final) layer with a softmax activation.
class FCNN:
    def __init__(self, input_size: int, num_classes: int):
        self.layer1 = Dense(input_size, 64)
        self.layer2 = Dense(64, num_classes)

    def __call__(self, x: Tensor) -> Tensor:
        y = x
        y = self.layer1(y).relu6()
        return self.layer2(y).logsoftmax()
This means we’ll need to implement the following three kernels:
- matrix-matrix multiplication
- bias-addition + ReLU6 non-linearity
- bias-addition + softmax non-linearity
The reason for separating the matrix-multiplication and bias-addition operations into two kernels is that the former is likely to be the biggest performance bottleneck.
Keeping them separate will therefore make it easier to experiment with different matrix-multiplication implementations later on.
Fusing the bias-addition and non-linearity into a single kernel, on the other hand, will likely benefit performance.
Both operations are relatively fast, so the overhead of launching an additional kernel would be more noticeable.
Both also have a low computation-to-memory-transfer ratio and hence benefit from reducing memory accesses by applying the bias-addition and non-linearity in one pass.
The structure of the model is also duplicated for the C++ implementation in the net.hpp file, which we will cover later.
Kernel implementations
The first and simplest kernel we’ll write is the bias + ReLU6 kernel. Since HLS uses C++ and we are not yet adding any optional, FPGA-specific code, this first baseline implementation looks like standard C++:
extern "C" void bias_relu6_kernel(float *const activation, const float *const bias, const uint batch_size, const uint dim)
{
for (uint b = 0; b < batch_size; b++)
{
for (uint d = 0; d < dim; d++)
{
const uint ia = dim * b + d;
activation[ia] = relu6(activation[ia] + bias[d]);
}
}
}
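The relu6 helper used above is not part of the kernel listing. A minimal sketch of what it could look like is shown below; the actual helper in the repo may be defined slightly differently, e.g. in a shared header:

// Hedged sketch: clamp the input to the range [0, 6], i.e. min(max(x, 0), 6).
static inline float relu6(const float x)
{
    return (x < 0.0f) ? 0.0f : (x > 6.0f ? 6.0f : x);
}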
The implementation for the bias + softmax kernel is very similar.
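As an illustration, a minimal sketch of such a kernel is given below. The exact version in the repo may differ, for example in how (or whether) it handles numerical stability; the uint type and the in-place convention mirror the kernel above:

#include <cmath> // expf, INFINITY

// Hedged sketch of a bias + softmax kernel: adds the bias and normalizes each
// row of the activation matrix in place so that it sums to one.
extern "C" void bias_softmax_kernel(float *const activation, const float *const bias, const uint batch_size, const uint dim)
{
    for (uint b = 0; b < batch_size; b++)
    {
        // Add the bias and track the row maximum for numerical stability.
        float maxval = -INFINITY;
        for (uint d = 0; d < dim; d++)
        {
            const uint ia = dim * b + d;
            activation[ia] += bias[d];
            maxval = activation[ia] > maxval ? activation[ia] : maxval;
        }
        // Exponentiate (shifted by the maximum) and accumulate the normalizer.
        float sum = 0.0f;
        for (uint d = 0; d < dim; d++)
        {
            const uint ia = dim * b + d;
            activation[ia] = expf(activation[ia] - maxval);
            sum += activation[ia];
        }
        // Normalize the row.
        for (uint d = 0; d < dim; d++)
        {
            activation[dim * b + d] /= sum;
        }
    }
}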
Note that the result is computed in place, using activation both as an input and an output.
This is not possible for the matrix-multiplication kernel, since the shapes of its inputs and output generally differ. Hence, we need to allocate sufficient memory outside of the kernel and pass in a pointer via the additional out argument.
The naive implementation using three for-loops is certainly not optimal and will need to be revisited later:
extern "C" void matmul_kernel(const float *const matrixA, const float *const matrixB, const uint rowsA, const uint colsA, const uint colsB, float *const out)
{
for (uint i = 0; i < rowsA; ++i)
{
for (uint j = 0; j < colsB; ++j)
{
// Nulling result here causes issues when running in hw-emu mode.
// Looks like io isn't updated "in time"
const uint io = colsB * i + j;
for (uint k = 0; k < colsA; ++k)
{
const uint ia = colsA * i + k;
const uint ib = colsB * k + j;
out[io] += matrixA[ia] * matrixB[ib];
}
}
}
}
The output array out should be initialized to zero outside of the kernel, and all arrays are assumed to store their elements in row-major order.
Host application
So far, we’ve implemented all the code that will “run” on the FPGA device. The rest, which makes up the majority of the code, runs on the host (CPU) and is needed for memory management as well as for dispatching the kernels.
The Matrix class abstracts away the memory management as well as the host-device memory transfers.
One major constraint when using Vitis is that all memory copied to or from the FPGA device needs to be page-aligned on the host, i.e. both the starting address and the allocation size have to be divisible by the page size. For this, we use a custom allocator with DEFAULT_ALIGNMENT hard-coded to the page size of 4096 bytes:
template <typename T>
T *aligned_alloc(std::size_t num, std::size_t alignment = DEFAULT_ALIGNMENT)
{
    void *ptr = nullptr;
    if (posix_memalign(&ptr, alignment, num * sizeof(T)))
    {
        throw std::bad_alloc();
    }
    return reinterpret_cast<T *>(ptr);
}
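Note that memory obtained via posix_memalign is released with the regular free. A short usage sketch (the variable names and sizes here are illustrative, not taken from the repo):

// Allocate a page-aligned buffer for a rows x cols float matrix.
const std::size_t rows = 64, cols = 10;           // illustrative sizes
float *data = aligned_alloc<float>(rows * cols);  // aligned to 4096 bytes
// ... fill the buffer, wrap it in a cl::Buffer, etc. ...
free(data);                                       // posix_memalign memory is freed with free()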
The functions to_device and to_cpu handle memory transfers between the host and the device.
We implement move semantics for the Matrix class by providing a move constructor and a move assignment operator, which allows for copy-free return values from functions.
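As a rough sketch, the move operations might look like the following; the member names (data for the host pointer, buffer for the cl::Buffer) are assumptions for illustration and may not match the repo exactly:

// Hedged sketch of Matrix move semantics: transfer ownership of the aligned
// host allocation and leave the moved-from object in a safe, empty state.
Matrix(Matrix &&other) noexcept
    : rows(other.rows), cols(other.cols),
      data(other.data), buffer(std::move(other.buffer))
{
    other.data = nullptr;
    other.rows = other.cols = 0;
}

Matrix &operator=(Matrix &&other) noexcept
{
    if (this != &other)
    {
        free(data); // release our own page-aligned allocation first
        rows = other.rows;
        cols = other.cols;
        data = other.data;
        buffer = std::move(other.buffer);
        other.data = nullptr;
        other.rows = other.cols = 0;
    }
    return *this;
}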
Finally, the two helper functions at the bottom of matrix.hpp abstract away the OpenCL kernel-dispatch overhead for us. For example, the apply_matmul function applies the matmul_kernel to two Matrix instances:
std::pair<Matrix, cl::Event> apply_matmul(Matrix &matrixA, Matrix &matrixB, cl::Kernel &kernel, std::vector<cl::Event> *wait_on = NULL, DeviceHandle &handle = HANDLE)
{
    Matrix result = Matrix::constant(matrixA.rows, matrixB.cols, 0.0, 4096);
    result.to_device(handle);

    kernel.setArg(0, matrixA.get_buffer());
    kernel.setArg(1, matrixB.get_buffer());
    kernel.setArg(2, matrixA.rows);
    kernel.setArg(3, matrixA.cols);
    kernel.setArg(4, matrixB.cols);
    kernel.setArg(5, result.get_buffer());

    cl::Event event;
    handle.q.enqueueTask(kernel, wait_on, &event);
    return std::make_pair(std::move(result), event);
}
Since OpenCL provides an asynchronous API, we do not simply invoke the kernel; instead, we enqueue the task of running it in a command queue.
This task may depend on other previously enqueued tasks, which can be expressed using the (optional) wait_on argument.
A reference to the newly enqueued task is returned as the second return value of type cl::Event.
Only after this task has been processed does the return value result contain the computed values.
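In the simplest case, a caller could block on that event before reading the result back. A minimal usage sketch is given below; the forward pass shown next instead chains the events and only synchronizes once at the end:

// Usage sketch: wait for the enqueued matmul to finish, then copy the result back.
Matrix y;
cl::Event done;
std::tie(y, done) = apply_matmul(input, weight1, MATMUL_KERNEL);
done.wait();        // block until the kernel has completed on the device
y.to_cpu();         // enqueue the copy back to the host
finish_cl_queue();  // make sure the transfer has finished before reading y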
To see how this is used, take a look at the forward-pass of the network:
Matrix operator()(Matrix &input)
{
    std::vector<cl::Event> events;
    events.resize(3);
    Matrix y;
    std::tie(y, events[0]) = apply_matmul(input, weight1, MATMUL_KERNEL);
The matrix-multiplication in the first layer does not depend on any other operation. Therefore, we do not pass the wait_on argument in this line. The resulting event is then assigned to the first entry of the events vector and duplicated in the following lines to make sure every entry of events is a valid cl::Event instance:
    events[1] = events[0];
    events[2] = events[0];
    events[1] = apply_bias(y, bias1, BIAS_RELU6_KERNEL, &events);
A pointer to this vector is then passed to the bias-activation part of the first layer, since it depends on the prior matrix-multiplication having finished. A similar effect could be achieved by allocating a vector of size one at this point, but then we would have to resize the vector after each additional operation. The following lines apply the second layer of the network accordingly.
    std::tie(y, events[2]) = apply_matmul(y, weight2, MATMUL_KERNEL, &events);
    apply_bias(y, bias2, BIAS_SOFTMAX_KERNEL, &events);
    return y;
}
Now we are ready to look at the high-level implementation in main.cpp:
int main(int argc, const char *argv[])
{
    init_kernels();
    auto model = FCNN("weights/");
    auto input = Matrix::from_npy("weights/samples.npy");
    input.to_device();
The first call to init_kernels loads the OpenCL kernels from a separate binary file and stores references to them in global variables.
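A sketch of what init_kernels might do is shown below. It assumes the OpenCL C++ bindings from CL/cl2.hpp, a DeviceHandle with context and device members, and an xclbin file named fcnn.xclbin; all of these details are assumptions for illustration and may differ from the actual code:

#include <fstream>
#include <iterator>
#include <vector>
#include <CL/cl2.hpp>

// Globals referenced by the helper functions in the host code.
cl::Kernel MATMUL_KERNEL, BIAS_RELU6_KERNEL, BIAS_SOFTMAX_KERNEL;

void init_kernels()
{
    // Read the kernel binary produced by v++ into memory.
    std::ifstream file("fcnn.xclbin", std::ios::binary); // hypothetical file name
    std::vector<unsigned char> binary((std::istreambuf_iterator<char>(file)),
                                      std::istreambuf_iterator<char>());

    // Create a program from the binary and one cl::Kernel per kernel function.
    cl::Program program(HANDLE.context, {HANDLE.device}, cl::Program::Binaries{binary});
    MATMUL_KERNEL = cl::Kernel(program, "matmul_kernel");
    BIAS_RELU6_KERNEL = cl::Kernel(program, "bias_relu6_kernel");
    BIAS_SOFTMAX_KERNEL = cl::Kernel(program, "bias_softmax_kernel");
}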
Next, we load the model weights from separate .npy files, one for each tensor.
Finally, we also load the input samples that we will feed into the model.
These .npy files are prepared by the training script and stored as float32 arrays.
Next, we run the model, wait for all OpenCL events to finish, and copy the result back from the device to the host:
    auto result = model(input);
    finish_cl_queue();
    result.to_cpu();
    finish_cl_queue();
Finally, for each element in the batch, we compute the argmax of the confidence scores to get the final prediction and print it to stdout:
    // print argmax result
    for (int i = 0; i < result.rows; i++)
    {
        float maxval = -1;
        int idx = -1;
        for (int j = 0; j < result.cols; j++)
        {
            auto val = result(i, j);
            if (maxval < val)
            {
                idx = j;
                maxval = val;
            }
        }
        std::cout << idx << " ";
    }
    std::cout << std::endl;
}
In the next posts, we will look at the performance characteristics of this baseline implementation and at how to both improve latency and reduce the resource requirements.
Cover Image by Pedant01 CC BY-SA 4.0