Overview of FPGA-Based Deep Learning Accelerators: Challenges and Opportunities

FPGA-based neural network accelerators are gaining traction in the AI community, and this article provides an overview of the opportunities and challenges of FPGA-based deep learning accelerators.

In recent years, neural networks have made great progress over traditional algorithms in many fields. In image, video, and speech processing, a variety of network models have been proposed, such as convolutional neural networks (CNNs) and recurrent neural networks (RNNs). Well-trained CNN models raised top-5 image classification accuracy on the ImageNet dataset from 73.8% to 84.7% and, thanks to their strong feature extraction, also improved object detection accuracy. RNNs have set state-of-the-art word error rates in speech recognition. In short, neural networks have become strong candidates for many artificial intelligence applications because they adapt well to a wide range of pattern recognition problems.

However, neural network models remain computationally and storage intensive, and research continues to push model scale upward. For example, a state-of-the-art CNN model for 224×224 image classification requires 39 billion floating-point operations (FLOPs) and over 500 MB of model parameters. Since computational complexity is proportional to the number of input pixels, processing a high-resolution image can require more than 100 billion operations.
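To make the scaling claim concrete, here is a back-of-the-envelope sketch in Python. It assumes, as the text states, that FLOPs grow linearly with pixel count from the 39 GFLOP baseline at 224×224; the outputs are rough estimates, not measurements.

```python
# Rough scaling of CNN compute with input resolution, assuming FLOPs
# grow linearly with pixel count (per the text above). The 39 GFLOP
# figure at 224x224 is the baseline quoted in this article.

BASE_FLOPS = 39e9            # FLOPs for one 224x224 classification pass
BASE_PIXELS = 224 * 224

def estimated_flops(height: int, width: int) -> float:
    """Estimate FLOPs for one forward pass at the given resolution."""
    return BASE_FLOPS * (height * width) / BASE_PIXELS

for h, w in [(224, 224), (512, 512), (1920, 1080)]:
    print(f"{h}x{w}: {estimated_flops(h, w) / 1e9:.0f} GFLOPs")
# 1920x1080 comes out above 1,600 GFLOPs -- well past 100 billion.
```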

Therefore, choosing a suitable computing platform for neural network applications is particularly important. Generally speaking, a CPU can perform 10-100 GFLOPS, but its energy efficiency is usually below 1 GOP/J, so it struggles to meet both the high performance demanded by cloud applications and the low energy budget of mobile applications. In contrast, GPUs offer peak performance of up to 10 TOPS, making them an excellent choice for high-performance neural network applications. Moreover, programming frameworks such as Caffe and TensorFlow provide easy-to-use GPU interfaces, making GPUs the first choice for neural network acceleration.
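Energy efficiency here is simply throughput divided by power: GOP/s over watts yields GOP/J. The sketch below illustrates the comparison with placeholder numbers chosen to fall within the ranges quoted above; they are not measurements of any particular device.

```python
# GOP/J = (GOP/s) / W. All numbers are illustrative placeholders within
# the ranges quoted in the text, not benchmarks of specific hardware.

platforms = {
    "CPU":  {"gops": 100.0,    "watts": 130.0},  # ~100 GOPS, desktop TDP
    "GPU":  {"gops": 10_000.0, "watts": 250.0},  # ~10 TOPS peak
    "FPGA": {"gops": 1_000.0,  "watts": 25.0},   # mid-range accelerator
}

for name, p in platforms.items():
    print(f"{name}: {p['gops'] / p['watts']:.2f} GOP/J")
# The CPU lands below 1 GOP/J, matching the figure quoted above.
```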

In addition to CPUs and GPUs, FPGAs are emerging as a platform of choice for energy-efficient neural network processing. By tailoring the hardware to the computational flow of a specific model, an FPGA can achieve high parallelism with simplified logic. Several studies have shown that neural network models can be simplified in hardware-friendly ways without hurting accuracy. As a result, FPGAs can achieve higher energy efficiency than CPUs and GPUs.

Back in the 1990s, when FPGAs were still a young technology, they were designed not for neural networks but for rapid prototyping of electronic hardware. With the emergence of neural networks, people began to explore such applications, though the direction of development was still unclear. Although D. S. Reay first used an FPGA to implement neural network acceleration in 1994, the technique received little attention because neural networks themselves were still immature. It was not until AlexNet won the ILSVRC challenge in 2012 that the direction became clear and the research community began developing deeper and more complex networks. Models such as VGGNet, GoogLeNet, and ResNet followed, and the trend toward ever more complex neural networks became evident. Researchers then began to take notice of FPGA-based neural network accelerators, as shown in Figure 1 below. By 2018, the number of FPGA-based neural network accelerator papers published on IEEE Xplore had reached 69 and counting, which is enough to illustrate the research trend in this direction.


Figure 1: History of FPGA-based neural network accelerator development

Paper: A Survey of FPGA Based Deep Learning Accelerators: Challenges and Opportunities

Paper address: https://arxiv.org/abs/1901.04988

Abstract: With the rapid development of deep learning, neural networks and deep learning algorithms have been widely used in fields such as image, video, and speech processing. However, neural network models are also growing larger, which is reflected in the volume of model parameters and computation. Although significant efforts on GPU platforms have improved computational performance, dedicated hardware solutions remain essential and are showing advantages over pure software solutions. In this paper, the authors systematically survey FPGA-based neural network accelerators. Specifically, they review accelerators designed for specific problems, specific algorithms, algorithmic features, and general templates; they also compare the design and implementation of FPGA-based accelerators across different devices and network models, and against their CPU and GPU counterparts. Finally, the authors discuss the advantages and disadvantages of accelerators on FPGA platforms and explore opportunities for future research.

Figure 2: Comparison of different data quantization methods

Table 1: Performance comparison of different models on different platforms

Opportunities and Challenges

As early as the 1960s, Gerald Estrin proposed the concept of reconfigurable computing, but it was not until 1985 that Xilinx introduced the first FPGA chips. Although the FPGA platform excels in parallelism and power consumption, it long attracted little attention due to its high reconfiguration cost and programming complexity. With the continued development of deep learning, the high parallelism inherent in its workloads has drawn more and more researchers into FPGA-based deep learning accelerator research, a clear trend of the times.

Advantages of FPGA-Based Accelerators

1) High performance, low energy consumption: The energy-efficiency advantage of FPGAs cannot be underestimated and has been demonstrated by many previous studies. As Table 1 shows, FPGA platforms can reach tens of times the GOP/J of CPU platforms, and even the weakest FPGA results are on the same level as GPU performance. This is enough to illustrate the energy-efficiency advantage of FPGA-based neural network accelerators.

2) High parallelism: High parallelism is the main reason for choosing the FPGA platform to accelerate deep learning. Because an FPGA consists of programmable logic units, the hardware can easily be optimized with parallelized algorithms to achieve a high degree of parallelism (a software analogy follows after this list).

3) Flexibility: Thanks to its reconfigurability, an FPGA can adapt to complex engineering environments. For example, experiments may reveal, after the hardware and application designs are complete, that performance falls short of expectations. Reconfigurability allows FPGA-based hardware accelerators to absorb frequent design changes and meet the evolving needs of users. This flexibility is a highlight of FPGA platforms compared with ASIC platforms.

4) Security: Today's AI era demands ever more data for training, so data security is increasingly important, and so is the security of the computers that carry that data. When computer security comes up, antivirus software is usually what comes to mind, but such software can only defend passively and cannot eliminate security risks. Addressing security at the hardware-architecture level can do better.
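As a software analogy for the parallelism described in item 2, the sketch below unrolls the inner loop of a dot product by a factor P, the way a spatial FPGA design would instantiate P parallel multiply-accumulate (MAC) units. The unroll factor and vector length are illustrative choices, not values from any surveyed design.

```python
# Software analogy for FPGA loop unrolling: each iteration of the outer
# loop stands in for one clock cycle in which `unroll` MAC units fire
# simultaneously. Purely illustrative.

def dot_unrolled(weights, activations, unroll=8):
    """Inner product finished in ceil(N / unroll) simulated cycles."""
    assert len(weights) == len(activations)
    acc, cycles = 0, 0
    for i in range(0, len(weights), unroll):
        # On an FPGA, these `unroll` MACs would share one clock cycle.
        acc += sum(w * a for w, a in
                   zip(weights[i:i + unroll], activations[i:i + unroll]))
        cycles += 1
    return acc, cycles

result, cycles = dot_unrolled(list(range(64)), list(range(64)))
print(result, cycles)   # 64 MACs complete in 8 simulated "cycles"
```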

Disadvantages of FPGA-Based Accelerators

1) Reconfigurability cost: The reconfigurability of the FPGA platform is a double-edged sword. While it offers many conveniences, the time consumed by reconfiguring between designs cannot be ignored, typically tens of minutes to several hours. Reconfiguration comes in two forms: static and dynamic. Static reconfiguration, also called compile-time reconfiguration, configures the hardware for one or more system functions before a task runs and locks it until the task completes. Dynamic reconfiguration, also called runtime reconfiguration, operates in a context-switching mode: during task execution, hardware modules are reconfigured as needed, which is prone to delays that increase runtime.

2) Programming difficulty: Although the concept of reconfigurable computing was proposed long ago and many mature works exist, reconfigurable computing never became popular, for two main reasons:

First, the four decades from the advent of reconfigurable computing to the beginning of the 21st century were the golden age of Moore's Law, when process technology advanced roughly every eighteen months; the performance gains from architectural innovation were neither as direct nor as powerful as those from process scaling;

Second, for mature systems, traditional CPU programming uses high-level, abstract programming languages, whereas reconfigurable computing requires hardware programming, and the commonly used hardware description languages (Verilog, VHDL) take programmers a long time to master.

Outlook

Although FPGA-based neural network accelerators still have problems of one kind or another, their future development remains promising. The following directions merit further study:

Optimizing other parts of the computation. Mainstream research currently focuses on the matrix-operation loops, while the computation of activation functions is rarely addressed (see the first sketch after this list).

Access optimization. Further methods for optimizing data access remain to be studied.

Data optimization. Lower-bit data naturally improves platform performance, but most low-bit designs force weights and neurons to share the same bit width. As Figure 2 suggests, nonlinear quantization mappings can also help bridge differing bit widths, so a better balance should be explored (see the second sketch after this list).

Frequency optimization. Most FPGA platforms currently run at 100-300 MHz, although the theoretical operating frequency of FPGAs can be much higher. The frequency is mainly limited by the routing between on-chip SRAM and the DSP units. Future research should investigate whether this limitation can be avoided or overcome.

FPGA fusion. According to the results reported in reference 37 of the paper, multi-FPGA clusters can achieve better results if the planning and allocation problems are solved well. There is currently little research in this direction, so it is well worth exploring further.

Automatic configuration. To address the difficulty of programming FPGA platforms, a user-friendly automatic deployment framework, analogous to NVIDIA CUDA, would certainly broaden the range of applications.
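On the activation-function direction above: a common FPGA-friendly approach replaces the exponential in sigmoid with a small precomputed lookup table read from on-chip memory. A minimal sketch, assuming an illustrative 256-entry table over [-8, 8); the table size and input range are choices made for this example, not values from the surveyed papers.

```python
import numpy as np

# Sigmoid via lookup table: clamp the input, index the table, read one
# entry -- roughly one BRAM access in hardware instead of evaluating
# exp(). Table size and range are illustrative.

TABLE_SIZE, X_MIN, X_MAX = 256, -8.0, 8.0
_xs = np.linspace(X_MIN, X_MAX, TABLE_SIZE)
SIGMOID_LUT = 1.0 / (1.0 + np.exp(-_xs))    # precomputed offline

def sigmoid_lut(x: float) -> float:
    """Approximate sigmoid(x) with a single table lookup."""
    x = min(max(x, X_MIN), X_MAX)
    idx = int((x - X_MIN) / (X_MAX - X_MIN) * (TABLE_SIZE - 1))
    return float(SIGMOID_LUT[idx])

print(sigmoid_lut(0.0), sigmoid_lut(2.0))   # ~0.5, ~0.88
```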
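On the data-optimization direction: the sketch below contrasts a uniform (linear) quantizer with a logarithmic one of the kind Figure 2 alludes to. With log quantization, every weight becomes a signed power of two, so hardware multiplies reduce to bit shifts; the bit width and toy weight distribution are illustrative assumptions.

```python
import numpy as np

def quantize_linear(w, bits=8):
    """Uniform quantizer: equal step size across the weight range."""
    scale = np.abs(w).max() / (2 ** (bits - 1) - 1)
    return np.round(w / scale) * scale

def quantize_log(w):
    """Log quantizer: snap each weight to the nearest signed power of
    two, so a hardware multiply becomes a shift."""
    sign = np.sign(w)
    exp = np.round(np.log2(np.abs(w) + 1e-12))   # avoid log2(0)
    return sign * 2.0 ** exp

rng = np.random.default_rng(0)
w = rng.normal(scale=0.1, size=1000)             # toy weight distribution
for name, q in (("linear", quantize_linear(w)), ("log", quantize_log(w))):
    print(f"{name:6s} mean abs error: {np.mean(np.abs(w - q)):.4f}")
# Linear is more accurate at equal bit width; log buys cheaper arithmetic.
```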
