Winograd Neural Network Accelerator Using Dynamic Hardware Reconfiguration on FPGA Platform

doi:10.3778/j.issn.1002-8331.2307-0257

Abstract: To address the low resource utilization and resource constraints that convolutional neural networks (CNNs) face in FPGA-based hardware acceleration, this paper proposes a CNN accelerator based on FPGA dynamic partial reconfiguration and Winograd fast convolution. The accelerator time-multiplexes on-chip FPGA resources through runtime hardware reconfiguration, dynamically configuring the computation pipeline stages onto the FPGA in a pipelined manner. The convolution core of each pipeline stage is customized with the Winograd algorithm to maximize the utilization of computing resources while overcoming the resource limitation. For the proposed architecture, the paper further establishes a combinatorial optimization model to search for the optimal parallel strategy for deploying a specific network model on a particular FPGA hardware platform, and uses a genetic algorithm to explore the design space. The VGG-16 network model is deployed and analyzed on the Xilinx VC709 FPGA platform. The comprehensive simulation results show that the proposed design method can adaptively implement large-scale neural network models on resource-limited FPGAs. The overall performance of the accelerator reaches 1078.3 GOPS, and its performance and computing resource utilization efficiency improve on those of previous accelerators by 2.2 times and 3.62 times, respectively.

Key words: convolutional neural network, dynamic partial hardware reconfiguration, field programmable gate array (FPGA), hardware accelerator, Winograd fast convolution
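
For readers unfamiliar with Winograd fast convolution, the following is a minimal sketch of the standard one-dimensional F(2,3) form, which produces two outputs of a 3-tap filter with 4 multiplications instead of the 6 needed by direct computation. It is only an illustration of the general technique: the abstract does not state which tile sizes the accelerator uses, and the function names `direct_f23` and `winograd_f23` are hypothetical.

```python
def direct_f23(d, g):
    """Direct computation: 2 outputs of a 3-tap filter over 4 inputs (6 multiplications)."""
    return [
        d[0] * g[0] + d[1] * g[1] + d[2] * g[2],
        d[1] * g[0] + d[2] * g[1] + d[3] * g[2],
    ]


def winograd_f23(d, g):
    """Winograd F(2,3): the same 2 outputs using 4 element-wise multiplications."""
    # Filter transform; in practice this can be precomputed once per filter.
    u = [
        g[0],
        (g[0] + g[1] + g[2]) / 2,
        (g[0] - g[1] + g[2]) / 2,
        g[2],
    ]
    # Input transform; uses only additions and subtractions.
    v = [
        d[0] - d[2],
        d[1] + d[2],
        d[2] - d[1],
        d[1] - d[3],
    ]
    # Element-wise products: the only 4 multiplications on the data path.
    m = [u[i] * v[i] for i in range(4)]
    # Output transform; again additions and subtractions only.
    return [m[0] + m[1] + m[2], m[1] - m[2] - m[3]]


if __name__ == "__main__":
    d = [1.0, 2.0, 3.0, 4.0]
    g = [0.5, 1.0, 2.0]
    print(direct_f23(d, g))
    print(winograd_f23(d, g))
```

The saving in multiplications is what makes the approach attractive on FPGAs, where multipliers (DSP slices) are typically the scarcer computing resource compared with adders.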

MEI Bingxiao, TENG Wenbin, ZHANG Chi, WANG Wenhao, LI Fuqiang, YUAN Fuli. Winograd Neural Network Accelerator Using Dynamic Hardware Reconfiguration on FPGA Platform[J]. Computer Engineering and Applications, 2024, 60(22): 323-334.
