Flexible and Efficient Hardware Design for Neural Network Pooling Layer

doi:10.3778/j.issn.1002-8331.2207-0326

Abstract

Neural network accelerators have become a research hotspot in recent years, and the pooling layer is an important component of such accelerators. Designing the pooling layer with dedicated hardware design methods is fast and easy to modify, but two problems remain: existing pooling designs lack upward compatibility and therefore cannot be adapted to the latest neural networks, and they reuse little data, which limits pooling performance. To address these problems, this paper proposes a flexible and efficient hardware design for the neural network pooling layer. The design is implemented in the Verilog hardware description language and covers as many parameters of the pooling algorithm as possible so that it can adapt to the latest neural networks. Two-dimensional splitting and multi-data progressive processing give it high compatibility; line buffers improve its performance; and ping-pong buffering, pseudo-padding, and extension of specific pooling kernels further reduce resource usage. The pooling layers of several neural networks were verified experimentally. At an operating frequency of 200 MHz, the design achieves a speedup of up to 536 times over a CPU (AMD TR Pro 3995WX) for max pooling and up to 11,248 times for average pooling. On the pooling layers of YOLOv5, it achieves a 27-fold speedup with 3.5 times the resources of a general scheme without data reuse. On the pooling layers of GoogleNet, it achieves a 555-fold speedup over an HLS design with nearly the same resources.

Key words: flexible and efficient pooling, hardware acceleration, Verilog HDL, data reuse

HE Zeng, ZHU Guoquan, YUE Keqiang. Flexible and Efficient Hardware Design for Neural Network Pooling Layer[J]. Computer Engineering and Applications, 2023, 59(22): 315-321.
