Computer Engineering and Applications ›› 2023, Vol. 59 ›› Issue (22): 315-321. DOI: 10.3778/j.issn.1002-8331.2207-0326

• Engineering and Applications •

Flexible and Efficient Hardware Design for Neural Network Pooling Layer

HE Zeng, ZHU Guoquan, YUE Keqiang   

  1. School of Electronic Information, Hangzhou Dianzi University, Hangzhou 310018, China
  2. Intelligent Computing Hardware Research Center, Zhijiang Laboratory, Hangzhou 311100, China
  • Online: 2023-11-15  Published: 2023-11-15

Abstract: Neural network accelerators have become a research hotspot in recent years, and the pooling layer is an important component of such accelerators. Designing the pooling layer with a dedicated hardware design method is fast and easy to modify, but two problems remain. First, existing pooling designs lack upward compatibility and therefore cannot be adapted to the latest neural networks. Second, existing pooling schemes reuse little data, which results in low pooling performance. To address these problems, a flexible and efficient hardware design for the neural network pooling layer is proposed. The design is implemented in the Verilog hardware description language and covers as many parameters of the pooling algorithm as possible so that it can adapt to the latest neural networks. Two-dimensional splitting and multi-data progressive processing give it high compatibility; a line cache improves its performance; and ping-pong caching, pseudo-padding, and extension of specific pooling kernels further reduce resource usage. The pooling layers of several neural networks were verified experimentally. The results show that, at an operating frequency of 200 MHz and compared with a CPU (AMD TR Pro 3995WX), the design achieves a speedup of up to 536 times for max pooling and up to 11 248 times for average pooling. When running the pooling layers of YOLOv5, it achieves a 27-fold speedup using 3.5 times the resources of a general scheme without data reuse. When running the pooling layers of GoogleNet, it achieves a 555-fold speedup over an HLS design using nearly the same resources.

Key words: flexible and efficient pooling, hardware acceleration, Verilog HDL, data reuse
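
The two-dimensional splitting and data reuse summarized in the abstract can be illustrated with a small software model. The sketch below is written in Python purely for illustration and is not the authors' Verilog design: the function names `max_pool_direct` and `max_pool_split` are hypothetical, and skipping out-of-range taps instead of storing padded values is only one possible reading of "pseudo-padding". It compares a direct k x k window scan with a split version that reduces each input row once and then reuses those row results for every output row that overlaps them.

```python
from typing import List

Matrix = List[List[float]]
NEG_INF = float("-inf")


def out_size(n: int, k: int, stride: int, pad: int) -> int:
    """Standard pooling output length along one dimension."""
    return (n + 2 * pad - k) // stride + 1


def max_pool_direct(x: Matrix, k: int, stride: int, pad: int) -> Matrix:
    """Reference model: scan the full k x k window for every output pixel."""
    h, w = len(x), len(x[0])
    out = []
    for oy in range(out_size(h, k, stride, pad)):
        row = []
        for ox in range(out_size(w, k, stride, pad)):
            best = NEG_INF
            for dy in range(k):
                for dx in range(k):
                    y = oy * stride - pad + dy
                    c = ox * stride - pad + dx
                    # Assumption: padding is never stored; out-of-range taps are skipped.
                    if 0 <= y < h and 0 <= c < w:
                        best = max(best, x[y][c])
            row.append(best)
        out.append(row)
    return out


def max_pool_split(x: Matrix, k: int, stride: int, pad: int) -> Matrix:
    """Split model: a horizontal 1-D max per input row, then a vertical 1-D max."""
    h, w = len(x), len(x[0])
    out_h = out_size(h, k, stride, pad)
    out_w = out_size(w, k, stride, pad)

    # Horizontal pass: each input row is reduced exactly once.
    row_max = []
    for y in range(h):
        reduced = []
        for ox in range(out_w):
            start = ox * stride - pad
            taps = (x[y][c] for c in range(start, start + k) if 0 <= c < w)
            reduced.append(max(taps, default=NEG_INF))
        row_max.append(reduced)

    # Vertical pass: overlapping output rows reuse the stored row results.
    out = []
    for oy in range(out_h):
        start = oy * stride - pad
        rows = [y for y in range(start, start + k) if 0 <= y < h]
        out.append([max((row_max[y][ox] for y in rows), default=NEG_INF)
                    for ox in range(out_w)])
    return out


feature_map = [[1, 2, 3, 4],
               [5, 6, 7, 8],
               [9, 10, 11, 12],
               [13, 14, 15, 16]]
same_size = max_pool_split(feature_map, k=3, stride=1, pad=1)
halved = max_pool_split(feature_map, k=2, stride=2, pad=0)
```

In this model the direct scan needs up to k*k - 1 comparisons per output pixel, while the split version needs at most k - 1 comparisons per stored row result plus k - 1 per output pixel, which hints at why reusing row results pays off most for large kernels with a small stride. The actual hardware savings depend on the buffering scheme described in the paper.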