Computer Engineering and Applications ›› 2023, Vol. 59 ›› Issue (18): 74-83. DOI: 10.3778/j.issn.1002-8331.2212-0011

• Theory, Research and Development •

Algorithm Compression and Hardware Design Co-Optimization for 3D-CNN

QIAN Jiaming, LOU Wenqi, GONG Lei, WANG Chao, ZHOU Xuehai

  1. School of Computer Science and Technology, University of Science and Technology of China, Hefei 230027, China
  2. Suzhou Institute for Advanced Research, University of Science and Technology of China, Suzhou, Jiangsu 215123, China
  • Online: 2023-09-15 Published: 2023-09-15

Abstract: In recent years, 3D convolutional neural networks (3D-CNNs) have attracted significant attention for their excellent performance in video classification. However, compared with 2D-CNNs, their substantially larger computing and storage requirements inevitably cause performance and energy-efficiency problems during deployment, which severely limits their applicability in scenarios with limited hardware resources. To tackle this challenge, this paper proposes 3D FCirCNN, an algorithm-hardware co-design and optimization method for deploying 3D-CNNs efficiently. At the algorithm level, 3D FCirCNN is the first to compress 3D-CNNs with block circulant matrices, and it further accelerates them with the fast Fourier transform (FFT), significantly reducing the computation and storage overhead of the model while keeping the network structure regular. On this basis, 3D FCirCNN introduces activation, batch normalization, and pooling operations in the frequency domain; the resulting full-frequency-domain inference eliminates the time-domain/frequency-domain switching overhead caused by FFT. At the hardware design level, 3D FCirCNN provides a dedicated accelerator architecture for 3D-CNNs compressed with block circulant matrices, together with a series of optimizations targeting hardware resources and memory bandwidth. Experiments on a Xilinx ZCU102 FPGA show that, compared with the previous state-of-the-art work, 3D FCirCNN achieves a 16.68 times performance improvement and a 16.18 times computational efficiency improvement within an acceptable accuracy loss (<2%).

Key words: 3D convolutional neural networks (3D-CNN), circulant matrix, full frequency domain, field programmable gate array (FPGA)
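
The abstract's central idea, block circulant compression accelerated by FFT, can be illustrated on a plain matrix-vector product. The following is a minimal NumPy sketch under stated assumptions, not the paper's implementation: the abstract does not describe how 3D convolution weights are mapped onto circulant blocks, so the example uses a generic weight matrix split into p x q circulant blocks of size b. The function names (`circulant`, `block_circulant_matvec`, `dense_from_blocks`) and the shapes are hypothetical and chosen only for illustration.

```python
import numpy as np

def circulant(c):
    """Dense b x b circulant matrix whose first column is c (C[i, j] = c[(i - j) mod b])."""
    b = len(c)
    idx = (np.arange(b)[:, None] - np.arange(b)[None, :]) % b
    return c[idx]

def block_circulant_matvec(w, x):
    """Compute y = W @ x for a block circulant W in the frequency domain.

    w: (p, q, b) array; w[i, j] is the first column of circulant block (i, j).
    x: (q * b,) input vector.
    """
    p, q, b = w.shape
    W_f = np.fft.fft(w, axis=-1)                 # FFT of each block's defining vector
    X_f = np.fft.fft(x.reshape(q, b), axis=-1)   # FFT of each input segment
    # A circulant product becomes an element-wise product in the frequency
    # domain; contributions from the q input segments are summed per output block.
    Y_f = np.einsum("pqb,qb->pb", W_f, X_f)
    return np.fft.ifft(Y_f, axis=-1).real.reshape(p * b)

def dense_from_blocks(w):
    """Expand the compressed representation into the full (p*b, q*b) matrix."""
    p, q, _ = w.shape
    return np.block([[circulant(w[i, j]) for j in range(q)] for i in range(p)])

# Illustrative sizes only (assumption): 2 x 3 blocks of size 4.
rng = np.random.default_rng(0)
w = rng.standard_normal((2, 3, 4))
x = rng.standard_normal(3 * 4)

y_fft = block_circulant_matvec(w, x)
W_dense = dense_from_blocks(w)
y_dense = W_dense @ x
```

In this sketch the compressed form stores p·q·b values instead of the p·q·b² entries of the dense matrix, and each block product costs O(b log b) rather than O(b²). Because every block has the same regular structure, the compression does not introduce the irregular sparsity patterns that unstructured pruning does, which is the regularity property the abstract refers to.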