Algorithm Compression and Hardware Design Co-Optimization for 3D-CNN

doi:10.3778/j.issn.1002-8331.2212-0011

Abstract

In recent years, 3D convolutional neural networks (3D-CNNs) have attracted wide attention for their excellent performance in video classification. However, compared with 2D-CNNs, the much larger computing and storage requirements of 3D-CNNs inevitably cause performance and energy-efficiency problems during deployment, which severely limits their applicability in scenarios with limited hardware resources. To tackle this challenge, this paper proposes 3D FCirCNN, an algorithm-hardware co-design and optimization method for deploying 3D-CNNs efficiently. At the algorithm level, 3D FCirCNN is the first to compress 3D-CNNs with block-circulant matrices, and it further accelerates them with the fast Fourier transform (FFT), significantly reducing the computation and storage overhead of the model while keeping the network structure regular. On this basis, 3D FCirCNN introduces activation, batch normalization, and pooling operations in the frequency domain; the resulting full-frequency-domain inference eliminates the frequent time-domain/frequency-domain switching overhead caused by the FFT. At the hardware level, 3D FCirCNN provides a dedicated accelerator architecture for the block-circulant-compressed 3D-CNN, together with a series of optimizations targeting hardware resources and memory bandwidth. Experiments on a Xilinx ZCU102 FPGA show that, compared with the previous state-of-the-art work, 3D FCirCNN achieves a 16.68× performance improvement and a 16.18× computational-efficiency improvement within an acceptable accuracy loss (<2%).

Key words: 3D convolutional neural networks (3D-CNN), circulant matrix, full frequency domain, field-programmable gate array (FPGA)
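
The block-circulant compression and FFT acceleration mentioned in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the `(p, q, b)` weight layout, and the choice of defining each block by its first column are assumptions made for illustration. It shows only the general idea that a weight matrix built from b×b circulant blocks stores b values per block instead of b², and that multiplying by each block reduces to an element-wise product in the frequency domain.

```python
import numpy as np


def circulant(c):
    """Dense circulant matrix whose first column is c (illustrative helper)."""
    n = len(c)
    # Column j is c cyclically shifted down by j positions.
    return np.stack([np.roll(c, j) for j in range(n)], axis=1)


def dense_from_blocks(W):
    """Expand compressed weights W of shape (p, q, b) into a (p*b, q*b) matrix."""
    p, q, _ = W.shape
    return np.block([[circulant(W[i, j]) for j in range(q)] for i in range(p)])


def block_circulant_matvec(W, x):
    """Multiply a block-circulant matrix by x using FFTs.

    W[i, j] holds the first column of circulant block (i, j); x has length q*b.
    """
    p, q, b = W.shape
    Xf = np.fft.rfft(x.reshape(q, b), axis=1)   # FFT of each input segment
    Wf = np.fft.rfft(W, axis=2)                 # FFT of each block's defining vector
    # Circular convolution becomes an element-wise product; sum over input blocks.
    Yf = np.einsum("ijk,jk->ik", Wf, Xf)
    return np.fft.irfft(Yf, n=b, axis=1).reshape(p * b)


rng = np.random.default_rng(0)
W = rng.standard_normal((2, 3, 4))   # 2x3 grid of 4x4 circulant blocks
x = rng.standard_normal(12)
y_fft = block_circulant_matvec(W, x)
y_dense = dense_from_blocks(W) @ x
```

In this toy setting the compressed weights hold 24 values while the equivalent dense matrix holds 96, so storage shrinks by the block size b = 4, and `y_fft` agrees with `y_dense` up to floating-point error. How 3D FCirCNN maps 3D convolution layers onto such blocks is described in the paper itself, not in this sketch.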

QIAN Jiaming, LOU Wenqi, GONG Lei, WANG Chao, ZHOU Xuehai. Algorithm Compression and Hardware Design Co-Optimization for 3D-CNN[J]. Computer Engineering and Applications, 2023, 59(18): 74-83.
