[1] UMUROGLU Y, RASNAYAKE L, SJÄLANDER M. BISMO: a scalable bit-serial matrix multiplication overlay for reconfigurable computing[C]//2018 28th International Conference on Field Programmable Logic and Applications (FPL), 2018.
[2] RYU S, KIM H, YI W, et al. BitBlade: area and energy-efficient precision-scalable neural network accelerator with bitwise summation[C]//Proceedings of the 56th Annual Design Automation Conference, 2019: 1-6.
[3] YANG Q, LI H. BitSystolic: a 26.7 TOPS/W 2b~8b NPU with configurable data flows for edge devices[J]. IEEE Transactions on Circuits and Systems I: Regular Papers, 2021, 68(3): 1134-1145.
[4] SHARMA H, PARK J, SUDA N, et al. Bit fusion: bit-level dynamically composable architecture for accelerating deep neural networks[C]//2018 ACM/IEEE 45th Annual International Symposium on Computer Architecture (ISCA), 2018: 764-775.
[5] JUDD P, ALBERICIO J, HETHERINGTON T, et al. Stripes: bit-serial deep neural network computing[C]//2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), 2016: 1-12.
[6] GHOLAMI A, KIM S, DONG Z, et al. A survey of quantization methods for efficient neural network inference[J]. arXiv:2103.13630, 2021.
[7] PARASHAR A, RAINA P, SHAO Y S, et al. Timeloop: a systematic approach to DNN accelerator evaluation[C]//2019 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), 2019: 304-315.
[8] KWON H, CHATARASI P, PELLAUER M, et al. Understanding reuse, performance, and hardware cost of DNN dataflow: a data-centric approach[C]//Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), 2019: 754-768.
[9] KWON H, CHATARASI P, SARKAR V, et al. MAESTRO: a data-centric approach to understand reuse, performance, and hardware cost of DNN mappings[J]. IEEE Micro, 2020, 40(3): 20-29.
[10] LU L, GUAN N, WANG Y, et al. TENET: a framework for modeling tensor dataflow based on relation-centric notation[C]//2021 ACM/IEEE 48th Annual International Symposium on Computer Architecture (ISCA), 2021: 720-733.
[11] IBRAHIM E M, MEI L, VERHELST M. Taxonomy and benchmarking of precision-scalable MAC arrays under enhanced DNN dataflow representation[J]. IEEE Transactions on Circuits and Systems I: Regular Papers, 2022, 69(5): 2013-2024.
[12] CHEN Y, LUO T, LIU S, et al. DaDianNao: a machine-learning supercomputer[C]//2014 47th Annual IEEE/ACM International Symposium on Microarchitecture, 2014: 609-622.
[13] LI S, CHEN K, AHN J H, et al. CACTI-P: architecture-level modeling for SRAM-based structures with advanced leakage reduction techniques[C]//2011 IEEE/ACM International Conference on Computer-Aided Design (ICCAD), 2011: 694-701.
[14] LECUN Y, BOTTOU L, BENGIO Y, et al. Gradient-based learning applied to document recognition[J]. Proceedings of the IEEE, 1998, 86(11): 2278-2324.
[15] NETZER Y, WANG T, COATES A, et al. Reading digits in natural images with unsupervised feature learning[C]//NIPS Workshop on Deep Learning and Unsupervised Feature Learning, 2011.
[16] KRIZHEVSKY A, SUTSKEVER I, HINTON G E. ImageNet classification with deep convolutional neural networks[J]. Communications of the ACM, 2017, 60(6): 84-90.
[17] HE K, ZHANG X, REN S, et al. Deep residual learning for image recognition[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016: 770-778.
[18] KRIZHEVSKY A, HINTON G. Learning multiple layers of features from tiny images[R]. Toronto: University of Toronto, 2009.
[19] SIMONYAN K, ZISSERMAN A. Very deep convolutional networks for large-scale image recognition[J]. arXiv:1409.1556, 2014.
[20] LIPTON Z C, BERKOWITZ J, ELKAN C. A critical review of recurrent neural networks for sequence learning[J]. arXiv:1506.00019, 2015.
[21] HUBARA I, COURBARIAUX M, SOUDRY D, et al. Quantized neural networks: training neural networks with low precision weights and activations[J]. The Journal of Machine Learning Research, 2017, 18(1): 6869-6898.
[22] MISHRA A, NURVITADHI E, COOK J J, et al. WRPN: wide reduced-precision networks[J]. arXiv:1709.01134, 2017.
[23] GHODRATI S, SHARMA H, YOUNG C, et al. Bit-parallel vector composability for neural acceleration[C]//2020 57th ACM/IEEE Design Automation Conference (DAC), 2020: 1-6.