[1] ANANDKUMAR A, GE R, HSU D, et al. Tensor decompositions for learning latent variable models[J]. Journal of Machine Learning Research, 2014, 15: 2773-2832.
[2] HE K, ZHANG X, REN S, et al. Deep residual learning for image recognition[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016: 770-778.
[3] VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need[C]//Proceedings of the 31st International Conference on Neural Information Processing Systems, 2017: 6000-6010.
[4] KRIZHEVSKY A, SUTSKEVER I, HINTON G E. ImageNet classification with deep convolutional neural networks[J]. Communications of the ACM, 2017, 60(6): 84-90.
[5] JOUPPI N P, YOUNG C, PATIL N, et al. In-datacenter performance analysis of a tensor processing unit[C]//Proceedings of the 44th Annual International Symposium on Computer Architecture, 2017: 1-12.
[6] LIU S, DU Z, TAO J, et al. Cambricon: an instruction set architecture for neural networks[J]. ACM SIGARCH Computer Architecture News, 2016, 44(3): 393-405.
[7] ZHAO Y, DU Z, GUO Q, et al. Cambricon-F: machine learning computers with fractal von Neumann architecture[C]//Proceedings of the 46th International Symposium on Computer Architecture, 2019: 788-801.
[8] CHEN Y H, KRISHNA T, EMER J S, et al. Eyeriss: an energy-efficient reconfigurable accelerator for deep convolutional neural networks[J]. IEEE Journal of Solid-State Circuits, 2017, 52(1): 127-138.
[9] LIAO H, TU J, XIA J, et al. Ascend: a scalable and unified architecture for ubiquitous deep neural network computing: industry track paper[C]//Proceedings of the 2021 IEEE International Symposium on High-Performance Computer Architecture, 2021: 789-801.
[10] VENKATESAN R, SHAO Y S, WANG M, et al. MAGNet: a modular accelerator generator for neural networks[C]//Proceedings of the 2019 IEEE/ACM International Conference on Computer-Aided Design, 2019: 1-8.
[11] LAI Y H, RONG H, ZHENG S, et al. SuSy: a programming model for productive construction of high-performance systolic arrays on FPGAs[C]//Proceedings of the 39th International Conference on Computer-Aided Design, 2020: 1-9.
[12] GENC H, KIM S, AMID A, et al. Gemmini: enabling systematic deep-learning architecture evaluation via full-stack integration[C]//Proceedings of the 2021 58th ACM/IEEE Design Automation Conference, 2021: 769-774.
[13] WANG J, GUO L, CONG J. AutoSA: a polyhedral compiler for high-performance systolic arrays on FPGA[C]//Proceedings of the 2021 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, 2021: 93-104.
[14] SAMAJDAR A, JOSEPH J M, ZHU Y, et al. A systematic methodology for characterizing scalability of DNN accelerators using SCALE-Sim[C]//Proceedings of the 2020 IEEE International Symposium on Performance Analysis of Systems and Software, 2020: 58-68.
[15] MUNOZ-MARTINEZ F, ABELLAN J L, ACACIO M E, et al. STONNE: enabling cycle-level microarchitectural simulation for DNN inference accelerators[C]//Proceedings of the 2021 IEEE International Symposium on Workload Characterization, 2021: 201-213.
[16] PARASHAR A, RAINA P, SHAO Y S, et al. Timeloop: a systematic approach to DNN accelerator evaluation[C]//Proceedings of the 2019 IEEE International Symposium on Performance Analysis of Systems and Software, 2019: 304-315.
[17] YANG X, GAO M, LIU Q, et al. Interstellar: using Halide's scheduling language to analyze DNN accelerators[C]//Proceedings of the 25th International Conference on Architectural Support for Programming Languages and Operating Systems, 2020: 369-383.
[18] HUANG Q, KANG M, DINH G, et al. CoSA: scheduling by constrained optimization for spatial accelerators[C]//Proceedings of the 2021 ACM/IEEE 48th Annual International Symposium on Computer Architecture, 2021: 554-566.
[19] KWON H, CHATARASI P, PELLAUER M, et al. Understanding reuse, performance, and hardware cost of DNN dataflow: a data-centric approach[C]//Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture, 2019: 754-768.
[20] LU L, GUAN N, WANG Y, et al. TENET: a framework for modeling tensor dataflow based on relation-centric notation[C]//Proceedings of the 2021 ACM/IEEE 48th Annual International Symposium on Computer Architecture, 2021: 720-733.
[21] 钱佳明, 娄文启, 宫磊, 等. 面向3D-CNN的算法压缩-硬件设计协同优化[J]. 计算机工程与应用, 2023, 59(18): 74-83.
QIAN J M, LOU W Q, GONG L, et al. Algorithm compression and hardware design co-optimization for 3D-CNN[J]. Computer Engineering and Applications, 2023, 59(18): 74-83.
[22] 陈云霁, 李玲, 李威, 等. 智能计算系统[M]. 北京: 机械工业出版社, 2020: 234-238.
CHEN Y J, LI L, LI W, et al. AI computing systems[M]. Beijing: China Machine Press, 2020: 234-238.
[23] MARCHISIO A, HANIF M A, SHAFIQUE M. CapsAcc: an efficient hardware accelerator for CapsuleNets with data reuse[C]//Proceedings of the 2019 Design, Automation & Test in Europe Conference & Exhibition, 2019: 964-967.
[24] MARCHISIO A, HANIF M A, TEIMOORI M T, et al. CapStore: energy-efficient design and management of the on-chip memory for CapsuleNet inference accelerators[J]. arXiv:1902.01151, 2019.
[25] MARCHISIO A, MRAZEK V, HANIF M A, et al. DESCNet: developing efficient scratchpad memories for capsule network hardware[J]. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 2020, 40(9): 1768-1781.
[26] MOONS B, VERHELST M. A 0.3-2.6 TOPS/W precision-scalable processor for real-time large-scale ConvNets[C]//Proceedings of the 2016 IEEE Symposium on VLSI Circuits, 2016: 1-2.
[27] YIN S, OUYANG P, TANG S, et al. A high energy efficient reconfigurable hybrid neural network processor for deep learning applications[J]. IEEE Journal of Solid-State Circuits, 2017, 53(4): 968-982.
[28] GENC H, HAJ-ALI A, IYER V, et al. Gemmini: an agile systolic array generator enabling systematic evaluations of deep-learning architectures[J]. arXiv:1911.09925, 2019.
[29] PARASHAR A, RHU M, MUKKARA A, et al. SCNN: an accelerator for compressed-sparse convolutional neural networks[J]. ACM SIGARCH Computer Architecture News, 2017, 45(2): 27-40.
[30] DU Z, FASTHUBER R, CHEN T, et al. ShiDianNao: shifting vision processing closer to the sensor[C]//Proceedings of the 42nd Annual International Symposium on Computer Architecture, 2015: 92-104.
[31] SIJSTERMANS F. The NVIDIA deep learning accelerator[J]. Hot Chips, 2018, 30: 19-21.
[32] WANG C, GONG L, YU Q, et al. DLAU: a scalable deep learning accelerator unit on FPGA[J]. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 2016, 36(3): 513-517.
[33] GONG L, WANG C, LI X, et al. MALOC: a fully pipelined FPGA accelerator for convolutional neural networks with all layers mapped on chip[J]. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 2018, 37(11): 2601-2612.