Computer Engineering and Applications, 2021, Vol. 57, Issue (13): 77-84. DOI: 10.3778/j.issn.1002-8331.2010-0223

• Theory, Research and Development •


Design Method of Convolutional Neural Network Accelerator

SUN Ming, CHEN Xin   

  1. School of Electronics and Information Engineering, Tongji University, Shanghai 201804, China
  • Online: 2021-07-01  Published: 2021-06-29


Abstract:

To meet the low-latency, small-size, and high-throughput requirements of Convolutional Neural Network (CNN) inference in practical applications, an accelerator is designed with the following optimizations: to cope with the limited off-chip memory access bandwidth, the loop tiling factors are determined through design space exploration so as to maximize data reuse; to match the high computational density of CNNs, loop unrolling is applied to fully exploit four kinds of computing parallelism; and techniques such as a memory pool, ping-pong caching, and dynamic data quantization are used to manage on-chip and off-chip storage resources. In addition, the accelerator generation flow is packaged as a CNN acceleration framework. Finally, the generated accelerator is used to implement the AlexNet network. Simulation results show that the design reaches a peak computing throughput of 1,493.4 Gops, as much as 24.2 times that of the compared works, and its DSP efficiency is at least 1.2 times that of the other design methods. The proposed method thus enables rapid CNN deployment with high development efficiency and excellent acceleration performance.
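The loop tiling and loop unrolling mentioned above can be made concrete with a small example. The following HLS-style C sketch is illustrative only and is not taken from the paper: it tiles the input- and output-channel loops of a convolution layer by hypothetical factors Tn and Tm (so that each tile can be held in on-chip buffers and reused across the kernel and pixel loops), and its innermost tile loops are the ones that would be fully unrolled in hardware to obtain Tm x Tn parallel multiply-accumulate units. All layer dimensions are hypothetical, and the output array is assumed to be zero-initialized.

/* Illustrative sketch only -- not code from the paper.  Shows loop tiling
 * (outer mo/no loops step by Tm/Tn) and loop unrolling (inner mi/ni loops)
 * for one convolution layer, in the style of an HLS accelerator kernel.
 * All sizes and tile factors below are hypothetical. */

#define N   16          /* input feature maps                     */
#define M   16          /* output feature maps                    */
#define R   32          /* output rows                            */
#define C   32          /* output columns                         */
#define K   3           /* kernel size                            */
#define Tn  4           /* tile factor over input channels        */
#define Tm  4           /* tile factor over output channels       */

void conv_tiled(const float in[N][R + K - 1][C + K - 1],
                const float w[M][N][K][K],
                float out[M][R][C])   /* assumed zero-initialized */
{
    /* Outer loops step over channel tiles; in a real accelerator each tile
     * of `in`, `w` and `out` would first be copied into on-chip (ping-pong)
     * buffers so the inner loops never touch off-chip memory. */
    for (int mo = 0; mo < M; mo += Tm) {
        for (int no = 0; no < N; no += Tn) {
            for (int r = 0; r < R; r++) {
                for (int c = 0; c < C; c++) {
                    for (int kr = 0; kr < K; kr++) {
                        for (int kc = 0; kc < K; kc++) {
                            /* Inner loops over the tile: fully unrolled in
                             * hardware (e.g. #pragma HLS UNROLL), giving
                             * Tm x Tn parallel multiply-accumulate units. */
                            for (int mi = 0; mi < Tm; mi++) {
                                for (int ni = 0; ni < Tn; ni++) {
                                    out[mo + mi][r][c] +=
                                        w[mo + mi][no + ni][kr][kc] *
                                        in[no + ni][r + kr][c + kc];
                                }
                            }
                        }
                    }
                }
            }
        }
    }
}

In a design space exploration flow such as the one the abstract describes, candidate (Tm, Tn) pairs would be evaluated against the available on-chip buffer capacity, DSP count, and off-chip bandwidth, and the pair that maximizes data reuse under those constraints would be selected.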

Key words: Convolutional Neural Network (CNN), accelerator, computing parallelism, design space exploration, ping-pong cache, data reuse