Computer Engineering and Applications ›› 2021, Vol. 57 ›› Issue (4): 252-257. DOI: 10.3778/j.issn.1002-8331.1912-0099

• Engineering and Applications •


Design of Hardware Accelerator for Embedded Convolutional Neural Network

TANG Rui, JIAO Jiye, XU Huahao   

  1. School of Computer Science & Technology, Xi’an University of Posts & Telecommunications, Xi’an 710121, China
  • Online: 2021-02-15  Published: 2021-02-06


Abstract:

In recent years, neural network models have become increasingly complex, and the large memory footprint of convolutional neural network inference limits deployment on embedded devices. To address this, a dynamic multi-precision fixed-point data quantization hardware structure is proposed, which replaces floating-point data with fixed-point data when performing convolutional operations during post-training inference. The results show that, compared with a static quantization strategy, the 16-bit dynamic fixed-point quantization and parallel convolutional operation hardware architecture achieves a data accuracy of up to 97.96%, while the hardware unit occupies only 13,740 gates and the memory footprint and bandwidth requirement are halved. In addition, compared with a Cortex-M4 performing convolutional operations on floating-point data, the hardware acceleration unit improves performance by more than 90%.
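The core idea of dynamic fixed-point quantization, as described in the abstract, is to choose the split between integer and fractional bits per group of values (e.g. per layer) according to that group's dynamic range, rather than fixing one format for the whole network. The sketch below is a minimal software illustration of that idea in Python; it is not the paper's hardware design, and the function names and 16-bit default word length are assumptions for demonstration only.

```python
import math

def dynamic_fixed_point_quantize(values, word_bits=16):
    """Quantize floats to signed fixed-point of word_bits total width,
    choosing the fractional length from the group's dynamic range."""
    max_abs = max(abs(v) for v in values)
    # Integer bits needed to represent max_abs (one bit reserved for sign).
    int_bits = math.floor(math.log2(max_abs)) + 1 if max_abs > 0 else 0
    frac_bits = word_bits - 1 - int_bits
    scale = 2 ** frac_bits
    qmin, qmax = -(2 ** (word_bits - 1)), 2 ** (word_bits - 1) - 1
    # Round to the nearest representable value and clamp to the word range.
    quantized = [min(qmax, max(qmin, round(v * scale))) for v in values]
    return quantized, frac_bits

def dequantize(quantized, frac_bits):
    """Map fixed-point integers back to real values."""
    scale = 2 ** frac_bits
    return [q / scale for q in quantized]
```

For example, the group `[0.5, -0.25, 1.5]` needs one integer bit, so a 16-bit word keeps 14 fractional bits, and all three values are represented exactly; a group with a larger range would automatically trade fractional precision for integer range.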

Key words: convolutional neural network, embedded devices, dynamic multi-precision fixed-point data quantization, parallel convolutional operation hardware architecture