Computer Engineering and Applications, 2020, Vol. 56, Issue (22): 48-54. DOI: 10.3778/j.issn.1002-8331.1912-0384


Optimized Design and FPGA Implementation of High-Performance Face Recognition Accelerator

WU Jin, ZHANG Weihua, XI Meng, DAI Wei   

  1. School of Electronic Engineering, Xi’an University of Posts and Telecommunications, Xi’an 710121, China
  • Online: 2020-11-15  Published: 2020-11-13

高性能人脸识别加速器优化设计及FPGA实现

吴进,张伟华,席萌,代巍   

  1. 西安邮电大学 电子工程学院,西安 710121

Abstract:

The rapid development of computer vision places ever higher demands on the system performance of embedded products. On traditional Field Programmable Gate Array (FPGA) platforms, computational throughput is poorly matched to memory bandwidth, and general-purpose processors implement Convolutional Neural Networks (CNNs) inefficiently, so performance requirements are not met. To address these design bottlenecks, a high-performance face recognition neural network accelerator based on the classic LeNet-5 model is designed on the Xilinx ZC706 embedded development platform. Using High Level Synthesis (HLS) tools, the neural network model is optimized through storage optimization, fixed-point quantization, and computational optimization, and a 7-layer CNN accelerator is implemented. Experimental results show that the CNN accelerator operates at 200 MHz. Compared with a CPU, it achieves a 126-fold speedup; it is more than 10 times faster than a GPU, and its power consumption is only 2.62 W.

Key words: CNN accelerator, Field Programmable Gate Array (FPGA), High Level Synthesis (HLS), storage optimization, fixed-point quantization
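
The abstract names fixed-point quantization as one of the optimizations but does not state the word length used. As a rough illustration only, the following Python sketch shows the general idea: weights and activations are rounded to signed integer codes, multiplied and accumulated as integers, and rescaled once at the end. The 16-bit word with 8 fractional bits and the helper names `quantize`, `dequantize`, and `fixed_dot` are assumptions made for this example, not details taken from the paper.

```python
def quantize(x, frac_bits=8, total_bits=16):
    """Round x to the nearest signed fixed-point code, saturating at the range limits.

    frac_bits and total_bits are illustrative defaults, not the paper's settings.
    """
    scale = 1 << frac_bits
    lo = -(1 << (total_bits - 1))
    hi = (1 << (total_bits - 1)) - 1
    code = int(round(x * scale))
    return max(lo, min(hi, code))


def dequantize(code, frac_bits=8):
    """Convert a fixed-point code back to a floating-point value."""
    return code / (1 << frac_bits)


def fixed_dot(weights, inputs, frac_bits=8, total_bits=16):
    """Multiply-accumulate on integer codes, as a convolution inner loop would.

    Each product carries 2 * frac_bits fractional bits, so the accumulator
    is rescaled by that amount once at the end.
    """
    acc = 0
    for w, a in zip(weights, inputs):
        acc += quantize(w, frac_bits, total_bits) * quantize(a, frac_bits, total_bits)
    return acc / (1 << (2 * frac_bits))
```

With these assumed settings, the rounding error of a single in-range value is at most half of one step (1/512), and values outside the representable range saturate rather than wrap around. On an FPGA the integer multiply-accumulate would map to much cheaper logic than floating-point arithmetic, which is the usual motivation for this kind of quantization.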
