Computer Engineering and Applications ›› 2024, Vol. 60 ›› Issue (14): 96-104. DOI: 10.3778/j.issn.1002-8331.2311-0276

• Pattern Recognition and Artificial Intelligence •

Lightweight Face Recognition Algorithm Combining Transformer and CNN

LI Ming, DANG Qingxia   

  1. Engineering Research Center of Hubei Province for Clothing Information, Wuhan Textile University, Wuhan 430200, China
  2. Hubei Key Laboratory of Digital Textile Equipment, Wuhan Textile University, Wuhan 430200, China
  • Online: 2024-07-15  Published: 2024-07-15

Abstract: With the development of deep learning, convolutional neural networks (CNNs) have become the mainstream approach for face recognition (FR), gradually expanding the receptive field by stacking convolutional layers to fuse local features. However, this approach neglects the global semantic information of faces and pays insufficient attention to key facial features, which limits recognition accuracy; moreover, stacking many parameter-heavy layers makes the network difficult to deploy on resource-constrained devices. Therefore, an extremely lightweight FR algorithm combining Transformer and CNN, called gcsamTfaceNet, is proposed. First, depthwise separable convolutions are used to construct the backbone network, reducing the parameter count of the algorithm. Second, a channel-spatial attention mechanism is introduced to select features optimally in both the channel and spatial domains, improving the attention given to important facial regions. Building on this, a Transformer module is integrated to capture the global semantic information of the feature maps, overcoming the limitations of CNNs in modeling long-range semantic dependencies and enhancing the algorithm's ability to perceive global features. With a parameter count of only 6.5×10^5, gcsamTfaceNet is evaluated on nine validation datasets (LFW, CA-LFW, CP-LFW, CFP-FP, CFP-FF, AgeDB-30, VGG2-FP, IJB-B, and IJB-C), achieving average accuracies of 99.67%, 95.60%, 89.32%, 93.67%, 99.65%, 96.35%, 93.36%, 89.43%, and 91.38%, respectively. This demonstrates a good trade-off between parameter count and performance.
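As an illustration of the three building blocks the abstract describes, the following PyTorch sketch combines a depthwise separable convolution, a CBAM-style channel-spatial attention block, and a single Transformer encoder layer applied to the flattened feature map. The module names (DepthwiseSeparableConv, ChannelSpatialAttention, HybridBlock) and all hyperparameters are illustrative assumptions; it does not reproduce the paper's actual gcsamTfaceNet architecture.

# Minimal sketch of the hybrid design described in the abstract (assumed modules,
# not the authors' implementation): depthwise separable convolution for a small
# parameter count, channel-spatial attention, then a Transformer over feature tokens.
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """Depthwise 3x3 convolution followed by a 1x1 pointwise convolution."""
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, 3, stride, 1, groups=in_ch, bias=False)
        self.pointwise = nn.Conv2d(in_ch, out_ch, 1, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.act(self.bn(self.pointwise(self.depthwise(x))))

class ChannelSpatialAttention(nn.Module):
    """CBAM-style attention: reweight channels first, then spatial positions."""
    def __init__(self, channels, reduction=8):
        super().__init__()
        self.channel_mlp = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),
        )
        self.spatial_conv = nn.Sequential(nn.Conv2d(2, 1, 7, padding=3), nn.Sigmoid())

    def forward(self, x):
        x = x * self.channel_mlp(x)                      # channel attention
        avg_map = x.mean(dim=1, keepdim=True)            # per-pixel channel average
        max_map = x.max(dim=1, keepdim=True).values      # per-pixel channel maximum
        return x * self.spatial_conv(torch.cat([avg_map, max_map], dim=1))

class HybridBlock(nn.Module):
    """Local features (conv + attention) followed by global modeling (Transformer)."""
    def __init__(self, in_ch, out_ch, num_heads=4):
        super().__init__()
        self.conv = DepthwiseSeparableConv(in_ch, out_ch)
        self.attn = ChannelSpatialAttention(out_ch)
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=out_ch, nhead=num_heads, dim_feedforward=2 * out_ch, batch_first=True)
        self.transformer = nn.TransformerEncoder(encoder_layer, num_layers=1)

    def forward(self, x):
        x = self.attn(self.conv(x))
        b, c, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)            # (B, H*W, C) token sequence
        tokens = self.transformer(tokens)                # long-range dependency modeling
        return tokens.transpose(1, 2).reshape(b, c, h, w)

if __name__ == "__main__":
    block = HybridBlock(64, 64)
    out = block(torch.randn(2, 64, 14, 14))
    print(out.shape)  # torch.Size([2, 64, 14, 14])

The design choice this sketch reflects is the one argued for in the abstract: convolution and attention handle local, parameter-efficient feature extraction, while the Transformer layer operates on the flattened feature map to supply the global semantic context that stacked convolutions alone lack.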

Key words: lightweight face recognition, convolutional neural network, Transformer, attention mechanism