Lightweight Face Recognition Algorithm Combining Transformer and CNN

doi:10.3778/j.issn.1002-8331.2311-0276

Abstract

Abstract: With the development of deep learning, convolutional neural networks (CNNs) have become the mainstream approach to face recognition (FR); they integrate local features by stacking convolutional layers to gradually expand the receptive field. However, this approach neglects the global semantic information of faces and pays insufficient attention to important facial features, which limits recognition accuracy. In addition, stacking many layers with large parameter counts makes such networks difficult to deploy on resource-constrained devices. To address these problems, an extremely lightweight FR algorithm that combines Transformer and CNN, called gcsamTfaceNet, is proposed. First, depthwise separable convolution is used to build the backbone network, reducing the parameter count. Second, a channel-spatial attention mechanism is introduced to select features in both the channel and spatial domains, increasing the attention paid to important facial regions. On this basis, a Transformer module is integrated to capture the global semantic information of the feature maps, overcoming the limitations of CNNs in modeling long-range semantic dependencies and improving the algorithm's perception of global features. With only 6.5×10^5 parameters, gcsamTfaceNet is evaluated on nine validation datasets (LFW, CA-LFW, CP-LFW, CFP-FP, CFP-FF, AgeDB-30, VGG2-FP, IJB-B, and IJB-C), achieving average accuracies of 99.67%, 95.60%, 89.32%, 93.67%, 99.65%, 96.35%, 93.36%, 89.43%, and 91.38%, respectively, which represents a good trade-off between parameter count and performance.

Key words: lightweight face recognition, convolutional neural network, Transformer, attention mechanism
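
The parameter saving that depthwise separable convolution offers can be illustrated with a short calculation. The sketch below is not taken from the paper: the kernel size and channel counts are hypothetical values chosen only for illustration, and they do not describe any actual layer of gcsamTfaceNet. It compares the weight count of a standard convolution with that of a depthwise convolution followed by a 1×1 pointwise convolution, ignoring bias terms.

```python
def standard_conv_params(k, c_in, c_out):
    """Weights in a standard k x k convolution (bias omitted)."""
    return k * k * c_in * c_out


def depthwise_separable_params(k, c_in, c_out):
    """Weights in a k x k depthwise convolution plus a 1 x 1 pointwise convolution (bias omitted)."""
    depthwise = k * k * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1 x 1 convolution that mixes channels
    return depthwise + pointwise


# Hypothetical layer shape, assumed for illustration only.
k, c_in, c_out = 3, 64, 128

standard = standard_conv_params(k, c_in, c_out)
separable = depthwise_separable_params(k, c_in, c_out)
ratio = separable / standard  # equals 1/c_out + 1/k**2

print(f"standard:  {standard} weights")
print(f"separable: {separable} weights")
print(f"ratio:     {ratio:.3f}")
```

For this assumed shape the separable layer needs roughly 12% of the weights of the standard layer, which matches the general ratio 1/c_out + 1/k².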


LI Ming, DANG Qingxia. Lightweight Face Recognition Algorithm Combining Transformer and CNN[J]. Computer Engineering and Applications, 2024, 60(14): 96-104.

