Local and Global View Occlusion Facial Expression Recognition Method

doi:10.3778/j.issn.1002-8331.2309-0213

Abstract

Occlusions in real-world scenes make facial expression recognition more difficult. To address this, the paper proposes a method that combines sliding-block local weighted convolutional attention with a vision Transformer that uses global attention pooling. A backbone network extracts a facial expression feature map, which is cropped into multiple regional blocks; a local Patch attention unit then perceives occluded regions by adaptively computing attention weights for the local features, and extracts local expression features. In parallel, the feature map is converted into Patch blocks, and a vision Transformer with Patch-level and Token-level attention pooling captures the interactions and correlations among the Patch blocks from a global perspective. This guides the model to emphasize the most discriminative features and to ignore occluded regions, reducing the influence of irrelevant features. Experiments on three expression datasets, their occlusion subsets, and one occlusion dataset show that the proposed model outperforms existing methods on occluded facial expression recognition.

Key words: occluded facial expression recognition, sliding-block local convolutional attention, Patch attention pooling, Token attention pooling, vision Transformer
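
The two attention ideas named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the scoring function or the pooling rule, so the sigmoid-of-mean score in `region_attention` and the top-k selection in `attention_pool` are assumptions chosen for illustration, and all function names are hypothetical. The first part crops a feature map into regions and weights each region's feature by an adaptive attention score; the second part keeps only the highest-scoring blocks, as an attention pooling step might.

```python
import math

def crop_regions(feature_map, size):
    """Split an H x W feature map (list of rows) into non-overlapping size x size regions."""
    h, w = len(feature_map), len(feature_map[0])
    regions = []
    for top in range(0, h - size + 1, size):
        for left in range(0, w - size + 1, size):
            regions.append([row[left:left + size] for row in feature_map[top:top + size]])
    return regions

def region_mean(region):
    """Average activation of one region (a stand-in for a pooled local feature)."""
    values = [v for row in region for v in row]
    return sum(values) / len(values)

def region_attention(region, weight=1.0, bias=0.0):
    """Assumed scoring rule: sigmoid of the scaled mean activation.
    In a trained model a small learned sub-network would produce this score."""
    return 1.0 / (1.0 + math.exp(-(weight * region_mean(region) + bias)))

def weighted_local_features(feature_map, size):
    """Return (attention weight, weighted feature) for every cropped region."""
    result = []
    for region in crop_regions(feature_map, size):
        alpha = region_attention(region)
        result.append((alpha, alpha * region_mean(region)))
    return result

def attention_pool(items, scores, keep):
    """Assumed pooling rule: keep the `keep` highest-scoring items in their original order."""
    ranked = sorted(range(len(items)), key=lambda i: scores[i], reverse=True)[:keep]
    return [items[i] for i in sorted(ranked)]

# Toy 4 x 4 feature map; the strongly negative top-right block plays the occluded region.
fmap = [
    [2.0, 2.0, -3.0, -3.0],
    [2.0, 2.0, -3.0, -3.0],
    [0.0, 0.0, 1.0, 1.0],
    [0.0, 0.0, 1.0, 1.0],
]
local = weighted_local_features(fmap, size=2)
weights = [alpha for alpha, _ in local]
kept = attention_pool(["p0", "p1", "p2", "p3"], weights, keep=2)
print(kept)
```

In this toy input the low-activation block receives the smallest weight and is discarded by the pooling step, which mirrors the stated goal of emphasizing discriminative regions while suppressing occluded ones.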

NAN Yahui, HUA Qingyi. Local and Global View Occlusion Facial Expression Recognition Method[J]. Computer Engineering and Applications, 2024, 60(13): 180-189.
