Computer Engineering and Applications ›› 2024, Vol. 60 ›› Issue (13): 180-189. DOI: 10.3778/j.issn.1002-8331.2309-0213

• Pattern Recognition and Artificial Intelligence •

Local and Global View Occlusion Facial Expression Recognition Method
(局部加全局视角遮挡人脸表情识别方法)

NAN Yahui (南亚会), HUA Qingyi (华庆一)

  1. College of Information Science and Technology, Northwest University, Xi'an 710127, China
  2. Department of Computer Science and Technology, Lyuliang University, Lyuliang, Shanxi 033001, China
  • Online: 2024-07-01 Published: 2024-07-01

Abstract: Occlusions in real-world scenes make facial expression recognition harder. To address this, the paper proposes a method that combines sliding-block locally weighted convolutional attention with a vision Transformer that uses global attention pooling. A backbone network extracts an expression feature map, which is cropped into multiple region blocks; a local Patch attention unit adaptively computes attention weights for the local features to perceive occluded regions and extract local expression features. In parallel, the feature map is converted into Patch blocks, and a vision Transformer with Patch-level and Token-level attention pooling captures the interactions and correlations among the Patch blocks from a global perspective. This design guides the model to emphasize the most discriminative features and to ignore occluded regions, reducing the influence of irrelevant features. Experiments on three expression datasets, their occlusion subsets, and one occlusion dataset show that the proposed model outperforms existing methods on occluded facial expression recognition.

Key words: occluded facial expression recognition, sliding-block local convolutional attention, Patch attention pooling, Token attention pooling, vision Transformer
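
The two ideas named in the abstract, attention-weighted local regions and attention pooling, can be illustrated with a toy sketch in plain Python. It is not the authors' implementation. The function names, the average-pooled descriptor, the sigmoid gate with fixed `scale` and `bias`, and the top-k selection rule are all assumptions chosen for illustration; in the paper the weights are learned by the network.

```python
import math


def crop_regions(feature_map, size, stride):
    """Slide a size x size window over a 2-D feature map and return the crops."""
    height, width = len(feature_map), len(feature_map[0])
    regions = []
    for top in range(0, height - size + 1, stride):
        for left in range(0, width - size + 1, stride):
            rows = feature_map[top:top + size]
            regions.append([row[left:left + size] for row in rows])
    return regions


def region_descriptor(region):
    """Average-pool a region to one scalar (hypothetical stand-in for a conv branch)."""
    values = [v for row in region for v in row]
    return sum(values) / len(values)


def attention_weight(descriptor, scale=1.0, bias=0.0):
    """Sigmoid gate in (0, 1); scale and bias stand in for learned parameters."""
    return 1.0 / (1.0 + math.exp(-(scale * descriptor + bias)))


def local_weighted_features(feature_map, size=2, stride=2):
    """Return attention-weighted region descriptors and the weights themselves."""
    descriptors = [region_descriptor(r) for r in crop_regions(feature_map, size, stride)]
    weights = [attention_weight(d) for d in descriptors]
    weighted = [w * d for w, d in zip(weights, descriptors)]
    return weighted, weights


def attention_pool(tokens, scores, keep):
    """Keep the `keep` highest-scoring tokens, preserving their original order."""
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)[:keep]
    return [tokens[i] for i in sorted(ranked)]


# Toy 4x4 feature map; the top-right block is assumed to be occluded
# and is given low activations purely for the sake of the example.
feature_map = [
    [1.0, 1.0, -2.0, -2.0],
    [1.0, 1.0, -2.0, -2.0],
    [0.5, 0.5, 2.0, 2.0],
    [0.5, 0.5, 2.0, 2.0],
]
weighted, weights = local_weighted_features(feature_map)
kept = attention_pool(["r0", "r1", "r2", "r3"], weights, keep=2)
print(weights)
print(kept)
```

In this toy setting the region with the lowest activations receives the smallest weight and is dropped by the pooling step, which mirrors the stated goal of down-weighting occluded regions while keeping the most discriminative ones.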