Computer Engineering and Applications ›› 2024, Vol. 60 ›› Issue (6): 238-248. DOI: 10.3778/j.issn.1002-8331.2210-0398

• Graphics and Image Processing •


Expression Recognition Combining 3D Interactive Attention and Semantic Aggregation

WANG Guangyu, LUO Xiaoshu, XU Zhaoxing, FENG Fangyu, XU Jiangjie   

  1. School of Electronics and Information Engineering, Guangxi Normal University, Guilin, Guangxi 541004, China
  2. School of Big Data, Jiangxi Institute of Fashion Technology, Nanchang 330201, China
  • Online:2024-03-15 Published:2024-03-15


Abstract: A facial expression recognition method combining 3D interactive attention and semantic aggregation is proposed to address three weaknesses of conventional convolutional networks: difficulty in effectively integrating facial-expression features from different stages, representation bottlenecks, and inefficient use of contextual semantics. Firstly, the rank-expansion network (ReXNet) is optimized so that it fuses contextual features while eliminating its representation bottleneck, making it better suited to the expression recognition task. Secondly, to capture discriminative fine-grained facial-expression features, a 3D interactive attention module is constructed by combining non-local blocks with cross-dimensional information interaction. Finally, to fully exploit both the shallow and mid-level low-level features and the high-level semantic features of expressions, a semantic aggregation module is designed that aggregates multi-level global contextual features with high-level semantic information, so that semantics of the same expression class reinforce one another and intra-class consistency is enhanced. Experiments show that the method achieves accuracies of 88.89%, 89.53% and 62.22% on the public datasets RAF-DB, FERPlus and AffectNet-8, respectively, demonstrating its effectiveness.
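The paper's implementation is not reproduced here, but the idea of combining cross-dimensional information interaction with channel/spatial gating can be sketched as follows. This is a minimal illustrative module, assuming a CNN feature map of shape (B, C, H, W); the class and branch names are hypothetical, and the non-local global-context term the paper also uses is omitted for brevity.

```python
import torch
import torch.nn as nn


class CrossDimBranch(nn.Module):
    """One interaction branch: max- and mean-pool along the first non-batch
    dimension, then gate the input with a convolution over the other two
    dimensions (in the spirit of cross-dimensional interaction)."""
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):                       # x: (B, D1, D2, D3)
        pooled = torch.cat([x.max(dim=1, keepdim=True).values,
                            x.mean(dim=1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.conv(pooled))


class Interactive3DAttention(nn.Module):
    """Applies the branch to three rotated views of the feature tensor so
    that every pair of dimensions (H-W, C-W, C-H) interacts, then averages
    the three gated tensors."""
    def __init__(self):
        super().__init__()
        self.branch_hw = CrossDimBranch()   # pools C, gates the H-W plane
        self.branch_cw = CrossDimBranch()   # pools H, gates the C-W plane
        self.branch_ch = CrossDimBranch()   # pools W, gates the C-H plane

    def forward(self, x):                       # x: (B, C, H, W)
        y1 = self.branch_hw(x)
        y2 = self.branch_cw(x.permute(0, 2, 1, 3)).permute(0, 2, 1, 3)
        y3 = self.branch_ch(x.permute(0, 3, 2, 1)).permute(0, 3, 2, 1)
        return (y1 + y2 + y3) / 3.0
```

The output has the same shape as the input, so the module can be dropped between backbone stages without changing the surrounding architecture.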

Key words: facial expression recognition, representation bottleneck, 3D interactive attention, contextual semantics
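The semantic aggregation idea described in the abstract — fusing multi-level global context with high-level semantics — can be sketched as below. This is an assumption-laden illustration, not the authors' code: it assumes three backbone stages of known channel widths, projects each to a common width with 1×1 convolutions, resizes all maps to the top stage's spatial size, and fuses them by summation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SemanticAggregation(nn.Module):
    """Projects multi-stage feature maps to a common channel width, resizes
    them to the top (most semantic) stage's spatial size, and sums them so
    shallow and mid-level context reinforces the high-level semantics."""
    def __init__(self, in_channels, out_channels=256):
        super().__init__()
        self.proj = nn.ModuleList(
            nn.Conv2d(c, out_channels, kernel_size=1) for c in in_channels)
        self.fuse = nn.Conv2d(out_channels, out_channels,
                              kernel_size=3, padding=1)

    def forward(self, feats):                   # feats: low -> high stages
        target = feats[-1].shape[-2:]           # spatial size of top stage
        agg = sum(F.interpolate(p(f), size=target, mode='bilinear',
                                align_corners=False)
                  for p, f in zip(self.proj, feats))
        return self.fuse(agg)
```

A classifier head (global average pooling plus a linear layer) would then operate on the fused map; in the paper this fusion is additionally paired with a loss encouraging intra-class consistency.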