Computer Engineering and Applications ›› 2025, Vol. 61 ›› Issue (18): 142-156. DOI: 10.3778/j.issn.1002-8331.2403-0291

• Pattern Recognition and Artificial Intelligence •


Key-Points Guidance and Significant-Frames Enhancement Network for Emotion Recognition

HUANG Zhong, ZHANG Danni, REN Fuji, HU Min, LIU Juan   

  1. School of Electronic Engineering and Intelligent Manufacturing, Anqing Normal University, Anqing, Anhui 246133, China
    2. School of Computer Science and Information, Hefei University of Technology, Hefei 230009, China
    3. School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu 610056, China
  • Online: 2025-09-15  Published: 2025-09-15


Abstract: Aiming at the difference in spatial proportion and the asynchrony of temporal peaks between facial-expression and bodily-posture emotion cues, a key-points guidance and significant-frames enhancement network for emotion recognition (KGSE-ER) is proposed. In the spatial key-points guidance subnet, to capture the intra-frame emotional correlation and complementarity between facial expression and bodily posture, a spatial key-points guidance mechanism built on cross-modal attention and a residual structure obtains facial-expression guidance semantics and bodily-posture guidance semantics. In the temporal significant-frames enhancement subnet, to reduce the inter-frame redundancy caused by the asynchronous emotional peaks of facial expression and bodily posture, emotional discrimination and dispersion are measured from the two guidance semantics, and a temporal significant-frames enhancement strategy is proposed to enhance the spatiotemporal features of the guidance-semantic sequences. Experimental results on the FABO and CAER video datasets show that the proposed network reaches emotion recognition accuracies of 95.31% and 89.78%, which are 11.50 and 13.66 percentage points higher than the baseline network, respectively. Compared with related methods, the proposed network achieves better emotion recognition performance on both natural-scene dynamic video datasets and static image datasets.
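As a rough illustration of the spatial key-points guidance mechanism described in the abstract, the PyTorch sketch below pairs cross-modal attention with a residual connection to produce expression-guided and posture-guided semantics. The class name KeyPointsGuidance, the feature dimensions, and the choice of nn.MultiheadAttention with layer normalization are assumptions for illustration, not the paper's exact design.

```python
import torch
import torch.nn as nn

class KeyPointsGuidance(nn.Module):
    """Hypothetical sketch of a spatial key-points guidance mechanism:
    cross-modal attention plus a residual connection, yielding
    expression-guided and posture-guided semantics. Dimensions and
    layer choices are assumptions, not the published architecture."""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        # one attention block per guidance direction
        self.face_to_pose = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.pose_to_face = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_face = nn.LayerNorm(dim)
        self.norm_pose = nn.LayerNorm(dim)

    def forward(self, face_feat, pose_feat):
        # face_feat, pose_feat: (batch, tokens, dim) per-frame key-point features
        # expression-guided semantics: face queries attend to posture keys/values
        fg, _ = self.face_to_pose(face_feat, pose_feat, pose_feat)
        face_guided = self.norm_face(face_feat + fg)  # residual keeps intra-modal cues
        # posture-guided semantics: posture queries attend to facial keys/values
        pg, _ = self.pose_to_face(pose_feat, face_feat, face_feat)
        pose_guided = self.norm_pose(pose_feat + pg)
        return face_guided, pose_guided
```

The residual sum preserves each modality's own cues while the attention term injects the complementary information from the other modality, matching the correlation-and-complementarity goal stated above.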
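Likewise, here is a minimal sketch of how a temporal significant-frames enhancement strategy might weight frames, assuming "discrimination" is read as peak class confidence and "dispersion" as the spread of the per-frame class distribution. The function name, the two measures, and the mixing weight alpha are hypothetical stand-ins for the paper's actual definitions.

```python
import torch

def significant_frame_weights(frame_logits, alpha=0.5):
    """Hypothetical frame-weighting sketch for temporal significant-frames
    enhancement. Discrimination is approximated by peak class confidence
    and dispersion by the variance of the class distribution; the paper's
    exact measures may differ."""
    probs = torch.softmax(frame_logits, dim=-1)      # (T, num_classes)
    discrimination = probs.max(dim=-1).values        # confident frames score high
    dispersion = probs.var(dim=-1)                   # peaked distributions score high
    score = alpha * discrimination + (1 - alpha) * dispersion
    return torch.softmax(score, dim=0)               # weights over the T frames

# usage: emphasize emotionally salient frames before temporal pooling
# weights = significant_frame_weights(logits)        # logits: (T, C)
# clip_feature = (weights.unsqueeze(-1) * frame_features).sum(dim=0)
```

Down-weighting low-scoring frames is one way to suppress the inter-frame redundancy that arises when expression and posture peaks do not align in time.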

Key words: facial expression, bodily posture, emotion recognition, key-points guidance, significant-frames enhancement