Computer Engineering and Applications ›› 2024, Vol. 60 ›› Issue (20): 293-301. DOI: 10.3778/j.issn.1002-8331.2306-0206

• Engineering and Applications •

Gaze Target Detection in Video Based on Gaze Transition Learning

杨兴明,史俊彪,李自强,吴克伟,谢昭   

  1. Key Laboratory of Knowledge Engineering with Big Data (Hefei University of Technology), Ministry of Education, Hefei 230601, China
    2. School of Computer and Information Engineering, Hefei University of Technology, Hefei 230601, China
  • Online: 2024-10-15 Published: 2024-10-15

Learning Gaze Transition for Gaze Target Detection in Video

YANG Xingming, SHI Junbiao, LI Ziqiang, WU Kewei, XIE Zhao   

  1. Key Laboratory of Knowledge Engineering with Big Data (Hefei University of Technology), Ministry of Education, Hefei 230601, China
    2. School of Computer and Information Engineering, Hefei University of Technology, Hefei 230601, China
  • Online: 2024-10-15 Published: 2024-10-15

Abstract: Gaze target detection in video requires estimating the position of the target that a person in each video frame is gazing at. A person gazes at different targets at different times, and in the transition segment between two gaze targets, the person is not gazing at any specific target. Gaze target detection methods based on an image Transformer neglect to suppress this gaze transition phenomenon, and the gaze direction during a transition interferes with estimating the true position of the gaze target. For gaze target detection in video, this paper proposes a gaze transition-based model that contains a gaze direction guidance module and a gaze transition temporal fusion module. In the gaze direction guidance module, the gaze target position is used to estimate a gaze direction heatmap, which then guides the generation of the gaze target heatmap; this suppresses target responses outside the gaze direction and improves the accuracy of gaze target localization. In the gaze transition temporal fusion module, the gaze target heatmaps over time form a spatial-temporal heatmap. The module applies a bi-directional spatial-temporal convolutional long short-term memory (LSTM) network to the spatial-temporal heatmap, producing memory-fused gaze target heatmaps that describe how the gaze target changes over time. The module describes the gaze transition segment with a Gaussian temporal model. Since the temporal length of a gaze transition is uncertain, the module designs a Gaussian temporal fusion method to estimate the length of the transition together with its start and end times. Accurately localizing the transition segment suppresses the interference of the gaze transition phenomenon with gaze target position estimation. The model is trained with a gaze direction loss, a gaze target existence loss, a gaze target heatmap loss, and a gaze transition temporal localization loss. Experiments on the GazeFollow and VideoAttentionTarget datasets show that the gaze transition-based model outperforms image Transformer-based gaze target detection methods.

Key words: gaze target detection, gaze transition, gaze target heatmap, spatial-temporal convolution long short-term memory, Gaussian-based temporal fusion

Abstract: Gaze target detection in video aims to localize the target a person is gazing at in each video frame. A person gazes at different targets at different times, and in the transition segment from one gaze target to the next, the person may not gaze at any specific target. Gaze target detection methods based on an image Transformer neglect this temporal transition segment, and the gaze direction within it can hinder gaze target detection in video. For gaze target detection in video, this paper proposes a gaze transition-based model that contains a gaze direction guidance module and a gaze transition temporal fusion module. In the gaze direction guidance module, the position of the gaze target is used to learn a heatmap of the gaze direction. The gaze target heatmap is generated under the guidance of the gaze direction heatmap, which suppresses targets outside the gaze direction and yields a more accurate gaze target position. In the gaze transition temporal fusion module, the heatmaps over multiple frames form a spatial-temporal heatmap. To learn the changes in the spatial-temporal heatmap, the model applies a bi-directional spatial-temporal convolutional long short-term memory (LSTM) network, which extracts a memory-fused spatial-temporal heatmap. The gaze transition is described by a Gaussian-based temporal model. To localize a transition segment of uncertain temporal length, this paper designs a Gaussian-based temporal fusion method that estimates the start timestamp, the end timestamp, and the temporal length of the gaze transition. By localizing the gaze transition segment, its interfering effect on gaze target detection can be removed. The model is trained with a gaze direction loss, a gaze target existence loss, a gaze target heatmap loss, and a gaze transition temporal localization loss. Experimental results on the GazeFollow and VideoAttentionTarget datasets show that the gaze transition-based model outperforms the image Transformer-based model for gaze target detection in video.
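The gaze direction guidance idea above can be illustrated with a minimal sketch. This is not the authors' implementation: the cone-shaped direction heatmap, the `kappa` sharpness parameter, and the element-wise gating are assumptions made only to show how a direction heatmap can suppress target responses off the gaze direction.

```python
import numpy as np

def gaze_direction_heatmap(h, w, eye_xy, direction_xy, kappa=8.0):
    """Cone-shaped heatmap peaking along the gaze ray from the eye.

    kappa (an assumed parameter for this sketch) controls how sharply
    the response falls off away from the gaze direction."""
    ys, xs = np.mgrid[0:h, 0:w].astype(np.float64)
    vx, vy = xs - eye_xy[0], ys - eye_xy[1]
    norm = np.sqrt(vx ** 2 + vy ** 2) + 1e-8
    d = np.asarray(direction_xy, dtype=np.float64)
    d /= np.linalg.norm(d) + 1e-8
    cos_sim = (vx * d[0] + vy * d[1]) / norm   # cosine to the gaze direction
    return np.exp(kappa * (cos_sim - 1.0))     # 1 on the ray, near 0 behind

def guide_target_heatmap(target_hm, direction_hm):
    """Gate the gaze target heatmap with the direction heatmap, which
    suppresses candidate targets lying off the gaze direction."""
    guided = target_hm * direction_hm
    return guided / (guided.max() + 1e-8)
```

With the eye at the image center and a rightward gaze, pixels along the ray keep their response while pixels behind the head are suppressed.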
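The paper fuses the spatial-temporal heatmap with a bi-directional spatial-temporal ConvLSTM. As a much lighter stand-in that shows only the bi-directional fusion pattern (a forward and a backward recurrent pass combined per frame), not the ConvLSTM itself, consider this sketch; the mixing factor `alpha` and the averaging of the two passes are assumptions.

```python
import numpy as np

def directional_smooth(heatmaps, alpha=0.5, reverse=False):
    """One recurrent pass over time: each frame's heatmap is mixed with
    a running state, so earlier (or, if reversed, later) frames inform
    the current frame."""
    seq = heatmaps[::-1] if reverse else heatmaps
    state = np.zeros_like(seq[0])
    out = []
    for hm in seq:
        state = alpha * hm + (1.0 - alpha) * state
        out.append(state)
    out = np.stack(out)
    return out[::-1] if reverse else out

def bidirectional_fuse(heatmaps):
    """Average a forward and a backward pass, so every frame's fused
    heatmap carries memory from both past and future frames."""
    heatmaps = np.asarray(heatmaps, dtype=np.float64)  # (T, H, W)
    fwd = directional_smooth(heatmaps)
    bwd = directional_smooth(heatmaps, reverse=True)
    return 0.5 * (fwd + bwd)
```

A target that appears only in the middle frame leaks some response into the neighboring frames in both directions, which is the memory effect the bi-directional ConvLSTM exploits at full scale.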
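The Gaussian-based temporal model of a transition segment can likewise be sketched. This is an illustrative assumption, not the paper's learned fusion: a transition is modeled as a Gaussian over frame indices, its start/end timestamps are read off from the center and length, and frames likely to lie inside the transition are down-weighted when fusing per-frame heatmaps.

```python
import numpy as np

def transition_weights(num_frames, center, length):
    """Gaussian temporal model of a gaze transition: the probability that
    each frame lies inside the transition segment, given its (estimated)
    center and temporal length."""
    t = np.arange(num_frames, dtype=np.float64)
    sigma = max(length / 2.0, 1e-6)          # std-dev from segment length
    return np.exp(-0.5 * ((t - center) / sigma) ** 2)

def transition_span(center, length):
    """Start and end timestamps implied by the Gaussian model."""
    return center - length / 2.0, center + length / 2.0

def fuse_heatmaps(heatmaps, center, length):
    """Fuse per-frame gaze target heatmaps while down-weighting frames
    inside the transition segment, where no specific target is gazed at."""
    heatmaps = np.asarray(heatmaps, dtype=np.float64)  # (T, H, W)
    w = 1.0 - transition_weights(len(heatmaps), center, length)
    w /= w.sum() + 1e-8                                 # normalize weights
    return np.tensordot(w, heatmaps, axes=1)            # fused (H, W) map
```

The frame at the transition center receives zero weight, so a spurious heatmap produced mid-transition cannot corrupt the fused gaze target estimate.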

Key words: gaze target detection, gaze transition, gaze target heatmap, spatial-temporal convolution long short-term memory, Gaussian-based temporal fusion