Learning Gaze Transition for Gaze Target Detection in Video

doi:10.3778/j.issn.1002-8331.2306-0206

Abstract

Gaze target detection in video aims to localize the gaze target in each video frame. A person gazes at different targets at different times, and during the transition from one gaze target to another the person may not be gazing at any specific target. Gaze target detection methods based on image transformers do not model this temporal transition segment, so the gaze direction during the transition can interfere with gaze target detection in video. To address this, this paper proposes a gaze transition-based model for gaze target detection in video, which contains a gaze direction guidance module and a gaze transition temporal fusion module. In the gaze direction guidance module, the position of the gaze target is used to learn a heatmap of the gaze direction. This gaze direction heatmap then guides the generation of the gaze target heatmap, which suppresses responses from targets outside the gaze direction and improves the localization accuracy of the gaze target. In the gaze transition temporal fusion module, the gaze target heatmaps of multiple frames form a spatial-temporal heatmap. To learn how the gaze target changes in the spatial-temporal heatmap, the module applies a bi-directional spatial-temporal convolutional long short-term memory (LSTM) network, which produces a memory-fused spatial-temporal heatmap. The gaze transition segment is described by a Gaussian-based temporal model. Because the temporal length of a gaze transition is uncertain, the module uses a Gaussian-based temporal fusion method that estimates the start timestamp, the end timestamp, and the temporal length of the gaze transition. Localizing the gaze transition segment suppresses its interference with gaze target detection. The model is trained with a gaze direction loss, a gaze target existence loss, a gaze target heatmap loss, and a gaze transition temporal localization loss. Experimental results on the GazeFollow and VideoAttentionTarget datasets show that the gaze transition-based model outperforms the image transformer-based model for gaze target detection in video.

Key words: gaze target detection, gaze transition, gaze target heatmap, spatial-temporal convolutional long short-term memory, Gaussian-based temporal fusion
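
The abstract does not give the formula behind the Gaussian-based temporal fusion, so the following is only a minimal sketch of the general idea under stated assumptions, not the authors' implementation. It assumes the transition segment is already given by a start and an end frame index, that the Gaussian is centered on the middle of the segment with a standard deviation of one quarter of its length, and that per-frame heatmaps are down-weighted by one minus the Gaussian weight. The function names `gaussian_transition_weights` and `suppress_transition` are hypothetical.

```python
import math


def gaussian_transition_weights(num_frames, start, end):
    """Return one weight per frame, peaking at the middle of [start, end].

    Assumption (not from the paper): sigma is a quarter of the segment
    length, so the segment covers roughly +/- 2 sigma around its center.
    """
    if end <= start:
        raise ValueError("end must be greater than start")
    center = 0.5 * (start + end)
    sigma = (end - start) / 4.0
    return [
        math.exp(-0.5 * ((t - center) / sigma) ** 2)
        for t in range(num_frames)
    ]


def suppress_transition(heatmaps, start, end):
    """Down-weight the heatmaps of frames inside the transition segment.

    heatmaps is a list of frames, each frame a list of rows of floats.
    Assumption (not from the paper): each frame is scaled by (1 - weight).
    """
    weights = gaussian_transition_weights(len(heatmaps), start, end)
    return [
        [[(1.0 - w) * value for value in row] for row in frame]
        for frame, w in zip(heatmaps, weights)
    ]


# Toy example: 10 identical 2x2 heatmaps, transition assumed at frames 3 to 7.
heatmaps = [[[1.0, 0.5], [0.25, 0.0]] for _ in range(10)]
fused = suppress_transition(heatmaps, start=3, end=7)
```

In this toy setup the frame at the center of the assumed transition is fully suppressed, while frames far from it are left almost unchanged. In the model described above, the start timestamp, end timestamp, and temporal length are estimated by the network rather than supplied by hand.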

摘要： 视频注视目标检测，需要估计视频帧中的人所注视目标的位置。在不同的时间，人会注视不同的目标。在两个注视目标转移的时间段内，人并没有注视特定的目标。基于图像Transformer的注视目标检测方法，忽略了抑制注视转移现象。注视转移中的注视方向，会干扰注视目标的真实位置估计。为了实现视频注视目标检测，提出一种基于注视转移的模型，该模型包括注视方向引导模块，注视转移时间融合模块。在注视方向引导模块中，注视目标位置被用于估计注视方向热图。该模块使用注视方向热图来引导注视目标热图生成，这有利于抑制非注视方向的目标响应，提高注视目标定位的准确性。在注视转移时间融合模块中，注视目标热图随着时间变化会产生时空热图。该模块对时空热图采用双向时空卷积长短期记忆网络（LSTM），产生时空记忆融合的注视目标热图，来描述时空热图中注视目标的变化过程。该模块将注视转移时间段描述为高斯时间模型。针对注视转移的时间长度不确定的问题，该模块设计高斯时间融合方法，来估计出注视转移的时间长度和注视转移的开始和结束时间。注视转移时间段的准确定位，抑制了注视转移现象对注视目标位置估计的干扰。该模型训练使用了注视方向损失、注视目标存在损失、注视目标热图损失，以及注视转移时间定位损失。实验采用GazeFollow和VideoAttentionTarget数据集。实验结果表明基于注视转移的模型，优于基于图像Transformer的注视目标检测方法。

关键词: 注视目标检测, 注视转移, 注视目标热图, 时空卷积长短期记忆网络, 高斯时间融合

YANG Xingming, SHI Junbiao, LI Ziqiang, WU Kewei, XIE Zhao. Learning Gaze Transition for Gaze Target Detection in Video[J]. Computer Engineering and Applications, 2024, 60(20): 293-301.

杨兴明, 史俊彪, 李自强, 吴克伟, 谢昭. 基于注视转移学习的视频注视目标检测[J]. 计算机工程与应用, 2024, 60(20): 293-301.
