[1] RECASENS A, KHOSLA A, VONDRICK C, et al. Where are they looking?[C]//Advances in Neural Information Processing Systems, 2015: 199-207.
[2] LIAN D, YU Z, GAO S. Believe it or not, we know what you are looking at![C]//Proceedings of the Asian Conference on Computer Vision, 2018: 35-50.
[3] CHONG E, WANG Y, RUIZ N, et al. Detecting attended visual targets in video[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2020: 5396-5406.
[4] WANG B, HU T, LI B, et al. GaTector: a unified framework for gaze object prediction[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2022: 19588-19597.
[5] KRIZHEVSKY A, SUTSKEVER I, HINTON G E. ImageNet classification with deep convolutional neural networks[C]//Proceedings of the 25th International Conference on Neural Information Processing Systems, 2012: 1097-1105.
[6] HE K, ZHANG X, REN S, et al. Deep residual learning for image recognition[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016: 770-778.
[7] LIN T, DOLLAR P, GIRSHICK R, et al. Feature pyramid networks for object detection[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017: 41-53.
[8] HU J, SHEN L, SUN G. Squeeze-and-excitation networks[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018: 7132-7141.
[9] WOO S, PARK J, LEE J, et al. CBAM: convolutional block attention module[C]//Proceedings of the European Conference on Computer Vision, 2018: 3-19.
[10] ZHANG Q, YANG Y. SA-Net: shuffle attention for deep convolutional neural networks[C]//Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, 2021: 2235-2239.
[11] PARK S, SPURR A, HILLIGES O. Deep pictorial gaze estimation[C]//Proceedings of the European Conference on Computer Vision, 2018: 741-757.
[12] AL-RAHAYFEH A, FAEZIPOUR M. Eye tracking and head movement detection: a state-of-art survey[J]. IEEE Journal of Translational Engineering in Health and Medicine, 2013: 11-22.
[13] WHITMIRE E, TRUTOIU L, CAVIN R, et al. EyeContact: scleral coil eye tracking for virtual reality[C]//Proceedings of the 2016 ACM International Symposium on Wearable Computers, 2016: 184-191.
[14] CHI J, ZHANG P, ZHENG S, et al. Key techniques of eye gaze tracking based on pupil corneal reflection[C]//Proceedings of the 2009 WRI Global Congress on Intelligent Systems, 2009: 133-138.
[15] ZHANG X, SUGANO Y, FRITZ M, et al. It's written all over your face: full-face appearance-based gaze estimation[J]. arXiv:1611.08860, 2017.
[16] ZHANG X, SUGANO Y, FRITZ M, et al. Appearance-based gaze estimation in the wild[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015: 4511-4520.
[17] MORA K, MONAY F, ODOBEZ J M. EYEDIAP: a database for the development and evaluation of gaze estimation algorithms from RGB and RGB-D cameras[C]//Proceedings of the International Symposium on Eye Tracking Research & Applications, 2014: 255-258.
[18] ZHANG X, SUGANO Y, FRITZ M, et al. MPIIGaze: real-world dataset and deep appearance-based gaze estimation[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019, 41(1): 162-175.
[19] SIMONYAN K, ZISSERMAN A. Very deep convolutional networks for large-scale image recognition[J]. arXiv:1409.1556, 2014.
[20] CHENG M, ZHANG G, MITRA N, et al. Global contrast based salient region detection[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2011: 409-416.
[21] ZHOU L, YANG Z, YUAN Q, et al. Salient region detection via integrating diffusion-based compactness and local contrast[J]. IEEE Transactions on Image Processing, 2015: 3308-3320.
[22] LIU N, HAN J. DHSNet: deep hierarchical saliency network for salient object detection[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016: 678-686.
[23] WANG L, WANG H, LU P, et al. Salient object detection with recurrent fully convolutional networks[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019, 41(7): 1734-1746.
[24] LUO Z, MISHRA A, ACHKAR A, et al. Non-local deep features for salient object detection[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017: 6593-6601.
[25] XIAO J, HAYS J, EHINGER K A, et al. SUN database: large-scale scene recognition from abbey to zoo[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2010: 3485-3492.
[26] LIN T, MAIRE M, BELONGIE S, et al. Microsoft COCO: common objects in context[C]//Proceedings of the European Conference on Computer Vision, 2014: 740-755.
[27] YAO B, JIANG X, KHOSLA A, et al. Human action recognition by learning bases of action attributes and parts[C]//Proceedings of the IEEE International Conference on Computer Vision, 2011: 1331-1338.
[28] RUSSAKOVSKY O, DENG J, SU H, et al. ImageNet large scale visual recognition challenge[J]. International Journal of Computer Vision, 2015, 115(3): 211-252.
[29] JUDD T, EHINGER K, DURAND F, et al. Learning to predict where humans look[C]//Proceedings of the IEEE International Conference on Computer Vision, 2009: 2106-2113.
[30] RANFTL R, LASINGER K, HAFNER D, et al. Towards robust monocular depth estimation: mixing datasets for zero-shot cross-dataset transfer[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022, 44(3): 1623-1637.
[31] CHONG E, RUIZ N, WANG Y, et al. Connecting gaze, scene, and attention: generalized attention estimation via joint modeling of gaze and scene saliency[C]//Proceedings of the European Conference on Computer Vision, 2018: 397-412.
[32] ZHAO H, LU M, YAO A, et al. Learning to draw sight lines[J]. International Journal of Computer Vision, 2019, 128(5): 1-25.
[33] CHEN W, XU H, ZHU C, et al. Gaze estimation via the joint modeling of multiple cues[J]. IEEE Transactions on Circuits and Systems for Video Technology, 2022, 32(3): 1390-1402.