Computer Engineering and Applications, 2024, Vol. 60, Issue (14): 240-249. DOI: 10.3778/j.issn.1002-8331.2305-0022

• Graphics and Image Processing •

Gaze Target Detection Network Based on Attention Mechanism and Depth Prior

ZHU Yun, ZHU Dongchen, ZHANG Guanghui, SUN Yanzan, ZHANG Xiaolin   

  1. School of Communication and Information Engineering, Shanghai University, Shanghai 200444, China
    2. Bionic Vision System Laboratory, Shanghai Institute of Microsystem and Information Technology, Chinese Academy of Sciences, Shanghai 200050, China
    3. University of Science and Technology of China, Hefei 230026, China
    4. ShanghaiTech University, Shanghai 201210, China
    5. Xiong’an Institute of Innovation, Chinese Academy of Sciences, Xiong’an, Hebei 071702, China
  • Online: 2024-07-15 Published: 2024-07-15

Abstract: Human gaze behavior is a non-verbal cue that plays a crucial role in revealing human intentions, and gaze target detection has attracted extensive attention in the machine vision community. However, existing methods mostly focus on extracting texture information from images and overlook the importance of stereo depth information, which makes it difficult for them to handle scenes with complex textures. To address this, a novel gaze target detection network based on an attention mechanism and a depth prior is proposed. It adopts a two-stage architecture consisting of a gaze direction prediction stage and a saliency detection stage. In the gaze direction prediction stage, a channel-spatial attention module is built to recalibrate texture features, and a head position encoding branch is designed, yielding highly representative features enhanced with texture and head-position awareness for accurate gaze direction prediction. Furthermore, depth, which represents the stereoscopic or distance information of the 3D scene, is introduced as a prior into the saliency detection stage, while the channel-spatial attention mechanism enhances the multi-scale texture features. The advantages of depth-based geometric information and image texture information are thereby fully exploited to improve the accuracy of gaze target detection. Experimental results on two authoritative datasets, GazeFollow and DLGaze, show that the proposed model performs favorably against state-of-the-art methods.

Key words: gaze target detection, attention mechanism, depth prior, feature aggregation, neural network
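
Illustrative sketch: the abstract does not specify how the channel-spatial attention module is built, so the following is only a minimal, parameter-free approximation of the general idea of recalibrating a feature map first per channel and then per spatial location. The function name `channel_spatial_attention`, the use of global average pooling, and the sigmoid gates are all assumptions made for illustration; a trained module would normally include learned layers, which are omitted here.

```python
import math


def sigmoid(x):
    """Logistic function mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def channel_spatial_attention(feat):
    """Recalibrate a feature map with a channel gate, then a spatial gate.

    feat: list of C channels, each an H x W nested list of floats.
    Hypothetical simplification: no learned weights are used.
    """
    C, H, W = len(feat), len(feat[0]), len(feat[0][0])

    # Channel gate: global average pooling per channel, squashed by a sigmoid.
    channel_gate = [
        sigmoid(sum(sum(row) for row in ch) / (H * W)) for ch in feat
    ]
    scaled = [
        [[v * g for v in row] for row in ch]
        for ch, g in zip(feat, channel_gate)
    ]

    # Spatial gate: mean over channels at each location, squashed by a sigmoid.
    spatial_gate = [
        [sigmoid(sum(scaled[c][i][j] for c in range(C)) / C) for j in range(W)]
        for i in range(H)
    ]

    # Apply the spatial gate to every channel.
    return [
        [[scaled[c][i][j] * spatial_gate[i][j] for j in range(W)]
         for i in range(H)]
        for c in range(C)
    ]


# Toy input: 2 channels of size 2 x 2 (values chosen arbitrarily).
feat = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[0.0, 0.0], [0.0, 0.0]],
]
out = channel_spatial_attention(feat)
```

Because both gates lie strictly between 0 and 1, the output keeps the input shape and never exceeds the input in magnitude; the module only reweights features, it does not create new ones.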