Computer Engineering and Applications ›› 2025, Vol. 61 ›› Issue (17): 259-271. DOI: 10.3778/j.issn.1002-8331.2403-0179

• Graphics and Image Processing •

Textual Modality-Assisted RGB Salient Object Detection

HAN Chunyu, MA Jun, SHA Honghan, XIAO Xin, LU Chenkai, YAN Xin, ZHANG Xia

  1. State Key Laboratory of Information Photonics and Optical Communications, Beijing University of Posts and Telecommunications, Beijing 100876, China
    2. School of Electronic Engineering, Beijing University of Posts and Telecommunications, Beijing 100876, China
  • Online: 2025-09-01   Published: 2025-09-01

Textual Modality-Assisted RGB Salient Object Detection

HAN Chunyu, MA Jun, SHA Honghan, XIAO Xin, LU Chenkai, YAN Xin, ZHANG Xia

  1. State Key Laboratory of Information Photonics and Optical Communications, Beijing University of Posts and Telecommunications, Beijing 100876, China
    2. School of Electronic Engineering, Beijing University of Posts and Telecommunications, Beijing 100876, China

Abstract: Salient object detection aims to identify the most visually prominent objects in images or videos. To address the performance degradation in cluttered scenes with low foreground-background contrast, this paper proposes a salient object detection model assisted by textual modality information. The proposed model fuses RGB features with image captions generated by an image captioning network to capture the semantic information of most objects in the scene, thereby suppressing background noise and strengthening the network's representation of the target. A cross-modality guidance fusion module is introduced to fuse the text and RGB modalities effectively through self-interaction and mutual interaction. To address the tendency of global attention mechanisms to overlook detailed information, a hybrid attention module is proposed that models contextual information at both global and local levels, further improving prediction accuracy. Experiments on standard evaluation metrics, including mean absolute error (MAE), structural similarity (S), and weighted F-measure (FW), verify the effectiveness of the proposed model.
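The abstract describes two architectural components: a cross-modality guidance fusion module (self-interaction within each modality followed by mutual interaction across modalities) and a hybrid attention module (global plus local context modeling). The following is a minimal, hypothetical PyTorch sketch of how such components could be wired together, assuming flattened RGB feature tokens and caption token embeddings of a common dimension; the class names, dimensions, pooling strategy, and fusion rule are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of the two ideas named in the abstract; all names,
# shapes, and fusion rules are assumptions for illustration only.
import torch
import torch.nn as nn


class CrossModalityGuidanceFusion(nn.Module):
    """Fuse flattened RGB feature tokens with caption token embeddings."""

    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        # Self-interaction: each modality attends to its own tokens.
        self.rgb_self = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.txt_self = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Mutual interaction: RGB queries attend to text, and text to RGB.
        self.rgb_from_txt = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.txt_from_rgb = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.proj = nn.Linear(2 * dim, dim)

    def forward(self, rgb: torch.Tensor, txt: torch.Tensor) -> torch.Tensor:
        # rgb: (B, N_rgb, dim) flattened spatial features; txt: (B, N_txt, dim)
        rgb, _ = self.rgb_self(rgb, rgb, rgb)
        txt, _ = self.txt_self(txt, txt, txt)
        rgb_guided, _ = self.rgb_from_txt(rgb, txt, txt)
        txt_guided, _ = self.txt_from_rgb(txt, rgb, rgb)
        # Broadcast a pooled text summary over the spatial tokens before fusing.
        txt_global = txt_guided.mean(dim=1, keepdim=True).expand_as(rgb_guided)
        return self.proj(torch.cat([rgb_guided, txt_global], dim=-1))


class HybridAttention(nn.Module):
    """Combine global self-attention with a local depthwise-convolution branch."""

    def __init__(self, dim: int = 256, heads: int = 4, kernel: int = 3):
        super().__init__()
        self.global_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.local_conv = nn.Conv2d(dim, dim, kernel, padding=kernel // 2, groups=dim)
        self.norm = nn.LayerNorm(dim)

    def forward(self, tokens: torch.Tensor, h: int, w: int) -> torch.Tensor:
        # tokens: (B, h * w, dim) spatial tokens from the fused features.
        global_ctx, _ = self.global_attn(tokens, tokens, tokens)
        grid = tokens.transpose(1, 2).reshape(tokens.size(0), -1, h, w)
        local_ctx = self.local_conv(grid).flatten(2).transpose(1, 2)
        return self.norm(tokens + global_ctx + local_ctx)


if __name__ == "__main__":
    rgb = torch.randn(2, 14 * 14, 256)   # e.g. a 14x14 RGB feature map, flattened
    txt = torch.randn(2, 20, 256)        # e.g. 20 caption token embeddings
    fused = CrossModalityGuidanceFusion()(rgb, txt)
    out = HybridAttention()(fused, 14, 14)
    print(out.shape)  # torch.Size([2, 196, 256])
```

In this sketch, self-interaction and mutual interaction are both realized with standard multi-head attention, and the local branch of the hybrid attention is a depthwise convolution over the reshaped token grid; the paper's actual modules may differ in structure and detail.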

Key words: salient object detection, multimodal, neural networks, textual information

Abstract: Salient object detection is the process of finding the objectively most salient objects in images or videos. To address the degradation of detection performance in cluttered scenes with low foreground-background contrast, a salient object detection model assisted by textual information is proposed. The proposed model fuses RGB features with the caption text generated by an image captioning network to capture the semantic information of most objects in the scene, thereby suppressing background noise and strengthening the network's representation of the target. A cross-modality guidance fusion module is further proposed, which effectively fuses the text and RGB modalities through self-interaction and mutual interaction. To address the tendency of global attention mechanisms to overlook detailed information, a hybrid attention module is proposed that models contextual information at both global and local levels, further improving prediction accuracy. Experiments on common metrics such as mean absolute error (MAE), structural similarity (S), and weighted F-measure (FW) demonstrate the effectiveness of the proposed model.

Key words: salient object detection, multimodal, neural networks, textual information