Computer Engineering and Applications ›› 2025, Vol. 61 ›› Issue (21): 265-275. DOI: 10.3778/j.issn.1002-8331.2407-0468

• Graphics and Image Processing •

Text-Attributes Select Visual Token for Generalized Zero-Shot Image Recognition

YAN Wenshang, ZHANG Guimei   

  1. Jiangxi Province Key Laboratory of Image Processing and Pattern Recognition, Nanchang Hangkong University, Nanchang 330063, China
  • Online: 2025-11-01    Published: 2025-10-31

Abstract: Existing zero-shot learning methods fail to align semantic information with visual features effectively, and the visual features contain considerable redundant information, which leads to suboptimal accuracy in zero-shot and generalized zero-shot image recognition. To address this problem, this paper proposes a generalized zero-shot image recognition method in which text attributes select visual tokens. A large language model is used to generate discriminative semantic information in the form of text attributes. A class prior estimation module is introduced to compute a prior weight for each text attribute, which enhances the interpretability of the text attributes and improves model performance. The discriminative text attributes are then used to select their corresponding visual features, effectively removing redundant information from the visual features. Guided by the prior weights, the selected visual features are aligned with the text attributes in a cross-modal manner, enabling more precise and efficient visual-semantic interaction and thereby improving image recognition accuracy. Self-supervised generalized zero-shot image recognition experiments are conducted on three benchmark datasets (AWA2, CUB, SUN). The harmonic mean achieves the best result on both AWA2 and SUN, exceeding the second-best values by 1.1 and 0.8 percentage points, respectively, and ranks second on CUB. The experimental results demonstrate the effectiveness of the proposed method.

Key words: text-attributes, prior weights, select visual token, cross-modal alignment
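
The following is a minimal, illustrative PyTorch sketch of the mechanism outlined in the abstract: text attributes, weighted by class-prior estimates, select the visual tokens they respond to most strongly, and the resulting attribute responses are used for prior-weighted cross-modal scoring. This is an assumption-laden sketch rather than the authors' implementation; the function names (select_and_align, classify, harmonic_mean), the top-k selection rule, and all tensor shapes are hypothetical.

# Illustrative sketch only: attribute-guided visual-token selection and
# prior-weighted cross-modal alignment, based on the high-level description
# in the abstract. Names, shapes, and the scoring rule are assumptions.
import torch
import torch.nn.functional as F


def select_and_align(visual_tokens, attr_embeds, prior_weights, top_k=16):
    """
    visual_tokens: (B, N, D)  patch/token features from a visual backbone
    attr_embeds:   (A, D)     text-attribute embeddings (e.g., LLM-generated
                              attribute descriptions encoded by a text encoder)
    prior_weights: (A,)       per-attribute prior weights from a class prior
                              estimation step (assumed given here)
    Returns a (B, A) attribute-response score used for classification.
    """
    # Cross-modal similarity between every visual token and every attribute.
    sim = torch.einsum("bnd,ad->bna",
                       F.normalize(visual_tokens, dim=-1),
                       F.normalize(attr_embeds, dim=-1))          # (B, N, A)

    # Selection step: for each attribute, keep only its top-k most responsive
    # visual tokens, discarding redundant tokens.
    topk_sim, _ = sim.topk(k=min(top_k, sim.size(1)), dim=1)      # (B, k, A)
    attr_response = topk_sim.mean(dim=1)                          # (B, A)

    # Prior-weight-guided alignment: re-weight attribute responses.
    return attr_response * prior_weights.unsqueeze(0)             # (B, A)


def classify(attr_response, class_attr_matrix):
    """Score each class by compatibility with its attribute signature.
    class_attr_matrix: (C, A) binary/continuous class-attribute matrix."""
    return attr_response @ class_attr_matrix.t()                  # (B, C)


def harmonic_mean(acc_seen, acc_unseen):
    """Standard GZSL metric: harmonic mean of seen/unseen class accuracies."""
    return 2 * acc_seen * acc_unseen / (acc_seen + acc_unseen)

As a usage note, with AWA2-style annotations one would pass an (A, D) matrix of attribute text embeddings and a (C, A) class-attribute matrix; the harmonic_mean helper corresponds to the standard generalized zero-shot metric on which the abstract reports the 1.1 and 0.8 percentage-point gains.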