[1] AHMADILIVANI M H, TAHERI M, RAIK J, et al. A systematic literature review on hardware reliability assessment methods for deep neural networks[J]. ACM Computing Surveys, 2024, 56(6): 1-39.
[2] LAMPERT C H, NICKISCH H, HARMELING S. Learning to detect unseen object classes by between-class attribute transfer[C]//Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2009: 951-958.
[3] CHAO W L, CHANGPINYO S, GONG B Q, et al. An empirical study and analysis of generalized zero-shot learning for object recognition in the wild[C]//Proceedings of the 14th European Conference on Computer Vision. Cham: Springer, 2016: 52-68.
[4] 王泽深, 杨云, 向鸿鑫, 等. 零样本学习综述[J]. 计算机工程与应用, 2021, 57(19): 1-17.
WANG Z S, YANG Y, XIANG H X, et al. Survey on zero-shot learning[J]. Computer Engineering and Applications, 2021, 57(19): 1-17.
[5] XU J Z, DUAN S L, TANG C W, et al. Attribute localization and revision network for zero-shot learning[J]. arXiv:2310.07548, 2023.
[6] CHEN S M, HONG Z M, HOU W J, et al. TransZero: cross attribute-guided transformer for zero-shot learning[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023, 45(11): 12844-12861.
[7] LIU Y, ZHOU L, BAI X, et al. Goal-oriented gaze estimation for zero-shot learning[C]//Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2021: 3793-3802.
[8] ALAMRI F, DUTTA A. Multi-head self-attention via vision transformer for zero-shot learning[J]. arXiv:2108.00045, 2021.
[9] CHEN S M, HONG Z M, XIE G S, et al. MSDN: mutually semantic distillation network for zero-shot learning[C]//Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2022: 7602-7611.
[10] CHEN Z, HUANG Y F, CHEN J Y, et al. DUET: cross-modal semantic grounding for contrastive zero-shot learning[C]//Proceedings of the AAAI Conference on Artificial Intelligence, 2023: 405-413.
[11] LEI Y, SHENG G S, LI F F, et al. High-discriminative attribute feature learning for generalized zero-shot learning[J]. arXiv:2404.04953, 2024.
[12] YAMADA I, ASAI A, SAKUMA J, et al. Wikipedia2Vec: an efficient toolkit for learning and visualizing the embeddings of words and entities from Wikipedia[J]. arXiv:1812.06280, 2018.
[13] MANCINI M, NAEEM M F, XIAN Y Q, et al. Learning graph embeddings for open world compositional zero-shot learning[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024, 46(3): 1545-1560.
[14] XU W J, XIAN Y Q, WANG J N, et al. VGSE: visually-grounded semantic embeddings for zero-shot learning[C]//Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2022: 9306-9315.
[15] BUJWID S, SULLIVAN J. Large-scale zero-shot image classification from rich and diverse textual descriptions[J]. arXiv:2103.09669, 2021.
[16] NAEEM M F, XIAN Y Q, VAN GOOL L, et al. I2DFormer: learning image to document attention for zero-shot image classification[J]. arXiv:2209.10304, 2022.
[17] SHUBHO F H, CHOWDHURY T F, CHERAGHIAN A, et al. ChatGPT-guided semantics for zero-shot learning[C]//Proceedings of the 2023 International Conference on Digital Image Computing: Techniques and Applications. Piscataway: IEEE, 2023: 418-425.
[18] NAEEM M F, ALI KHAN M G Z, XIAN Y Q, et al. I2MVFormer: large language model generated multi-view document supervision for zero-shot image classification[C]//Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2023: 15169-15179.
[19] MIN B N, ROSS H, SULEM E, et al. Recent advances in natural language processing via large pre-trained language models: a survey[J]. ACM Computing Surveys, 2023, 56(2): 1-40.
[20] OUYANG L, WU J, XU J, et al. Training language models to follow instructions with human feedback[J]. arXiv:2203.02155, 2022.
[21] YANG Y, PANAGOPOULOU A, ZHOU S H, et al. Language in a bottle: language model guided concept bottlenecks for interpretable image classification[C]//Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2023: 19187-19197.
[22] PRATT S, COVERT I, LIU R, et al. What does a platypus look like? Generating customized prompts for zero-shot image classification[C]//Proceedings of the 2023 IEEE/CVF International Conference on Computer Vision. Piscataway: IEEE, 2023: 15645-15655.
[23] SRIVASTAVA P, GANU T, GUHA S. Towards zero-shot and few-shot table question answering using GPT-3[J]. arXiv:2210.17284, 2022.
[24] BELTAGY I, PETERS M E, COHAN A. Longformer: the long-document transformer[J]. arXiv:2004.05150, 2020.
[25] QU X Y, YU J, GAI K K, et al. Visual-semantic decomposition and partial alignment for document-based zero-shot learning[C]//Proceedings of the 32nd ACM International Conference on Multimedia. New York: ACM, 2024: 4581-4590.
[26] XIAN Y Q, LAMPERT C H, SCHIELE B, et al. Zero-shot learning: a comprehensive evaluation of the good, the bad and the ugly[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019, 41(9): 2251-2265.
[27] WAH C, BRANSON S, WELINDER P, et al. The Caltech-UCSD Birds-200-2011 dataset, CNS-TR-2011-001[R]. Pasadena: California Institute of Technology, 2011.
[28] PATTERSON G, HAYS J. SUN attribute database: discovering, annotating, and recognizing scene attributes[C]//Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2012: 2751-2758.
[29] RADFORD A, KIM J W, HALLACY C, et al. CLIP: learning transferable visual models from natural language supervision[C]//Proceedings of the International Conference on Machine Learning, 2021: 8748-8763.
[30] LIU M, LI F, ZHANG C J, et al. Progressive semantic-visual mutual adaption for generalized zero-shot learning[C]//Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2023: 15337-15346.
[31] CHAO W L, CHANGPINYO S, GONG B Q, et al. An empirical study and analysis of generalized zero-shot learning for object recognition in the wild[C]//Proceedings of the 14th European Conference on Computer Vision. Cham: Springer, 2016: 52-68.
[32] DOSOVITSKIY A, BEYER L, KOLESNIKOV A, et al. An image is worth 16x16 words: transformers for image recognition at scale[J]. arXiv:2010.11929, 2020.
[33] ALAMRI F, DUTTA A. Implicit and explicit attention for zero-shot learning[J]. arXiv:2110.00860, 2021.
[34] CHEN Z, ZHANG P F, LI J J, et al. Zero-shot learning by harnessing adversarial samples[C]//Proceedings of the 31st ACM International Conference on Multimedia. New York: ACM, 2023: 4138-4146.
[35] SONG K T, TAN X, QIN T, et al. MPNet: masked and permuted pre-training for language understanding[J]. arXiv:2004.09297, 2020.
[36] 马瑶, 智敏, 殷雁君, 等. CNN和Transformer在细粒度图像识别中的应用综述[J]. 计算机工程与应用, 2022, 58(19): 53-63.
MA Y, ZHI M, YIN Y J, et al. Review of applications of CNN and transformer in fine-grained image recognition[J]. Computer Engineering and Applications, 2022, 58(19): 53-63.