Computer Engineering and Applications ›› 2024, Vol. 60 ›› Issue (21): 164-171. DOI: 10.3778/j.issn.1002-8331.2402-0065

• Pattern Recognition and Artificial Intelligence •

Diverse Image Captioning via Sequence Variational Transformer and Contrastive Learning

LIU Mingming, LIU Bing, LIU Hao, ZHANG Haiyan   

  1. School of Intelligent Manufacturing, Jiangsu Vocational Institute of Architectural Technology, Xuzhou, Jiangsu 221116, China
    2. School of Computer Science and Technology, China University of Mining and Technology, Xuzhou, Jiangsu 221116, China
  • Online: 2024-11-01  Published: 2024-10-25

Abstract: In recent years, Transformer-based image captioning models have achieved significant performance gains. However, existing methods rely heavily on predefined metrics or the cross-entropy loss, which makes it difficult for them to generate natural and diverse captions. To this end, a sequence variational Transformer model is first introduced for diverse image captioning. Then, a variational lower bound on the mutual information within the image modality is maximized to alleviate mode collapse. Finally, the mutual information between the image and text modalities is maximized, seamlessly integrating the sequence variational Transformer with contrastive learning; this further strengthens the representation learning of the sequence variational encoder and promotes the generation of diverse captions. Quantitative and qualitative experiments are conducted on the MSCOCO benchmark dataset. When 100 captions are randomly generated, the accuracy metric CIDEr (consensus-based image description evaluation) improves by 5.5% and the diversity metric Div-2 (2-gram diversity) improves by 10.5% over the current state-of-the-art results.
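The objective described in the abstract combines two mutual-information terms. The following minimal PyTorch-style sketch shows how such an objective could be assembled: a sequence-level variational lower bound (token reconstruction plus KL regularization, targeting intra-modal mutual information and mode collapse) alongside an InfoNCE contrastive term, a standard lower bound on image-text mutual information. All names (elbo_loss, info_nce_loss), the unit-Gaussian prior, and the weights beta and tau are illustrative assumptions, not the authors' implementation.

import torch
import torch.nn.functional as F

def elbo_loss(logits, targets, mu, logvar, beta=1.0):
    # Sequence-level variational lower bound: token reconstruction plus a KL
    # regularizer between the posterior q(z|x) and a unit-Gaussian prior.
    # Maximizing this bound is one way to realize the intra-modal
    # mutual-information objective used to alleviate mode collapse
    # (hypothetical formulation, not the paper's exact loss).
    rec = F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return rec + beta * kl

def info_nce_loss(img_emb, txt_emb, tau=0.07):
    # InfoNCE with in-batch negatives: a standard lower bound on the
    # image-text mutual information, i.e. the contrastive term.
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / tau
    labels = torch.arange(logits.size(0), device=logits.device)
    # Symmetric over image-to-text and text-to-image directions.
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))

A training step would then minimize a weighted sum such as elbo_loss(...) + lam * info_nce_loss(...), where the weight lam is likewise an assumed hyperparameter.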

Key words: image understanding, image captioning, contrastive learning, mutual information