Computer Engineering and Applications ›› 2025, Vol. 61 ›› Issue (14): 264-273. DOI: 10.3778/j.issn.1002-8331.2405-0427

• Graphics and Image Processing •


Research on Improving Text-to-Image Generation Method Through Contrastive Learning

ZHAO Hong, WANG He, LI Wengai

  1. College of Computer and Communication, Lanzhou University of Technology, Lanzhou 730050, China
  • Online: 2025-07-15  Published: 2025-07-15


Abstract: Existing text-to-image generation methods rely solely on a semantic-similarity loss between images and texts as a constraint, which makes it difficult for the model to learn the relationship between an image and its multiple corresponding texts and results in low semantic matching between generated images and texts. To address this problem, this paper proposes a method that improves the text-to-image generation model through contrastive learning. During training, a contrastive loss is computed between images generated from different texts describing the same image, so that the model learns the different textual representations of one image, improving the semantic consistency between generated images and texts. At the same time, a contrastive loss between generated images and real images is computed to push the generated images closer to the real ones. In the generator, a new feature fusion module is designed that uses the attention map as a condition to guide the alignment of image features with text features, thereby improving the detail of the generated images. Experimental results show that, compared with the baseline model, the Inception Score on the CUB dataset increases by 7.32% and the Fréchet Inception Distance decreases by 21.06%, while the Fréchet Inception Distance on the COCO dataset decreases by 36.43%. The images generated by this method thus exhibit better text semantic consistency and realism.
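The two contrastive objectives described in the abstract (between generated images of different texts for the same image, and between generated and real images) can be sketched with a generic InfoNCE-style loss over image embeddings. This is a minimal illustrative sketch, not the paper's actual implementation; the function name, the use of NumPy, and the temperature value are all assumptions made for the example:

```python
import numpy as np

def info_nce_loss(anchors, positives, temperature=0.1):
    """InfoNCE-style contrastive loss over two batches of embeddings.

    anchors[i] and positives[i] form a positive pair (e.g. two generated
    images from different captions of the same real image, or a generated
    image and its corresponding real image); all other rows in the batch
    act as negatives.
    """
    # L2-normalize so that dot products become cosine similarities
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = (a @ p.T) / temperature                 # (N, N) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)      # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # the positive pairs lie on the diagonal of the similarity matrix
    return -np.mean(np.diag(log_prob))
```

When each anchor matches its own positive, the loss is near zero; when positives are misassigned, the loss is large, which is what drives the generated images toward their text-matched counterparts during training.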

Key words: text-to-image generation, generative adversarial network (GAN), contrastive learning, feature fusion, semantic consistency