Computer Engineering and Applications ›› 2024, Vol. 60 ›› Issue (17): 224-232. DOI: 10.3778/j.issn.1002-8331.2312-0186

• Graphics and Image Processing •

Text-to-Image Generation Method Based on Attention and Dynamic Memory Module

ZHANG He, LEI Haopeng, WANG Mingwen, ZHANG Shangkun   

  1. School of Computer and Information Engineering, Jiangxi Normal University, Nanchang 330022, China
  • Online: 2024-09-01  Published: 2024-08-30

Abstract: To address the problems of multi-stage generative models in the text-to-image generation task, such as missing image texture features and poor consistency between text descriptions and generated images, this paper proposes a novel generative adversarial network model (ADM-GAN) that is optimized with attention and dynamic memory modules. In the initial stage, a text encoder converts the text description into embedding vectors, and a generator combines them with random noise to produce a low-resolution image. Spatial attention and channel attention modules are then introduced to fuse the hidden features of the low-resolution image with salient word-level semantic features, ensuring consistency between the text description and the image features. Finally, a dynamic memory module captures the semantic correspondence between text and image, dynamically adjusts the memory content during generation, and refines the image texture to improve text-to-image synthesis. In comparative experiments on the public CUB and COCO datasets, the proposed model significantly improves the Fréchet inception distance and inception score over previous methods, showing that it alleviates missing image details and the loss of semantic information to some extent, effectively improves image-text consistency, and achieves better results.
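The abstract describes fusing low-resolution image features with word-level semantics via spatial attention, and with sentence-level semantics via channel attention. The PyTorch sketch below illustrates one plausible form of these two modules; the class names, tensor dimensions, and residual-fusion choice are illustrative assumptions, not the paper's actual ADM-GAN implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class WordLevelSpatialAttention(nn.Module):
    """Fuse word embeddings into an image feature map: each spatial
    location attends over the words of the description (a common
    word-level attention pattern; dimensions are illustrative)."""

    def __init__(self, img_channels: int, word_dim: int):
        super().__init__()
        # Project word embeddings into the image feature space.
        self.proj = nn.Linear(word_dim, img_channels)

    def forward(self, img_feat: torch.Tensor, words: torch.Tensor) -> torch.Tensor:
        # img_feat: (B, C, H, W); words: (B, T, D) word embeddings.
        b, c, h, w = img_feat.shape
        keys = self.proj(words)                          # (B, T, C)
        queries = img_feat.view(b, c, h * w)             # (B, C, HW)
        attn = torch.bmm(keys, queries)                  # (B, T, HW) similarities
        attn = F.softmax(attn, dim=1)                    # each pixel attends over words
        context = torch.bmm(keys.transpose(1, 2), attn)  # (B, C, HW) word context
        # Residual fusion of text context with image features.
        return img_feat + context.view(b, c, h, w)


class SentenceChannelAttention(nn.Module):
    """Gate image channels with a sentence embedding (a simple
    channel-attention sketch conditioned on the whole description)."""

    def __init__(self, img_channels: int, sent_dim: int):
        super().__init__()
        self.gate = nn.Linear(sent_dim, img_channels)

    def forward(self, img_feat: torch.Tensor, sent: torch.Tensor) -> torch.Tensor:
        # img_feat: (B, C, H, W); sent: (B, S) sentence embedding.
        g = torch.sigmoid(self.gate(sent))               # (B, C) per-channel gates
        return img_feat * g.unsqueeze(-1).unsqueeze(-1)
```

In a multi-stage generator, modules like these would sit between the initial low-resolution stage and the refinement stages, so that later stages receive text-conditioned features rather than raw image features.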

Key words: text-to-image generation, generative adversarial network, attention mechanism, dynamic memory