Computer Engineering and Applications ›› 2022, Vol. 58 ›› Issue (3): 249-258.DOI: 10.3778/j.issn.1002-8331.2009-0194

• Graphics and Image Processing •

Text-to-Single Image Method Based on Self-Attention

JU Sibo, XU Jing, LI Yanfang   

  1. School of Computer Science and Technology, Changchun University of Science and Technology, Changchun 130022, China
  • Online: 2022-02-01  Published: 2022-01-28


Abstract: Text-to-image synthesis is drawing increasing attention in the artificial intelligence field. Benefiting from generative adversarial networks (GANs), the field has made remarkable progress in high-resolution image synthesis. However, single-target synthesis still falls short of natural appearance, for example producing abnormal compositions in bird images. To address this issue, SA-AttnGAN, a single-target text-to-image model based on the self-attention mechanism, is proposed. To improve the semantic alignment between text and image, it refines the text vector into both word-level and sentence-level features. Self-attention is applied in the initial stage of AttnGAN to increase stability during image generation, and multi-stage GANs are stacked to synthesize high-resolution images. Experiments show that the proposed model outperforms the compared models on Inception Score and Fréchet Inception Distance. Analysis of the synthesized images demonstrates that SA-AttnGAN not only learns background and color information but also correctly captures the structure of a bird's head, beak, and other parts, effectively alleviating the "multi-head" and "multi-beak" artifacts produced by AttnGAN. Additionally, SA-AttnGAN is successfully extended to synthesizing clothing images from Chinese descriptions, demonstrating the model's adaptability and generalization ability.
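The abstract does not include code, but the self-attention step it describes follows the standard SAGAN-style formulation: queries and keys are projected to a lower dimension, an attention map over all spatial positions is computed with a softmax, and the attended features are blended back into the input through a learnable scale that starts at zero. The following is a minimal NumPy sketch of that mechanism; the function and parameter names (`self_attention`, `wq`, `wk`, `wv`, `gamma`) are illustrative and not taken from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, wq, wk, wv, gamma=0.0):
    """SAGAN-style self-attention over a flattened feature map.

    x:  (N, C)  feature map flattened to N = H*W spatial positions
    wq: (C, C8) query projection (C8 < C, e.g. C // 8)
    wk: (C, C8) key projection
    wv: (C, C)  value projection
    gamma: learnable blend weight; initialized to 0 so training
           starts from the plain (non-attentive) features.
    """
    q = x @ wq                  # (N, C8) queries
    k = x @ wk                  # (N, C8) keys
    v = x @ wv                  # (N, C)  values
    attn = softmax(q @ k.T)     # (N, N) attention over all positions
    out = attn @ v              # features aggregated from the whole map
    return gamma * out + x      # residual blend with the input
```

Because every position attends to every other position, long-range structure (e.g. the relation between a bird's head and body) can be modeled in the earliest generator stage, which is the stabilizing role the abstract attributes to self-attention.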

Key words: text-to-image, generative adversarial networks(GAN), deep learning, computer vision, artificial intelligence(AI)

