Text-to-Single Image Method Based on Self-Attention

doi:10.3778/j.issn.1002-8331.2009-0194

Abstract

Image synthesis from natural language descriptions has become an active research topic in artificial intelligence. With generative adversarial networks (GANs), the field has made considerable progress in high-resolution image synthesis. However, synthesized single-object images still fall short in realism: bird images, for example, can show anomalies such as "multi-head" or "multi-mouth" compositions. To address this problem, SA-AttnGAN, a text-to-single-object image model based on the self-attention mechanism, is proposed. SA-AttnGAN refines the text features into word-level and sentence-level features to improve the semantic alignment between text and image; it applies self-attention in the initial stage of AttnGAN to improve the stability of the text-to-image model; and it stacks multi-stage GANs to synthesize the final high-resolution image. Experimental results show that SA-AttnGAN outperforms the compared models on the Inception Score and Frechet Inception Distance metrics. Analysis of the synthesized images shows that the model learns not only background and color information but also correctly captures structural information about components such as the bird's head and mouth, reducing the "multi-head" and "multi-mouth" errors produced by AttnGAN. In addition, SA-AttnGAN is successfully applied to clothing image synthesis from Chinese descriptions, which indicates good generalization ability.

Key words: text-to-image, generative adversarial networks (GAN), deep learning, computer vision, artificial intelligence (AI)
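
The abstract names self-attention as the component added to the initial stage of AttnGAN but does not give its formulation. As a rough illustration only, the following is a minimal sketch of a generic self-attention layer over a flattened feature map with a learnable residual weight, in the spirit of self-attention GANs. It is not the authors' implementation: the function names, the projection matrices `w_q`, `w_k`, `w_v`, the residual weight `gamma`, and the toy sizes are all assumptions made for this example.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x, w_q, w_k, w_v, gamma):
    """Generic self-attention sketch (hypothetical helper).

    x: N spatial positions, each with C channels (N x C).
    Returns the residual output (N x C) and the N x N attention map.
    """
    q = matmul(x, w_q)                      # queries, N x d
    k = matmul(x, w_k)                      # keys, N x d
    v = matmul(x, w_v)                      # values, N x C
    k_t = [list(col) for col in zip(*k)]    # transpose keys to d x N
    scores = matmul(q, k_t)                 # pairwise similarities, N x N
    attn = [softmax(r) for r in scores]     # each position attends to all others
    ctx = matmul(attn, v)                   # attention-weighted values, N x C
    # Residual connection: gamma scales how much global context is mixed in.
    out = [[gamma * c + xi for c, xi in zip(c_row, x_row)]
           for c_row, x_row in zip(ctx, x)]
    return out, attn


# Toy example: a 2x2 feature map flattened to N=4 positions, C=3 channels.
random.seed(0)
N, C, D = 4, 3, 2
x = [[random.gauss(0, 1) for _ in range(C)] for _ in range(N)]
w_q = [[random.gauss(0, 1) for _ in range(D)] for _ in range(C)]
w_k = [[random.gauss(0, 1) for _ in range(D)] for _ in range(C)]
w_v = [[random.gauss(0, 1) for _ in range(C)] for _ in range(C)]

out_zero, attn = self_attention(x, w_q, w_k, w_v, gamma=0.0)
out_mixed, _ = self_attention(x, w_q, w_k, w_v, gamma=0.5)
```

With `gamma` set to 0 the layer passes its input through unchanged, so a network can start from purely local features and gradually learn to mix in global context. This is one plausible reason such a layer might help a generator keep parts like a bird's head and mouth globally consistent, although the abstract itself only claims improved stability.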

JU Sibo, XU Jing, LI Yanfang. Text-to-Single Image Method Based on Self-Attention[J]. Computer Engineering and Applications, 2022, 58(3): 249-258.
