Survey About Generative Adversarial Network Based Text-to-Image Synthesis

doi:10.3778/j.issn.1002-8331.2205-0119

Abstract

With the rapid development of deep learning, text-to-image synthesis based on generative adversarial networks (GANs) has become a hot topic in computer vision research. A GAN consists of two neural networks, a generator and a discriminator, which compete against each other so that the generator learns to produce realistic data. Inspired by GANs, a series of text-to-image synthesis models have been proposed in recent years, with steady breakthroughs in image quality, diversity, and semantic consistency. To promote research in this field, a comprehensive overview of existing text-to-image synthesis techniques is presented. The models are categorized by text encoding, direct text-to-image synthesis, and text-guided image synthesis, and the model frameworks and key contributions of representative GAN-based models are discussed in detail. The existing evaluation metrics and commonly used datasets are analyzed, and the shortcomings and future trends of existing methods are identified with respect to complex scenes and texts, multimodality, lightweight models, and model evaluation methods. The current development of GANs in various fields is summarized, with a focus on applications in text-to-image synthesis. The survey can serve as a reference for researchers weighing and selecting deep learning methods for image synthesis.

Key words: text-to-image synthesis, generative adversarial network, text encoding, deep learning

WANG Wei, LI Yujie, GUO Fulin, LIU Yan, HE Junlin. Survey About Generative Adversarial Network Based Text-to-Image Synthesis[J]. Computer Engineering and Applications, 2022, 58(19): 14-36.
