Survey About Generative Adversarial Network and Text-to-Image Synthesis

doi:10.3778/j.issn.1002-8331.2211-0392

Abstract

With the spread of multi-sensor systems, multi-modal data has drawn sustained attention from both research and industry, and deep learning techniques for processing multi-source modal information are central to this work. Text-to-image generation is one direction of multi-modal technology. Because generative adversarial networks (GANs) produce more realistic images, text-to-image generation has made significant progress, with applications in image editing and colorization, style transfer, object deformation, and photo enhancement. This review divides GANs, by image generation function, into four categories: semantic-enhanced GAN, growable GAN, diversity-enhanced GAN, and clarity-enhanced GAN. Following this taxonomy, function-based text-to-image generation models are consolidated and compared to trace how the field has developed. Existing evaluation metrics and commonly used datasets are analyzed, and the feasibility of handling complex text, along with future development trends, is discussed. The review systematically supplements existing analyses of generative adversarial networks for text-to-image generation and should help researchers advance the field.

Keywords: multi-modal, generative adversarial network (GAN), text-to-image synthesis, deep learning


LAI Li’na, MI Yu, ZHOU Longlong, RAO Jiyong, XU Tianyang, SONG Xiaoning. Survey About Generative Adversarial Network and Text-to-Image Synthesis[J]. Computer Engineering and Applications, 2023, 59(19): 21-39.

