Overview of Text-to-Image Generation Methods Based on Deep Learning

doi:10.3778/j.issn.1002-8331.2112-0151

Abstract

Text-to-image generation learns a mapping between natural language and the features of an image set, so that images can be generated from natural language descriptions and language attributes can be used to express visual content in a general way. Deep learning based on convolutional neural networks is currently the mainstream approach to text-to-image generation. To give a systematic view of the research status and development trends of the field, this paper divides existing methods into six categories according to how their models are constructed and implemented: direct text-to-image methods, stacked architecture methods, attention mechanism methods, cycle consistency methods, adapting unconditional model methods, and additional supervision methods. Each category is summarized in turn, covering its construction ideas, model characteristics, advantages, and limitations, and the main evaluation indicators are analyzed and compared. Finally, the challenges and future prospects of the technology are discussed in terms of model methods, evaluation methods, and technological improvements.

Keywords: text-to-image generation method, deep learning, convolutional neural network, evaluation indicator

WANG Yuhao, HE Yu, WANG Zhu. Overview of Text-to-Image Generation Methods Based on Deep Learning[J]. Computer Engineering and Applications, 2022, 58(10): 50-67.
