A Survey of Text-to-Image Synthesis Algorithms Based on Generative Adversarial Networks

doi:10.3778/j.issn.1002-8331.2204-0441

Abstract

Generative adversarial networks (GANs) are an important method for image synthesis and are currently the most widely used approach to text-to-image synthesis. As research on cross-modal generation has deepened, the realism and semantic relevance of images generated from text have improved greatly. Good results have been achieved both in synthesizing natural images such as flowers, birds, and human faces and in synthesizing scene graphs and layouts. Text-to-image synthesis still faces challenges: it is difficult to generate multiple objects in a complex scene, and existing evaluation metrics cannot accurately assess newly proposed text-to-image algorithms, so new metrics are needed. This paper reviews the development of text-to-image methods since they were first proposed and lists the algorithms proposed in recent years, the commonly used datasets, and the evaluation metrics. Finally, it discusses the open problems in datasets, metrics, algorithms, and applications, and outlines directions for future research.

Key words: image synthesis, generative adversarial networks, text-to-image synthesis

DENG Bo, HE Chunlin, XU Liming, SONG Lanyu. Text-to-Image Synthesis: Survey of State-of-the-Art[J]. Computer Engineering and Applications, 2022, 58(23): 42-55.
