Computer Engineering and Applications, 2022, Vol. 58, Issue (10): 50-67. DOI: 10.3778/j.issn.1002-8331.2112-0151

• Research Hotspots and Reviews •

Overview of Text-to-Image Generation Methods Based on Deep Learning

WANG Yuhao, HE Yu, WANG Zhu   

  1. Guizhou Tianyan Juheng Technology Co., Ltd., Guiyang, Guizhou 550081, China
  2. College of Earth and Space Sciences, Peking University, Beijing 100871, China
  3. College of Geography & Environmental Science, Guizhou Normal University, Guiyang 550025, China
  • Online: 2022-05-15 Published: 2022-05-15




Abstract: Text-to-image generation methods map natural language to the features of an image set, so that images can be generated from natural language descriptions and language attributes can be used to express visual images in a general way. Deep learning based on convolutional neural networks is currently the mainstream approach to text-to-image generation. To give a systematic account of the research status and development trends in this field, existing methods are divided into six categories according to how the models are constructed and implemented: direct text-to-image methods, stacked architecture methods, attention mechanism methods, cycle consistency methods, adapting unconditional model methods, and additional supervision methods. Each category is summarized and discussed in terms of its construction ideas, model characteristics, advantages, and limitations, and the main evaluation indicators are analyzed and compared. Finally, the challenges and future prospects of the technology are discussed with respect to model methods, evaluation methods, and technical improvements.

Keywords: text-to-image generation, deep learning, convolutional neural network, evaluation indicator
