Computer Engineering and Applications ›› 2023, Vol. 59 ›› Issue (19): 21-39.DOI: 10.3778/j.issn.1002-8331.2211-0392

• Research Hotspots and Reviews •

Survey About Generative Adversarial Network and Text-to-Image Synthesis

LAI Li’na, MI Yu, ZHOU Longlong, RAO Jiyong, XU Tianyang, SONG Xiaoning   

  1. School of Artificial Intelligence and Computer Science, Jiangnan University, Wuxi, Jiangsu 214122, China
  • Online: 2023-10-01 Published: 2023-10-01




Abstract: With the spread of multi-sensor systems, multi-modal data has drawn sustained attention from both research and industry, and deep learning techniques for processing multi-source modal information are at the core of this work. Text-to-image generation is one direction of multi-modal technology. Because images produced by generative adversarial networks (GANs) look more realistic, text-to-image generation has made notable progress. It can be applied in many fields, including image editing and colorization, style transfer, object deformation, and photo enhancement. This review divides GANs, according to their image generation function, into four categories: semantic-enhanced GAN, growable GAN, diversity-enhanced GAN, and clarity-enhanced GAN. Following the directions given by this taxonomy, function-based text-to-image generation models are consolidated and compared to trace how the field has developed. Existing evaluation metrics and commonly used datasets are analyzed, and the feasibility and future trends of topics such as complex text processing are discussed. The review systematically supplements existing analyses of GANs in text-to-image generation and is intended to help researchers advance the field.

Key words: multi-modal, generative adversarial network (GAN), text-to-image synthesis, deep learning
