计算机工程与应用 (Computer Engineering and Applications), 2022, Vol. 58, Issue 10: 50-67. DOI: 10.3778/j.issn.1002-8331.2112-0151
Overview of Text-to-Image Generation Methods Based on Deep Learning
WANG Yuhao (王宇昊), HE Yu (何彧), WANG Zhu (王铸)
Online: 2022-05-15
Published: 2022-05-15
Abstract: Text-to-image generation methods learn a mapping between natural-language features and image-set features, generating images that correspond to a natural-language description and using linguistic attributes to produce general-purpose visual representations. Deep learning based on convolutional neural networks is currently the mainstream approach to text-to-image generation. To give a systematic account of the research status and development trends in this field, existing methods are grouped into six categories according to how the models are constructed and technically implemented: direct image methods, hierarchical architecture methods, attention mechanism methods, cycle-consistency methods, adaptive unconditional model methods, and additional supervision methods. Each category is summarized and discussed in terms of its design ideas, model characteristics, advantages, and limitations; the main evaluation metrics are then analyzed and compared; finally, the challenges facing this technology and its future prospects are discussed with respect to model methods, evaluation methods, and technical improvements.
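To make the conditioning idea described in the abstract concrete, below is a minimal PyTorch sketch of a text-conditioned GAN generator in the spirit of early direct-image text-to-image GANs (e.g. Reed et al.'s generative adversarial text-to-image synthesis). It is an illustrative assumption, not the implementation of any surveyed model: the layer sizes, the class name `TextConditionedGenerator`, and the random placeholder sentence embeddings standing in for a real text encoder are all hypothetical choices.

```python
# Minimal sketch: a GAN generator that maps (noise, sentence embedding) -> 64x64 RGB image.
# All dimensions and names are illustrative; a real pipeline would pair this with a
# discriminator and a trained text encoder.
import torch
import torch.nn as nn


class TextConditionedGenerator(nn.Module):
    def __init__(self, embed_dim=256, z_dim=100, cond_dim=128, ngf=64):
        super().__init__()
        # Compress the sentence embedding into a low-dimensional conditioning vector.
        self.condition = nn.Sequential(nn.Linear(embed_dim, cond_dim), nn.LeakyReLU(0.2))
        # Upsample the concatenated (noise + condition) vector to a 64x64 image.
        self.net = nn.Sequential(
            nn.ConvTranspose2d(z_dim + cond_dim, ngf * 8, 4, 1, 0), nn.BatchNorm2d(ngf * 8), nn.ReLU(True),  # 4x4
            nn.ConvTranspose2d(ngf * 8, ngf * 4, 4, 2, 1), nn.BatchNorm2d(ngf * 4), nn.ReLU(True),           # 8x8
            nn.ConvTranspose2d(ngf * 4, ngf * 2, 4, 2, 1), nn.BatchNorm2d(ngf * 2), nn.ReLU(True),           # 16x16
            nn.ConvTranspose2d(ngf * 2, ngf, 4, 2, 1), nn.BatchNorm2d(ngf), nn.ReLU(True),                   # 32x32
            nn.ConvTranspose2d(ngf, 3, 4, 2, 1), nn.Tanh(),                                                  # 64x64
        )

    def forward(self, z, text_embedding):
        cond = self.condition(text_embedding)                          # (B, cond_dim)
        x = torch.cat([z, cond], dim=1).unsqueeze(-1).unsqueeze(-1)    # (B, z_dim + cond_dim, 1, 1)
        return self.net(x)                                             # (B, 3, 64, 64)


if __name__ == "__main__":
    g = TextConditionedGenerator()
    z = torch.randn(4, 100)          # noise vectors
    text_emb = torch.randn(4, 256)   # placeholder sentence embeddings from a text encoder
    print(g(z, text_emb).shape)      # torch.Size([4, 3, 64, 64])
```

Hierarchical, attention-based, and cycle-consistency methods discussed in the paper build on this same conditioning scheme, refining it with stacked generators, word-level attention, or image-to-text re-description losses.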
王宇昊, 何彧, 王铸. 基于深度学习的文本到图像生成方法综述[J]. 计算机工程与应用, 2022, 58(10): 50-67.
WANG Yuhao, HE Yu, WANG Zhu. Overview of Text-to-Image Generation Methods Based on Deep Learning[J]. Computer Engineering and Applications, 2022, 58(10): 50-67.