
Computer Engineering and Applications, 2024, Vol. 60, Issue (24): 44-64. DOI: 10.3778/j.issn.1002-8331.2405-0048
GAO Xinyu (高欣宇), DU Fang (杜方), SONG Lijuan (宋丽娟)
Online: 2024-12-15
Published: 2024-12-12
Abstract: With the continued development of deep learning, artificial intelligence generated content (AIGC) has become a prominent research topic; in particular, diffusion models, an emerging class of generative models, have made remarkable progress in text-to-image generation. This survey comprehensively describes the application of diffusion models to the text-to-image generation task and, through a comparative analysis against generative adversarial networks and autoregressive models, reveals the advantages and limitations of diffusion models. It further examines concrete approaches for improving image quality, optimizing model efficiency, and supporting multilingual text-to-image generation. Experimental analyses on the CUB, COCO, and T2I-CompBench datasets not only verify the zero-shot generation capability of diffusion models but also highlight their ability to generate high-quality images from complex text prompts. Application prospects of diffusion models in text-guided image editing, 3D generation, video generation, and medical image generation are introduced. Finally, the challenges facing diffusion models in text-to-image generation and future development trends are summarized, helping researchers advance this field in greater depth.
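For context, the following is a minimal sketch of the standard denoising diffusion probabilistic model (DDPM) formulation and of classifier-free guidance, on which the text-to-image systems discussed in this survey build; the notation is the common one from the literature and is offered as background, not as material taken from this paper.

q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t;\ \sqrt{1-\beta_t}\, x_{t-1},\ \beta_t \mathbf{I}\right), \qquad q(x_t \mid x_0) = \mathcal{N}\!\left(x_t;\ \sqrt{\bar{\alpha}_t}\, x_0,\ (1-\bar{\alpha}_t)\,\mathbf{I}\right), \quad \bar{\alpha}_t = \prod_{s=1}^{t}(1-\beta_s)

L_{\text{simple}} = \mathbb{E}_{x_0,\ \epsilon \sim \mathcal{N}(0,\mathbf{I}),\ t}\left[\,\left\| \epsilon - \epsilon_\theta(x_t, t, c) \right\|^2\,\right]

\tilde{\epsilon}_\theta(x_t, t, c) = (1 + w)\,\epsilon_\theta(x_t, t, c) - w\,\epsilon_\theta(x_t, t, \varnothing)

Here x_0 is a clean image (or its latent), \beta_t is the noise schedule, c is the text condition (for example a CLIP or T5 text embedding), \varnothing denotes the unconditional (empty-prompt) input, and w \ge 0 is the guidance scale that trades prompt fidelity against sample diversity.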
GAO Xinyu, DU Fang, SONG Lijuan. Comparative Review of Text-to-Image Generation Techniques Based on Diffusion Models[J]. Computer Engineering and Applications, 2024, 60(24): 44-64.
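As a concrete illustration of the latent-diffusion text-to-image pipelines surveyed above, the snippet below is a minimal usage sketch built on the open-source diffusers library with a publicly released Stable Diffusion checkpoint; the model identifier, prompt, guidance scale, and step count are illustrative choices, not settings reported in this paper.

# Minimal text-to-image sampling sketch with the diffusers library.
# Assumes: pip install diffusers transformers torch, and a CUDA-capable GPU.
import torch
from diffusers import StableDiffusionPipeline

# Load a publicly released latent diffusion checkpoint
# (illustrative choice; substitute any compatible checkpoint).
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    torch_dtype=torch.float16,
)
pipe = pipe.to("cuda")

prompt = "a small bird with a red head perched on a flowering branch"

# guidance_scale is the classifier-free guidance weight; the values
# below are typical defaults rather than tuned settings.
image = pipe(
    prompt,
    num_inference_steps=50,
    guidance_scale=7.5,
).images[0]

image.save("bird.png")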