[1] LIU Y C, SHU Z X, LI Y J, et al. Content-aware GAN compression[C]//Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2021: 12151-12161.
[2] YANG Z, HU Z T, SALAKHUTDINOV R, et al. Improved variational autoencoders for text modeling using dilated convolutions[C]//Proceedings of the 34th International Conference on Machine Learning, 2017: 3881-3890.
[3] BAHDANAU D, CHO K, BENGIO Y. Neural machine translation by jointly learning to align and translate[J]. arXiv:1409.0473, 2014.
[4] GOODFELLOW I, POUGET-ABADIE J, MIRZA M, et al. Generative adversarial networks[J]. Communications of the ACM, 2020, 63(11): 139-144.
[5] REED S, AKATA Z, YAN X, et al. Generative adversarial text to image synthesis[C]//Proceedings of the 33rd International Conference on Machine Learning, 2016: 1060-1069.
[6] ZHANG H, XU T, LI H S, et al. StackGAN: text to photo-realistic image synthesis with stacked generative adversarial networks[C]//Proceedings of the 2017 IEEE International Conference on Computer Vision. Piscataway: IEEE, 2017: 5908-5916.
[7] ZHANG H, XU T, LI H S, et al. StackGAN++: realistic image synthesis with stacked generative adversarial networks[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019, 41(8): 1947-1962.
[8] XU T, ZHANG P C, HUANG Q Y, et al. AttnGAN: fine-grained text to image generation with attentional generative adversarial networks[C]//Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2018: 1316-1324.
[9] REED S, AKATA Z, MOHAN S, et al. Learning what and where to draw[C]//Proceedings of the 30th International Conference on Neural Information Processing Systems, 2016: 217-225.
[10] TAN H C, LIU X P, LI X, et al. Semantics-enhanced adversarial nets for text-to-image synthesis[C]//Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision. Piscataway: IEEE, 2019: 10500-10509.
[11] LI B, QI X, LUKASIEWICZ T, et al. Controllable text-to-image generation[C]//Proceedings of the 33rd International Conference on Neural Information Processing Systems, 2019: 2065-2075.
[12] ZHU M F, PAN P B, CHEN W, et al. DM-GAN: dynamic memory generative adversarial networks for text-to-image synthesis[C]//Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2019: 5795-5803.
[13] QIAO T T, ZHANG J, XU D Q, et al. MirrorGAN: learning text-to-image generation by redescription[C]//Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2019: 1505-1514.
[14] CHENG J, WU F X, TIAN Y L, et al. RiFeGAN: rich feature generation for text-to-image synthesis from prior knowledge[C]//Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2020: 10908-10917.
[15] YANG Y H, WANG L, XIE D, et al. Multi-sentence auxiliary adversarial networks for fine-grained text-to-image synthesis[J]. IEEE Transactions on Image Processing, 2021, 30: 2798-2809.
[16] LIAO W T, HU K, YANG M Y, et al. Text to image generation with semantic-spatial aware GAN[C]//Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2022: 18166-18175.
[17] MO J W, XU K L, LIN L P, et al. Text-to-image generation combined with mutual information maximization[J]. Journal of Xidian University, 2019, 46(5): 180-188.
[18] SUN Y, LI L Y, YE Z H, et al. Text-to-image synthesis method based on multi-level structure generative adversarial networks[J]. Journal of Computer Applications, 2019, 39(11): 3204-3209.
[19] YIN G J, LIU B, SHENG L, et al. Semantics disentangling for text-to-image generation[C]//Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2019: 2322-2331.
[20] LI J F, WEN Y, HE L H. SCConv: spatial and channel reconstruction convolution for feature redundancy[C]//Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2023: 6153-6162.
[21] WU Y, HE K. Group normalization[C]//Proceedings of the European Conference on Computer Vision, 2018: 3-19.
[22] SCHUSTER M, PALIWAL K K. Bidirectional recurrent neural networks[J]. IEEE Transactions on Signal Processing, 1997, 45(11): 2673-2681.
[23] KRIZHEVSKY A, SUTSKEVER I, HINTON G E. ImageNet classification with deep convolutional neural networks[J]. Communications of the ACM, 2017, 60(6): 84-90.
[24] SZEGEDY C, VANHOUCKE V, IOFFE S, et al. Rethinking the inception architecture for computer vision[C]//Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2016: 2818-2826.
[25] SANDLER M, HOWARD A, ZHU M L, et al. MobileNetV2: inverted residuals and linear bottlenecks[C]//Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2018: 4510-4520.
[26] BALLES L, HENNIG P. Dissecting Adam: the sign, magnitude and variance of stochastic gradients[C]//Proceedings of the International Conference on Machine Learning, 2018: 404-413.
[27] WAH C, BRANSON S, WELINDER P, et al. The Caltech-UCSD Birds-200-2011 dataset[R]. California Institute of Technology, 2011.
[28] LIN T Y, MAIRE M, BELONGIE S, et al. Microsoft COCO: common objects in context[C]//Proceedings of the European Conference on Computer Vision. Cham: Springer, 2014: 740-755.
[29] SALIMANS T, GOODFELLOW I, ZAREMBA W, et al. Improved techniques for training GANs[C]//Proceedings of the 30th International Conference on Neural Information Processing Systems, 2016: 2234-2242.
[30] HEUSEL M, RAMSAUER H, UNTERTHINER T, et al. GANs trained by a two time-scale update rule converge to a local Nash equilibrium[C]//Proceedings of the 31st International Conference on Neural Information Processing Systems, 2017: 6629-6640.
[31] VAN ERVEN T, HARREMOS P. Rényi divergence and Kullback-Leibler divergence[J]. IEEE Transactions on Information Theory, 2014, 60(7): 3797-3820.
[32] TAO M, TANG H, WU F, et al. DF-GAN: a simple and effective baseline for text-to-image synthesis[C]//Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2022: 16494-16504.
[33] PENG D L, YANG W C, LIU C, et al. SAM-GAN: self-attention supporting multi-stage generative adversarial networks for text-to-image synthesis[J]. Neural Networks, 2021, 138: 57-67.
[34] ZHANG Z X, SCHOMAKER L. DiverGAN: an efficient and effective single-stage framework for diverse text-to-image generation[J]. Neurocomputing, 2022, 473: 182-198.
[35] TAN H C, LIU X P, YIN B C, et al. Cross-modal semantic matching generative adversarial networks for text-to-image synthesis[J]. IEEE Transactions on Multimedia, 2021, 24: 832-845.
[36] LAZCANO D, FRANCO N F, CREIXELL W. HGAN: hyperbolic generative adversarial network[J]. IEEE Access, 2021, 9: 96309-96320.
[37] QU E, ZOU D. Autoencoding hyperbolic representation for adversarial generation[J]. arXiv:2201.12825, 2022.
[38] KANG M, ZHU J Y, ZHANG R, et al. Scaling up GANs for text-to-image synthesis[C]//Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2023: 10124-10134.