Computer Engineering and Applications ›› 2025, Vol. 61 ›› Issue (15): 229-240. DOI: 10.3778/j.issn.1002-8331.2405-0044

• Pattern Recognition and Artificial Intelligence •


Text-to-Image Generation Method Based on Semantic Enhancement and Feature Fusion

WU Haowen, WANG Peng, LI Liangliang, DI Ruohai, LI Xiaoyan, LYU Zhigang   

  1. School of Electronic Information Engineering, Xi’an Technological University, Xi’an 710021, China
    2. Development Planning Service, Xi’an Technological University, Xi’an 710021, China
    3. School of Mechatronic Engineering, Xi’an Technological University, Xi’an 710021, China
  • Online: 2025-08-01    Published: 2025-07-31


Abstract: Text-to-image generation is a highly challenging task in machine learning. Despite significant breakthroughs, generated images still suffer from insufficient fine-grained detail and weak semantic consistency. Therefore, a text-to-image generation method based on semantic enhancement and feature fusion (SEF-GAN) is proposed. Firstly, a spatial cross-reconstruction module is presented to address insufficient initial feature representation: it separates feature maps by information content and cross-reconstructs them to obtain more refined features. Secondly, to make better use of text attribute information, a semantic association attention module is designed, which strengthens the semantic consistency between text descriptions and visual content. Finally, to exploit the hidden connections between image region features and text semantic labels, a channel feature fusion module is constructed. This module applies affine transformations that couple regional image features with text hidden-layer features, reconstructs the target regions while preserving text-irrelevant image content, and appends inverted residual structures to further strengthen feature representation. Experimental results on the CUB and COCO datasets show that, compared with existing state-of-the-art methods, the proposed method improves the IS metric by 18.8% and 6.3%, the FID metric by 33.9% and 14.6%, and the RP metric by 10.9% and 3.3%, respectively. This confirms that the proposed method generates images with richer details that better match the text descriptions.
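The abstract names two concrete mechanisms inside the channel feature fusion module: a text-conditioned affine transformation over image features and an inverted residual structure. The following minimal PyTorch sketch illustrates these two ideas only; the class names, layer choices, and dimensions (TextAffineFusion, InvertedResidual, FiLM-style scale/shift, ReLU6) are illustrative assumptions, not the authors' implementation.

```python
# Hedged sketch of the channel feature fusion idea: text-conditioned affine
# modulation of image features, followed by an inverted-residual refinement.
# All names and hyperparameters here are assumptions for illustration.
import torch
import torch.nn as nn

class TextAffineFusion(nn.Module):
    """Predict per-channel scale/shift from a sentence embedding and apply
    them to image feature maps (FiLM-style affine conditioning)."""
    def __init__(self, text_dim: int, channels: int):
        super().__init__()
        self.gamma = nn.Linear(text_dim, channels)  # per-channel scale
        self.beta = nn.Linear(text_dim, channels)   # per-channel shift

    def forward(self, img_feat: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
        # img_feat: (B, C, H, W); text_emb: (B, text_dim)
        g = self.gamma(text_emb).unsqueeze(-1).unsqueeze(-1)  # (B, C, 1, 1)
        b = self.beta(text_emb).unsqueeze(-1).unsqueeze(-1)
        # The identity term is one plausible way to preserve
        # text-irrelevant image content, as the abstract describes.
        return img_feat + g * img_feat + b

class InvertedResidual(nn.Module):
    """MobileNetV2-style inverted residual: expand -> depthwise -> project."""
    def __init__(self, channels: int, expand: int = 4):
        super().__init__()
        mid = channels * expand
        self.block = nn.Sequential(
            nn.Conv2d(channels, mid, 1, bias=False),   # pointwise expansion
            nn.BatchNorm2d(mid), nn.ReLU6(inplace=True),
            nn.Conv2d(mid, mid, 3, padding=1, groups=mid, bias=False),  # depthwise
            nn.BatchNorm2d(mid), nn.ReLU6(inplace=True),
            nn.Conv2d(mid, channels, 1, bias=False),   # linear projection
            nn.BatchNorm2d(channels),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.block(x)  # skip connection over the narrow ends

if __name__ == "__main__":
    fuse, refine = TextAffineFusion(256, 64), InvertedResidual(64)
    img = torch.randn(2, 64, 32, 32)   # regional image features
    txt = torch.randn(2, 256)          # text hidden-layer features
    out = refine(fuse(img, txt))
    print(out.shape)                   # torch.Size([2, 64, 32, 32])
```

The final projection in the inverted residual is deliberately left without an activation, following the MobileNetV2 convention of a linear bottleneck; how SEF-GAN actually wires these pieces together is specified in the paper itself, not here.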

Key words: text-to-image generation, generative adversarial network, attribute feature learning, image semantic fusion, inverted residual structure