Computer Engineering and Applications ›› 2025, Vol. 61 ›› Issue (7): 288-296. DOI: 10.3778/j.issn.1002-8331.2311-0289

• Graphics and Image Processing •


Image Captioning Algorithm Based on Multi-Scale Feature Fusion

BAI Xuebing, CHE Jin, WU Jinman   

  1. School of Advanced Interdisciplinary, Ningxia University, Zhongwei, Ningxia 755000, China
  2. Ningxia Key Laboratory of Intelligent Sensing for Desert Information, Ningxia University, Yinchuan 750021, China
  3. School of Electronic and Electrical Engineering, Ningxia University, Yinchuan 750021, China
  • Online: 2025-04-01    Published: 2025-04-01


Abstract: To address the problems that the feature information extracted by existing image captioning algorithms is incomplete and that the encoder and decoder models are not unified, this paper proposes an image captioning algorithm based on multi-scale feature fusion. The multi-scale global features and region features of the image are first obtained by a multi-scale global feature extraction module and a region feature extraction module, respectively. The fused visual features are then produced by a feature fusion module and sent to the encoder of the Transformer model for feature encoding, and the caption is finally generated by the decoder of the Transformer model. Experiments on the MS-COCO dataset and comparisons with current mainstream algorithms show that the proposed algorithm achieves a score of 127.2% on the key CIDEr metric, 3.5 percentage points higher than the mainstream algorithms, and that the other metrics are also improved to varying degrees. Ablation experiments verify the effectiveness of the algorithm, and qualitative analysis shows that the proposed algorithm generates more accurate and more detailed captions.
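The following is a minimal PyTorch sketch of the pipeline described in the abstract, assuming the torch package is available. The backbones behind the multi-scale global and region feature extraction modules are stubbed with random tensors, the fusion operator is assumed to be concatenation plus a linear projection, and all module names, layer sizes, and hyperparameters (MultiScaleFusionCaptioner, global_proj, region_proj, fusion, output_head) are illustrative assumptions rather than the authors' implementation.

# Hedged sketch of the multi-scale fusion + Transformer captioning pipeline.
# Assumptions: feature extractors are stubbed, fusion = concat + linear projection,
# positional encodings are omitted for brevity.
import torch
import torch.nn as nn

class MultiScaleFusionCaptioner(nn.Module):
    def __init__(self, global_dim=1024, region_dim=2048, d_model=512, vocab_size=10000):
        super().__init__()
        # Project multi-scale global features and region features into a shared space.
        self.global_proj = nn.Linear(global_dim, d_model)
        self.region_proj = nn.Linear(region_dim, d_model)
        # Hypothetical fusion module: concatenate along the token axis, then mix.
        self.fusion = nn.Linear(d_model, d_model)
        # Standard Transformer encoder-decoder operating on [batch, tokens, channels].
        self.transformer = nn.Transformer(d_model=d_model, nhead=8,
                                          num_encoder_layers=3, num_decoder_layers=3,
                                          batch_first=True)
        self.word_embed = nn.Embedding(vocab_size, d_model)
        self.output_head = nn.Linear(d_model, vocab_size)

    def forward(self, global_feats, region_feats, caption_tokens):
        # global_feats: [B, N_g, global_dim]  multi-scale global features
        # region_feats: [B, N_r, region_dim]  region (object-level) features
        # caption_tokens: [B, T] caption token ids (teacher forcing during training)
        g = self.global_proj(global_feats)
        r = self.region_proj(region_feats)
        visual = self.fusion(torch.cat([g, r], dim=1))  # fused visual tokens -> encoder input
        tgt = self.word_embed(caption_tokens)
        tgt_mask = self.transformer.generate_square_subsequent_mask(tgt.size(1))
        hidden = self.transformer(src=visual, tgt=tgt, tgt_mask=tgt_mask)
        return self.output_head(hidden)  # [B, T, vocab_size] word logits

if __name__ == "__main__":
    model = MultiScaleFusionCaptioner()
    logits = model(torch.randn(2, 49, 1024),            # stand-in multi-scale global features
                   torch.randn(2, 36, 2048),            # stand-in region features
                   torch.randint(0, 10000, (2, 20)))    # stand-in caption tokens
    print(logits.shape)  # torch.Size([2, 20, 10000])

In this sketch the fused visual tokens play the role of the encoder input sequence, and caption generation at inference time would decode token by token from the Transformer decoder; the paper does not specify these implementation details, so they are design choices made here for illustration only.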

Key words: image captioning, multi-scale global features, region features, Transformer