[1] WANG D F, HU H F, CHEN D H. Transformer with sparse self-attention mechanism for image captioning[J]. Electronics Letters, 2020, 56(15): 764-766.
[2] LI Z X, LIN L, ZHANG C L, et al. A semi-supervised learning approach based on adaptive weighted fusion for automatic image annotation[J]. ACM Transactions on Multimedia Computing, Communications, and Applications, 2021, 17(1): 1-23.
[3] HUANG L, WANG W M, CHEN J, et al. Attention on attention for image captioning[C]//Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision. Piscataway: IEEE, 2019: 4634-4643.
[4] WU M R, ZHANG X Y, SUN X S, et al. DIFNet: boosting visual information flow for image captioning[C]//Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2022: 18020-18029.
[5] YANG X, ZHANG H W, GAO C Y, et al. Learning to collocate visual-linguistic neural modules for image captioning[J]. International Journal of Computer Vision, 2023, 131(1): 82-100.
[6] CHEN F L, ZHANG D Z, HAN M L, et al. VLP: a survey on vision-language pre-training[J]. Machine Intelligence Research, 2023, 20(1): 38-56.
[7] 杨文瑞, 沈韬, 朱艳, 等. 融合ELMo词嵌入的多模态Transformer的图像描述算法[J]. 计算机工程与应用, 2022, 58(21): 223-231.
YANG W R, SHEN T, ZHU Y, et al. Image caption with ELMo embedding and multimodal transformer[J]. Computer Engineering and Applications, 2022, 58(21): 223-231.
[8] GUO L T, LIU J, ZHU X X, et al. Normalized and geometry-aware self-attention network for image captioning[C]//Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2020: 10327-10336.
[9] XU S, LI Y J, LIN M B, et al. Q-DETR: an efficient low-bit quantized detection transformer[C]//Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2023: 3842-3851.
[10] KAMANGAR Z U, SHAIKH G M, HASSAN S, et al. Image caption generation related to object detection and colour recognition using transformer-decoder[C]//Proceedings of the 2023 4th International Conference on Computing, Mathematics and Engineering Technologies. Piscataway: IEEE, 2023: 1-5.
[11] VINYALS O, TOSHEV A, BENGIO S, et al. Show and tell: a neural image caption generator[J]. arXiv:1411.4555, 2014.
[12] MARZOUK R, ALABDULKREEM E, NOUR M K, et al. Natural language processing with optimal deep learning-enabled intelligent image captioning system[J]. Computers, Materials & Continua, 2023, 74(2): 4435-4451.
[13] XU K, BA J L, KIROS R, et al. Show, attend and tell: neural image caption generation with visual attention[C]//Proceedings of the 32nd International Conference on Machine Learning. New York: ACM, 2015: 2048-2057.
[14] 刘茂福, 施琦, 聂礼强. 基于视觉关联与上下文双注意力的图像描述生成方法[J]. 软件学报, 2022, 33(9): 3210-3222.
LIU M F, SHI Q, NIE L Q. Image captioning based on visual relevance and context dual attention[J]. Journal of Software, 2022, 33(9): 3210-3222.
[15] ANDERSON P, HE X D, BUEHLER C, et al. Bottom-up and top-down attention for image captioning and visual question answering[C]//Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2018: 6077-6086.
[16] DU S, ZHU H, LIN G F, et al. Object semantic analysis for image captioning[J]. Multimedia Tools and Applications, 2023, 82(28): 43179-43206.
[17] 卓亚琦, 魏家辉, 李志欣. 基于双注意模型的图像描述生成方法研究[J]. 电子学报, 2022, 50(5): 1123-1130.
ZHUO Y Q, WEI J H, LI Z X. Research on image captioning based on double attention model[J]. Acta Electronica Sinica, 2022, 50(5): 1123-1130.
[18] VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need[J]. arXiv:1706.03762, 2017.
[19] RAUNAK V, MENEZES A, POST M, et al. Do GPTs produce less literal translations?[C]//Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics. Stroudsburg: ACL, 2023: 1041-1050.
[20] HEWITT J, THICKSTUN J, MANNING C, et al. Backpack language models[C]//Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics. Stroudsburg: ACL, 2023: 9103-9125.
[21] DOSOVITSKIY A, BEYER L, KOLESNIKOV A, et al. An image is worth 16x16 words: Transformers for image recognition at scale[J]. arXiv:2010.11929, 2020.
[22] GAO S Y, ZHOU C L, ZHANG J. Generalized relation modeling for transformer tracking[C]//Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2023: 18686-18695.
[23] 季瑞瑞, 谢宇辉, 骆丰凯, 等. 改进视觉Transformer的人脸识别方法[J]. 计算机工程与应用, 2023, 59(8): 117-126.
JI R R, XIE Y H, LUO F K, et al. Face recognition method based on improved visual Transformer[J]. Computer Engineering and Applications, 2023, 59(8): 117-126.
[24] LIU X Y, PENG H W, ZHENG N X, et al. EfficientViT: memory efficient vision transformer with cascaded group attention[C]//Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2023: 14420-14430.
[25] RENNIE S J, MARCHERET E, MROUEH Y, et al. Self-critical sequence training for image captioning[C]//Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2017: 1179-1195.
[26] LIN T Y, MAIRE M, BELONGIE S, et al. Microsoft COCO: common objects in context[C]//Proceedings of the European Conference on Computer Vision. Cham: Springer, 2014: 740-755.
[27] KARPATHY A, LI F F. Deep visual-semantic alignments for generating image descriptions[C]//Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2015: 3128-3137.
[28] PAPINENI K, ROUKOS S, WARD T, et al. BLEU: a method for automatic evaluation of machine translation[C]//Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics. Stroudsburg: ACL, 2002: 311-318.
[29] LIN C Y. ROUGE: a package for automatic evaluation of summaries[C]//Proceedings of the Workshop on Text Summarization Branches Out. Stroudsburg: ACL, 2004: 74-81.
[30] VEDANTAM R, ZITNICK C L, PARIKH D. CIDEr: consensus-based image description evaluation[C]//Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2015: 4566-4575.
[31] 唐渔, 何志琴, 周宇辉, 等. 基于Se-ResNet50特征编码器的公共环境图像描述生成[J]. 计算机应用研究, 2023, 40(6): 1864-1869.
TANG Y, HE Z Q, ZHOU Y H, et al. Public environment image caption generation based on Se-ResNet-50 feature encoder[J]. Application Research of Computers, 2023, 40(6): 1864-1869.
[32] HERDADE S, KAPPELER A, BOAKYE K, et al. Image captioning: transforming objects into words[C]//Proceedings of the 33rd International Conference on Neural Information Processing Systems, 2019: 11137-11147.
[33] ZHANG J, LI K K, WANG Z. Parallel-fusion LSTM with synchronous semantic and visual information for image captioning[J]. Journal of Visual Communication and Image Representation, 2021, 75: 103044.
[34] ZHANG Z J, WU Q, WANG Y, et al. Exploring region relationships implicitly: image captioning with visual relationship attention[J]. Image and Vision Computing, 2021, 109: 104146.
[35] PEI H L, CHEN Q H, WANG J, et al. Visual relational reasoning for image caption[C]//Proceedings of the 2020 International Joint Conference on Neural Networks. Piscataway: IEEE, 2020: 1-8.