Image Caption Combining Global-Local Features and Attention

doi:10.3778/j.issn.1002-8331.2011-0025

Abstract

To further improve the accuracy of the text generated in image captioning, this paper proposes an image captioning method that combines global-local features with an attention mechanism. The method improves on the traditional encoder-decoder model. At the overall level, the encoder stage uses the residual network ResNet101 to extract both global and local features of the image, which helps avoid missing or mispredicted objects, and the decoder stage uses a bidirectional GRU with an embedded improved attention mechanism to generate the text sequence. At the local level, the proposed attention mechanism is an independent recurrent structure: the attention weights are obtained by computing the similarity between the local image feature vectors and the semantic vector, which strengthens the mapping between image features and semantic information. Experimental results on the MSCOCO dataset show that the proposed algorithm improves, to varying degrees, on evaluation metrics such as BLEU, CIDEr, and METEOR, indicating that the captions generated by the model are accurate and rich in detail.

Key words: image caption, attention mechanism, encoder-decoder framework, global features, local features
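
The abstract states that attention weights come from the similarity between local image feature vectors and a semantic vector, but it does not give the similarity function or the normalization. The sketch below is only an illustration under stated assumptions: it uses a dot product as the similarity and a softmax to normalize, and the names `attention_weights`, `attend`, `local_feats`, and `semantic_vec` are hypothetical, not taken from the paper.

```python
import math


def attention_weights(local_feats, semantic_vec):
    """Return one weight per local feature vector.

    Assumption: similarity is a plain dot product and weights are
    softmax-normalized; the paper's actual formulation may differ.
    """
    scores = [sum(f * s for f, s in zip(feat, semantic_vec))
              for feat in local_feats]
    # Subtract the maximum score for numerical stability before exponentiating.
    top = max(scores)
    exps = [math.exp(score - top) for score in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(local_feats, semantic_vec):
    """Return (weights, context), where context is the weighted sum of local features."""
    weights = attention_weights(local_feats, semantic_vec)
    dim = len(local_feats[0])
    context = [sum(w * feat[d] for w, feat in zip(weights, local_feats))
               for d in range(dim)]
    return weights, context


# Toy data: three 2-dimensional local region features and one semantic vector.
local_feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
semantic_vec = [2.0, 0.0]
weights, context = attend(local_feats, semantic_vec)
```

In this toy case the first and third regions align equally well with the semantic vector, so they receive equal weights that are larger than the weight of the second region. In a decoder, the resulting context vector would typically be fed into the recurrent unit at each time step.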

XIE Qibin, CHEN Pinghua. Image Caption Combining Global-Local Features and Attention[J]. Computer Engineering and Applications, 2022, 58(12): 218-225.