Computer Engineering and Applications ›› 2022, Vol. 58 ›› Issue (12): 218-225. DOI: 10.3778/j.issn.1002-8331.2011-0025

• Graphics and Image Processing •

Image Caption Combining Global-Local Features and Attention

XIE Qibin, CHEN Pinghua   

  1. School of Computers, Guangdong University of Technology, Guangzhou 510006, China
  • Online: 2022-06-15 Published: 2022-06-15




Abstract: To further improve the accuracy of the text generated by image captioning, this paper proposes a captioning method that combines global and local image features with an attention mechanism. The method improves on the traditional encoder-decoder model. At the level of the overall architecture, the encoder uses the residual network ResNet101 to extract both global and local image features, which helps avoid missing or mispredicted objects, and the decoder uses a bidirectional GRU with an embedded, improved attention mechanism to generate the text sequence. At the level of individual components, the proposed attention mechanism is an independent recurrent structure: attention weights are obtained by computing the similarity between the local image feature vectors and the semantic vector, which strengthens the mapping between image features and semantic information. Experiments on the MSCOCO dataset show that the method improves to varying degrees on evaluation metrics such as BLEU, CIDEr, and METEOR, indicating that the captions it generates are accurate and rich in detail.

Key words: image caption, attention mechanism, encoder-decoder framework, global features, local features
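
The abstract states that attention weights come from the similarity between local image feature vectors and a semantic vector, but it does not give the formula. The following is a minimal sketch of that idea, not the authors' implementation. It assumes a scaled dot product as the similarity measure and a softmax to normalise the scores; the function name `similarity_attention` and the toy vectors are hypothetical.

```python
import math


def similarity_attention(local_feats, semantic_vec):
    """Weight local region features by their similarity to a semantic vector.

    local_feats: list of k region feature vectors, each of length d.
    semantic_vec: one vector of length d (e.g. a decoder hidden state).
    Returns (weights, context): k attention weights and a length-d context vector.
    """
    d = len(semantic_vec)
    # Assumption: similarity is a scaled dot product; the paper may use another measure.
    scores = [
        sum(f * s for f, s in zip(feat, semantic_vec)) / math.sqrt(d)
        for feat in local_feats
    ]
    # Softmax with max-subtraction for numerical stability.
    top = max(scores)
    exps = [math.exp(x - top) for x in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Context vector: attention-weighted sum of the local features.
    context = [
        sum(w * feat[j] for w, feat in zip(weights, local_feats))
        for j in range(d)
    ]
    return weights, context


# Toy example: three regions with 2-dimensional features.
local_feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
semantic_vec = [2.0, 0.0]
weights, context = similarity_attention(local_feats, semantic_vec)
```

In this toy input the first and third regions align with the semantic vector and the second does not, so the second region receives the smallest weight. In a full model the context vector would be passed to the GRU decoder at each time step.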
