Image Caption with ELMo Embedding and Multimodal Transformer

doi:10.3778/j.issn.1002-8331.2103-0426

Abstract

Image captioning aims to generate a natural-language description of a given image. To address the incomplete understanding of semantic information in existing algorithms, a multimodal Transformer model for image captioning is proposed. In its attention module, the model captures intra-modal and inter-modal interactions simultaneously. It further uses ELMo to obtain word embeddings that contain contextual information, giving the model a richer semantic description as input. As a result, the model can better understand and reason over complex multimodal information and generate more accurate natural-language descriptions. The model was evaluated extensively on the Microsoft COCO dataset. The experimental results show that, compared with the baseline model using bottom-up attention and LSTM, it improves BLEU-1, BLEU-2, BLEU-3, BLEU-4, ROUGE-L, and CIDEr-D by 0.7, 0.4, 0.9, 1.3, 0.6, and 4.9 percentage points, respectively.

Keywords: Transformer, image captioning, ELMo, attention mechanism
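
The abstract states that the attention module captures interactions within and between modalities at the same time. One common way to get this effect, shown here only as a minimal sketch and not as the authors' architecture, is to run scaled dot-product attention over the concatenation of image-region features and word embeddings, so that a single attention matrix contains both intra-modal and inter-modal blocks. Everything in the block is an assumption made for illustration: the function name `joint_attention`, the single attention head, the random features and projection matrices, the dimensions, and the absence of the causal mask that a caption decoder would normally apply to the text positions.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def joint_attention(image_feats, word_feats, w_q, w_k, w_v):
    """Single-head scaled dot-product attention over both modalities.

    Hypothetical helper: concatenating the two token sets lets every
    token attend to tokens of its own modality and of the other one.
    """
    tokens = np.concatenate([image_feats, word_feats], axis=0)
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ v, weights


# Random stand-ins for region features and contextual word embeddings
# (assumed sizes, chosen only to keep the example small).
rng = np.random.default_rng(0)
n_img, n_txt, d = 4, 3, 8
image_feats = rng.normal(size=(n_img, d))
word_feats = rng.normal(size=(n_txt, d))
w_q, w_k, w_v = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))

out, weights = joint_attention(image_feats, word_feats, w_q, w_k, w_v)

# The four blocks of the attention matrix.
intra_image = weights[:n_img, :n_img]    # image regions attending to image regions
image_to_text = weights[:n_img, n_img:]  # image regions attending to words
text_to_image = weights[n_img:, :n_img]  # words attending to image regions
intra_text = weights[n_img:, n_img:]     # words attending to words
```

In this sketch the diagonal blocks (`intra_image`, `intra_text`) carry the within-modality interactions and the off-diagonal blocks (`image_to_text`, `text_to_image`) carry the cross-modality ones. In a setup like the one the abstract describes, the random word features would be replaced by ELMo embeddings.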

YANG Wenrui, SHEN Tao, ZHU Yan, ZENG Kai, LIU Yingli. Image Caption with ELMo Embedding and Multimodal Transformer[J]. Computer Engineering and Applications, 2022, 58(21): 223-231.

杨文瑞, 沈韬, 朱艳, 曾凯, 刘英莉. 融合ELMo词嵌入的多模态Transformer的图像描述算法[J]. 计算机工程与应用, 2022, 58(21): 223-231.
