[1] FARHADI A,HEJRATI S M M,SADEGHI M A,et al.Every picture tells a story:generating sentences from images[C]//European Conference on Computer Vision,2010.
[2] KULKARNI G,PREMRAJ V,ORDONEZ V,et al.BabyTalk:understanding and generating simple image descriptions[J].IEEE Transactions on Pattern Analysis & Machine Intelligence,2013,35(12):2891-2903.
[3] FANG H,GUPTA S,IANDOLA F,et al.From captions to visual concepts and back[C]//IEEE Conference on Computer Vision and Pattern Recognition,2015:1473-1482.
[4] MITCHELL M,HAN X,DODGE J,et al.Midge:generating image descriptions from computer vision detections[C]//Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics,2012:747-756.
[5] DEVLIN J,CHENG H,FANG H,et al.Language models for image captioning:the quirks and what works[J].arXiv:1505.01809,2015.
[6] MAO J H,XU W,YANG Y,et al.Explain images with multimodal recurrent neural networks[J].arXiv:1410.1090,2014.
[7] VINYALS O,TOSHEV A,BENGIO S,et al.Show and tell:a neural image caption generator[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition,Boston,USA,2015:3156-3164.
[8] HOCHREITER S,SCHMIDHUBER J.Long short-term memory[J].Neural Computation,1997,9(8):1735-1780.
[9] JIA X,GAVVES E,FERNANDO B,et al.Guiding the long-short term memory model for image caption generation[C]//IEEE International Conference on Computer Vision,2015.
[10] XU K,BA J L,KIROS R,et al.Show,attend and tell:neural image caption generation with visual attention[C]//Proceedings of the 32nd International Conference on Machine Learning,Lille,France,2015:2048-2057.
[11] LU J,XIONG C,PARIKH D,et al.Knowing when to look:adaptive attention via a visual sentinel for image captioning[C]//2017 IEEE Conference on Computer Vision and Pattern Recognition(CVPR),2017.
[12] HE K M,ZHANG X Y,REN S Q,et al.Deep residual learning for image recognition[C]//Proceedings of IEEE Conference on Computer Vision and Pattern Recognition,2016:770-778.
[13] ATLIHA V,SESOK D.Comparison of VGG and ResNet used as encoders for image captioning[C]//2020 IEEE Open Conference of Electrical,Electronic and Information Sciences(eStream),2020.
[14] LIN C Y,HOVY E.Automatic evaluation of summaries using n-gram co-occurrence statistics[C]//Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology,2003.
[15] BANERJEE S,LAVIE A.METEOR:an automatic metric for MT evaluation with improved correlation with human judgments[C]//Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization,2005:228-231.
[16] VEDANTAM R,ZITNICK C L,PARIKH D.CIDEr:consensus-based image description evaluation[C]//2015 IEEE Conference on Computer Vision and Pattern Recognition(CVPR),2015:4566-4575.
[17] LIN T Y,MAIRE M,BELONGIE S,et al.Microsoft COCO:common objects in context[J].arXiv:1405.0312,2014.
[18] KINGMA D P,BA J.Adam:a method for stochastic optimization[J].arXiv:1412.6980,2014.
[19] WANG P,NG H T.A beam-search decoder for normalization of social media text with application to machine translation[C]//Conference of the North American Chapter of the Association for Computational Linguistics:Human Language Technologies,2013.
[20] JIN H Z,LIU X L,HU Z K.An image caption generation model combining global and local features[J].Journal of Applied Sciences,2019,37(4):501-509.
[21] LI L H,TANG S,ZHANG Y D,et al.GLA:global-local attention for image description[J].IEEE Transactions on Multimedia,2018,20(3):726-737.