Video Captioning Method Based on Visual Feature Guided Fusion

doi:10.3778/j.issn.1002-8331.2103-0065

Abstract

Abstract: Video captioning has become a research hotspot in recent years because of its wide range of potential applications. To address the recognition errors that appear in captions when visual and text features interact insufficiently during decoding, this paper proposes a multi-feature fusion video captioning method that enhances the interaction between visual and text features within the encoder-decoder framework. During decoding, the method uses visual features to guide caption generation: each generation step receives not only text information but also visual reference information, which guides the model toward more accurate words and substantially improves the quality of the generated captions. In addition, recurrent dropout is applied to alleviate overfitting in the decoder, which further improves the evaluation scores. Ablation and comparison experiments on the widely used MSVD and MSRVTT datasets show that the proposed method generates video captions effectively, with the comprehensive score increasing by 17.2 and 2.1 percentage points, respectively.

Key words: encoder-decoder framework, video captioning, feature fusion, dropout, feature interaction
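
The decoding idea summarized in the abstract (each step sees both the previous word and a visual reference, with recurrent dropout in the decoder) can be illustrated with a minimal NumPy sketch. The abstract does not specify the fusion operator, the recurrent unit, or the dropout variant, so everything below is an assumption: additive fusion through a tanh layer, an LSTM cell, and dropout applied to the candidate cell update. All names (`init_params`, `decoder_step`, `W_word`, `W_vis`, and so on) are hypothetical and are not taken from the paper.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def init_params(embed_dim, visual_dim, hidden_dim, rng):
    """Random weights for the illustrative decoder step (hypothetical layout)."""
    scale = 0.1
    return {
        "W_word": rng.normal(0.0, scale, (hidden_dim, embed_dim)),
        "W_vis": rng.normal(0.0, scale, (hidden_dim, visual_dim)),
        "W_x": rng.normal(0.0, scale, (4 * hidden_dim, hidden_dim)),
        "W_h": rng.normal(0.0, scale, (4 * hidden_dim, hidden_dim)),
        "b": np.zeros(4 * hidden_dim),
    }


def decoder_step(word_emb, visual_feat, h_prev, c_prev, params,
                 drop_prob=0.0, rng=None):
    """One LSTM decoding step whose input fuses text and visual features.

    Assumptions (not from the paper): additive fusion followed by tanh,
    and recurrent dropout applied only to the candidate cell update.
    """
    if not 0.0 <= drop_prob < 1.0:
        raise ValueError("drop_prob must be in [0, 1)")

    # Fuse the previous word embedding with the visual reference feature.
    fused = np.tanh(params["W_word"] @ word_emb + params["W_vis"] @ visual_feat)

    # Standard LSTM gates computed from the fused input and previous state.
    gates = params["W_x"] @ fused + params["W_h"] @ h_prev + params["b"]
    i, f, o, g = np.split(gates, 4)
    i, f, o, g = sigmoid(i), sigmoid(f), sigmoid(o), np.tanh(g)

    # Recurrent dropout on the candidate update only, so the stored memory
    # c_prev is never zeroed out (training-time behaviour).
    if drop_prob > 0.0 and rng is not None:
        mask = rng.random(g.shape) >= drop_prob
        g = g * mask / (1.0 - drop_prob)

    c = f * c_prev + i * g
    h = o * np.tanh(c)
    return h, c


# Usage with toy dimensions; real feature sizes depend on the chosen encoder.
rng = np.random.default_rng(0)
params = init_params(embed_dim=8, visual_dim=16, hidden_dim=12, rng=rng)
word_emb = rng.normal(size=8)
visual_feat = rng.normal(size=16)
h0, c0 = np.zeros(12), np.zeros(12)
h1, c1 = decoder_step(word_emb, visual_feat, h0, c0, params,
                      drop_prob=0.3, rng=rng)
```

At inference time `drop_prob` would be left at 0 so the step is deterministic. Whether the authors fuse features additively, by concatenation, or through attention cannot be determined from the abstract alone.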

MIAO Jiaowei, JI Yi, LIU Chunping. Video Captioning Method Based on Visual Feature Guided Fusion[J]. Computer Engineering and Applications, 2022, 58(20): 124-131.
