Computer Engineering and Applications ›› 2022, Vol. 58 ›› Issue (20): 124-131. DOI: 10.3778/j.issn.1002-8331.2103-0065

• Pattern Recognition and Artificial Intelligence •

Video Captioning Method Based on Visual Feature Guided Fusion

MIAO Jiaowei, JI Yi, LIU Chunping   

  1. School of Computer Science and Technology, Soochow University, Suzhou, Jiangsu 215006, China
  • Online: 2022-10-15 Published: 2022-10-15




Abstract: Video captioning has become a research hotspot in recent years because of its wide range of potential applications. During decoding, insufficient interaction between visual features and text features leads to recognition errors in the generated captions. To address this problem, a multi-feature fusion video captioning method is proposed that strengthens the interaction between visual and text features within the encoder-decoder framework. In the decoding process, the method uses visual features to guide caption generation, so that each generation step receives not only text information but also visual reference information that steers it toward more accurate words, which substantially improves the quality of the generated captions. In addition, recurrent dropout is applied to alleviate overfitting in the decoder, which further improves the evaluation scores. Ablation and comparison experiments on the widely used MSVD and MSRVTT datasets show that the proposed method generates video captions effectively, with the comprehensive score increasing by 17.2 and 2.1 percentage points, respectively.

Key words: encoder-decoder framework, video captioning, feature fusion, dropout, feature interaction
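
The two ideas named in the abstract, feeding visual features into every decoding step and applying recurrent dropout in the decoder, can be illustrated with a minimal sketch. It is not the authors' model: the simple tanh recurrent cell, the fusion by concatenation, the reuse of one dropout mask across all time steps, and every name here (`decode_step`, `make_recurrent_dropout_mask`, `init_params`, `greedy_decode`) are assumptions made only for illustration.

```python
import math
import random


def matvec(matrix, vector):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def make_recurrent_dropout_mask(rng, hidden_size, rate):
    """Sample one inverted-dropout mask, meant to be reused at every time step.

    Assumption: the same mask is shared across steps; the paper's exact
    recurrent dropout variant is not stated in the abstract.
    """
    keep = 1.0 - rate
    return [(1.0 / keep if rng.random() < keep else 0.0) for _ in range(hidden_size)]


def decode_step(word_emb, visual_feat, h_prev, params, mask):
    """One decoding step whose input carries both text and visual information."""
    # Hypothetical fusion: concatenate the previous word embedding with the
    # visual feature (list "+" is concatenation, not element-wise addition).
    fused = word_emb + visual_feat
    # Recurrent dropout: mask the hidden state on the recurrent connection.
    h_dropped = [h * m for h, m in zip(h_prev, mask)]
    pre = [
        a + b + c
        for a, b, c in zip(
            matvec(params["W_x"], fused),
            matvec(params["W_h"], h_dropped),
            params["b"],
        )
    ]
    h = [math.tanh(p) for p in pre]
    logits = matvec(params["W_o"], h)  # unnormalized scores over the vocabulary
    return h, logits


def init_params(rng, emb_size, vis_size, hidden_size, vocab_size):
    """Create small random weights and a word-embedding table."""
    def rand_matrix(rows, cols):
        return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

    params = {
        "W_x": rand_matrix(hidden_size, emb_size + vis_size),
        "W_h": rand_matrix(hidden_size, hidden_size),
        "b": [0.0] * hidden_size,
        "W_o": rand_matrix(vocab_size, hidden_size),
    }
    return params, rand_matrix(vocab_size, emb_size)


def greedy_decode(visual_feat, embeddings, params, mask, bos_id, max_len):
    """Pick the highest-scoring word at each step, feeding it back as input."""
    h = [0.0] * len(params["b"])
    token, tokens = bos_id, []
    for _ in range(max_len):
        h, logits = decode_step(embeddings[token], visual_feat, h, params, mask)
        token = max(range(len(logits)), key=logits.__getitem__)
        tokens.append(token)
    return tokens
```

In this sketch the visual feature is fixed for the whole sentence and re-injected at each step, so every word prediction sees it directly rather than only through the initial hidden state. The dropout mask is meant for training; at inference one would pass a mask of all ones (a rate of 0).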
