Computer Engineering and Applications ›› 2024, Vol. 60 ›› Issue (17): 89-97. DOI: 10.3778/j.issn.1002-8331.2306-0024

• Theory and R&D •

Improved Transformer Decoding Algorithm for Dense Video Description

YANG Dawei, PAN Xiaofang, MAO Lin, ZHANG Rubo   

  1. School of Electromechanical Engineering, Dalian Minzu University, Dalian, Liaoning 116605, China
  • Online: 2024-09-01   Published: 2024-08-30

Abstract: When the Transformer is applied to dense video description, historical text features can interfere with subsequent text generation, making it difficult to capture dynamic video information and degrading the coherence and accuracy of the generated descriptions. To maintain contextual consistency while mitigating this interference, this paper proposes an improved Transformer decoding algorithm for dense video description, called D-Uformer. The algorithm uses a feedforward neural network (FNN) to enhance the representation of historical text features and, through skip connections, constructs a redundancy-pruning branch and a compensatory-enhancement branch: subtraction reduces the descriptive inaccuracy caused by over-focusing on historical text features and raises the model's attention to the input video features, while addition compensates for the contextual information lost during feature transfer, allowing the decoder to generate accurate and coherent descriptions of the current video content. Experimental results on the ActivityNet and Charades datasets demonstrate a clear performance improvement for D-Uformer: compared with the temporally descriptive probabilistic captioning (TDPC) network, it improves accuracy by up to 4.816% and diversity by up to 4.167%, and the generated descriptions not only align better with the video content but also conform more closely to human language conventions.
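
To make the decoding modification concrete, the following is a minimal PyTorch-style sketch of one plausible reading of the abstract, not the authors' released implementation: an FNN re-encodes the historical text features, a subtractive skip connection (the pruning branch) suppresses their redundant influence before cross-attention to the video features, and an additive skip connection (the compensatory branch) restores contextual information afterwards. The class name, layer placement, and dimensions are illustrative assumptions.

```python
# Hypothetical sketch of the decoder idea described in the abstract
# (not the authors' implementation).
import torch
import torch.nn as nn


class DUformerDecoderLayerSketch(nn.Module):
    def __init__(self, d_model: int = 512, n_heads: int = 8, d_ffn: int = 2048):
        super().__init__()
        # FNN that re-encodes (enhances) the historical text features
        self.ffn_hist = nn.Sequential(
            nn.Linear(d_model, d_ffn), nn.ReLU(), nn.Linear(d_ffn, d_model)
        )
        # Cross-attention from text queries to the input video features
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, text_feats: torch.Tensor, video_feats: torch.Tensor) -> torch.Tensor:
        # text_feats:  (batch, n_words,  d_model) historical text features
        # video_feats: (batch, n_frames, d_model) encoded video features
        hist = self.ffn_hist(text_feats)
        # Pruning branch: subtract the enhanced historical component so the
        # decoder does not over-focus on previously generated text.
        pruned = self.norm1(text_feats - hist)
        attn_out, _ = self.cross_attn(pruned, video_feats, video_feats)
        # Compensatory branch: add the historical features back to recover
        # contextual information lost during feature transfer.
        return self.norm2(attn_out + hist)


if __name__ == "__main__":
    layer = DUformerDecoderLayerSketch()
    txt = torch.randn(2, 12, 512)   # 12 previously generated word features
    vid = torch.randn(2, 32, 512)   # 32 video segment features
    print(layer(txt, vid).shape)    # torch.Size([2, 12, 512])
```

The exact wiring (where the subtraction and addition sit relative to attention and normalization) is a guess; the published D-Uformer architecture may place these branches differently.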

Key words: dense video description, Transformer network, decoding, feedforward neural network, skip connection