Computer Engineering and Applications ›› 2022, Vol. 58 ›› Issue (4): 198-205. DOI: 10.3778/j.issn.1002-8331.2104-0388

• Pattern Recognition and Artificial Intelligence •

Video Summarization Generation Based on Self-attention Mechanism and Random Forest Regression

LI Leiting, WU Guangli, GUO Zhenzhou   

  1. School of Cyber Security, Gansu University of Political Science and Law, Lanzhou 730070, China
  2. Key Laboratory of China’s Ethnic Languages and Information Technology of Ministry of Education, Northwest Minzu University, Lanzhou 730030, China
  • Online: 2022-02-15 Published: 2022-02-15




Abstract: Video summarization compresses a video by generating key frames or key segments. It greatly shortens viewing time while still conveying the main content, and it is widely used in fast video browsing and retrieval. Most existing methods explore image content only and ignore the temporal nature of video, and their models learn poorly from fluctuating data, so the generated summaries lack temporal coherence and representativeness. This paper proposes a video summarization network built on an encoder-decoder framework. Specifically, the encoder extracts features with a convolutional neural network and uses a self-attention mechanism to increase the weight of key features. The decoder consists of a bi-directional long short-term memory (Bi-LSTM) network fused with a random forest; adjusting the proportions of the random forest and the Bi-LSTM network in the loss function gives the model strong stability and prediction accuracy. The method is compared with seven other methods on two datasets, and the overall results demonstrate its effectiveness and feasibility. In summary, the proposed network uses self-attention to optimize the features and combines the Bi-LSTM network with a random forest to improve the stability and generalization of the model, which effectively reduces the loss value and makes the generated summaries more consistent with users' visual characteristics.
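
The two mechanisms named in the abstract, self-attention over frame features and a loss that balances the Bi-LSTM and random forest terms, can be illustrated with a minimal sketch. The abstract does not give the exact formulation, so everything below is an assumption made for illustration: the attention uses identity projections (no learned query/key/value weights), the loss is a simple convex combination of two mean squared errors, and the names `self_attention`, `combined_loss`, and `alpha` are hypothetical rather than the authors'.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(frames):
    """Scaled dot-product self-attention over per-frame feature vectors.

    Assumption: queries, keys, and values are the raw features themselves
    (identity projections); a trained model would learn these projections.
    """
    dim = len(frames[0])
    attended = []
    for query in frames:
        # Similarity of this frame to every frame, scaled by sqrt(dim).
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in frames
        ]
        weights = softmax(scores)
        # Each output is a weighted average of all frame features.
        attended.append([
            sum(w * frame[c] for w, frame in zip(weights, frames))
            for c in range(dim)
        ])
    return attended


def mse(predicted, target):
    """Mean squared error between two equal-length score lists."""
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(target)


def combined_loss(lstm_scores, forest_scores, target, alpha=0.5):
    """Hypothetical weighted loss: alpha on the Bi-LSTM term, (1 - alpha)
    on the random forest term. The paper's actual weighting may differ."""
    return alpha * mse(lstm_scores, target) + (1 - alpha) * mse(forest_scores, target)


# Toy data: three frames with 2-dimensional features, and made-up
# per-frame importance scores standing in for the two decoder branches.
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attended = self_attention(frames)
loss = combined_loss([0.2, 0.8], [0.4, 0.6], [0.0, 1.0], alpha=0.5)
```

In this sketch, moving `alpha` toward 1 makes the loss depend only on the Bi-LSTM predictions, and moving it toward 0 makes it depend only on the random forest predictions, which is one simple way to read "adjusting the proportions" in the abstract.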

Key words: computer vision, video summarization, self-attention mechanism, long short-term memory, random forest regression
