计算机工程与应用 (Computer Engineering and Applications), 2023, Vol. 59, Issue (14): 107-113. DOI: 10.3778/j.issn.1002-8331.2208-0299

• Pattern Recognition and Artificial Intelligence •

融合多模态特征的新闻短视频分类模型

曾祥玖,刘达维,刘逸凡,赵志滨,柳秀梅,任酉贵   

  1. 东北大学 计算机科学与工程学院,沈阳 110169
  2. 辽宁省自然资源事务服务中心,沈阳 110001
  • 出版日期:2023-07-15 发布日期:2023-07-15

News Short Video Classification Model Fusing Multimodal Feature

ZENG Xiangjiu, LIU Dawei, LIU Yifan, ZHAO Zhibin, LIU Xiumei, REN Yougui   

  1. School of Computer Science and Engineering, Northeastern University, Shenyang 110169, China
  2. Service Center of Natural Resource Affairs of Liaoning Province, Shenyang 110001, China
  • Online: 2023-07-15 Published: 2023-07-15

摘要: 视频分类是理解、归纳和检索视频数据的一个重要环节。新闻短视频具有音频信息比图像信息更能完整地描述新闻事件的特点,但传统视频分类模型常常只考虑图像信息或融合了音频和图像的多模态信息,并没有考虑模态信息之间的主辅关系。针对上述问题,采用以音频模态为主,图像模态为辅的融合机制,提出了融合多模态特征的新闻短视频分类模型。为进一步利用音频为主的特点,采用两阶段训练方式,使用音频模态单独训练,音频和图像模态联合训练,利用图像信息修正分类结果,提升新闻短视频分类的准确率。为训练和评价模型,采集了10 304个新闻联播短视频作为实验数据集,总时长约为240 h。实验结果表明,所提模型的分类效果优于传统的新闻短视频分类模型。

关键词: 音画关系, 多模态特征融合, 新闻短视频分类

Abstract: Video classification is an important step in understanding, summarizing, and retrieving video data. In news short videos, audio information describes a news event more completely than image information does. Traditional video classification models, however, often either consider only image information or fuse audio and image information without accounting for the primary-secondary relationship between the modalities. To address this problem, a news short video classification model fusing multimodal features is proposed, built on a fusion mechanism in which the audio modality is primary and the image modality is auxiliary. To further exploit the dominance of audio, a two-stage training scheme is adopted: the audio modality is first trained alone, and the audio and image modalities are then trained jointly, so that image information corrects the classification results and improves the accuracy of news short video classification. To train and evaluate the model, 10 304 news broadcast short videos were collected as the experimental dataset, with a total duration of about 240 hours. The experimental results show that the proposed model classifies news short videos better than traditional news short video classification models.

Key words: audio-visual relationship, multimodal feature fusion, news short video classification
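
The two-stage, audio-primary training scheme described in the abstract can be illustrated with a toy sketch. This is not the authors' model: the abstract gives no architecture, so everything below is an assumption made for illustration, including the linear softmax classifiers, the additive fusion of audio and image logits, the zero-initialized image branch, the hand-made four-sample dataset, and all names (`train`, `predict_logits`, `Wa`, `Wi`). Stage 1 trains the audio branch alone; stage 2 trains both branches jointly, so the image branch only learns a correction on top of the audio prediction.

```python
import math

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def linear(W, x):
    # W holds one weight row per class.
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def predict_logits(Wa, Wi, xa, xi, use_image):
    # Audio is the primary signal; the image branch (assumed additive)
    # contributes a correction term only when use_image is True.
    za = linear(Wa, xa)
    if not use_image:
        return za
    return [a + c for a, c in zip(za, linear(Wi, xi))]

def sgd_update(W, x, grad, lr):
    # Cross-entropy gradient for a linear layer: grad[k] * x for class k.
    for k, row in enumerate(W):
        for j in range(len(row)):
            row[j] -= lr * grad[k] * x[j]

def train(data, Wa, Wi, epochs, lr, use_image):
    for _ in range(epochs):
        for xa, xi, y in data:
            p = softmax(predict_logits(Wa, Wi, xa, xi, use_image))
            grad = [p[k] - (1.0 if k == y else 0.0) for k in range(len(p))]
            sgd_update(Wa, xa, grad, lr)
            if use_image:
                sgd_update(Wi, xi, grad, lr)

def accuracy(data, Wa, Wi, use_image):
    hits = 0
    for xa, xi, y in data:
        z = predict_logits(Wa, Wi, xa, xi, use_image)
        hits += int(z.index(max(z)) == y)
    return hits / len(data)

# Hypothetical toy data: (audio features with bias term, image feature, label).
# The last two clips share identical audio features and differ only in the image.
DATA = [
    ([2.0, 1.0], [0.0], 1),
    ([-2.0, 1.0], [0.0], 0),
    ([0.0, 1.0], [1.0], 1),
    ([0.0, 1.0], [-1.0], 0),
]

Wa = [[0.0, 0.0], [0.0, 0.0]]  # audio branch weights, 2 classes
Wi = [[0.0], [0.0]]            # image branch weights, start at zero

# Stage 1: audio modality trained alone (image weights untouched).
train(DATA, Wa, Wi, epochs=100, lr=0.1, use_image=False)
acc_audio = accuracy(DATA, Wa, Wi, use_image=False)

# Stage 2: audio and image modalities trained jointly.
train(DATA, Wa, Wi, epochs=100, lr=0.1, use_image=True)
acc_joint = accuracy(DATA, Wa, Wi, use_image=True)

print("audio-only accuracy:", acc_audio)
print("joint accuracy:", acc_joint)
```

In this toy setup the audio-only classifier cannot separate the two clips with identical audio features, so the image branch is what resolves them in stage 2. A real system would presumably replace the linear layers with learned audio and image encoders.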