Computer Engineering and Applications, 2025, Vol. 61, Issue (6): 171-182. DOI: 10.3778/j.issn.1002-8331.2310-0392

• Pattern Recognition and Artificial Intelligence •


Research on Multimodal Hierarchical Feature Mapping and Fusion Representation Method

GUO Xiaoyu, MA Jing, CHEN Jie   

  1. College of Economics and Management, Nanjing University of Aeronautics and Astronautics, Nanjing 211106, China
  • Online:2025-03-15 Published:2025-03-14


Abstract: Multimodal feature representation is the foundation of multimodal tasks. To address the problem that existing multimodal feature representation methods fuse at only a single level and therefore fail to adequately map the relationships between modalities, a multimodal hierarchical feature mapping and fusion representation method is proposed. Built on the text model RoBERTa and the image model DenseNet, the method extracts features spanning low to high levels from the intermediate layers of both models and, following the idea of feature reuse, maps and fuses the text and image features level by level, capturing the internal relationships between the two modalities and integrating their features thoroughly. The hierarchical mapped-and-fused representation is then fed into a classifier for sentiment classification of multimodal public opinion, and the constructed representation method is compared against baseline representation methods. Experimental results show that the proposed method surpasses all baseline models in sentiment classification performance on both the Weibo public-opinion and MVSA-Multiple datasets, improving the F1 score by 0.0137 on the Weibo dataset and by 0.0222 on the MVSA-Multiple dataset. Image features can improve the sentiment classification accuracy obtained with text-only features, but the degree of improvement is closely tied to the fusion strategy. The multimodal hierarchical feature mapping and fusion representation method effectively maps the relationships between text and image features and improves sentiment classification of multimodal public opinion.
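To make the described architecture concrete, the following is a minimal PyTorch sketch of how hierarchical features might be extracted from the intermediate layers of RoBERTa and DenseNet and fused level by level with feature reuse. The chosen encoder layers, projection size, fusion head, and class count are illustrative assumptions, not the authors' exact configuration.

# Illustrative sketch only: hierarchical feature extraction from RoBERTa and
# DenseNet intermediate layers, with dense (feature-reuse) fusion per level.
# Layer indices, projection sizes, and the fusion head are assumptions, not
# the paper's exact architecture.
import torch
import torch.nn as nn
from torchvision.models import densenet121
from transformers import RobertaModel, RobertaTokenizer

TEXT_LAYERS = (3, 6, 9, 12)   # low-to-high RoBERTa encoder layers (assumed)
PROJ_DIM = 256                # shared projection size (assumed)
NUM_CLASSES = 3               # e.g. negative / neutral / positive

class HierarchicalFusionClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        self.text_encoder = RobertaModel.from_pretrained("roberta-base")
        self.image_encoder = densenet121(weights="DEFAULT").features
        # Channel widths at the outputs of densenet121's four dense blocks.
        img_channels = (256, 512, 1024, 1024)
        self.text_proj = nn.ModuleList(
            nn.Linear(self.text_encoder.config.hidden_size, PROJ_DIM)
            for _ in TEXT_LAYERS)
        self.img_proj = nn.ModuleList(
            nn.Linear(c, PROJ_DIM) for c in img_channels)
        # Level-i fusion sees text_i, image_i, plus all earlier fused
        # representations (DenseNet-style feature reuse across levels).
        self.fuse = nn.ModuleList(
            nn.Linear(2 * PROJ_DIM + i * PROJ_DIM, PROJ_DIM)
            for i in range(len(TEXT_LAYERS)))
        self.classifier = nn.Linear(len(TEXT_LAYERS) * PROJ_DIM, NUM_CLASSES)

    def image_block_features(self, pixels):
        """Collect a pooled feature vector after each dense block."""
        feats, x = [], pixels
        for name, layer in self.image_encoder.named_children():
            x = layer(x)
            if name.startswith("denseblock"):
                feats.append(x.mean(dim=(2, 3)))  # global average pool
        return feats

    def forward(self, input_ids, attention_mask, pixels):
        out = self.text_encoder(input_ids, attention_mask=attention_mask,
                                output_hidden_states=True)
        # [CLS] vector from the selected intermediate layers, low to high.
        text_feats = [out.hidden_states[l][:, 0] for l in TEXT_LAYERS]
        img_feats = self.image_block_features(pixels)

        fused_so_far = []
        for i, (t, v) in enumerate(zip(text_feats, img_feats)):
            parts = [self.text_proj[i](t), self.img_proj[i](v)] + fused_so_far
            fused_so_far.append(torch.relu(self.fuse[i](torch.cat(parts, -1))))
        return self.classifier(torch.cat(fused_so_far, dim=-1))

if __name__ == "__main__":
    tok = RobertaTokenizer.from_pretrained("roberta-base")
    enc = tok(["an example weibo post"], return_tensors="pt",
              padding=True, truncation=True)
    model = HierarchicalFusionClassifier()
    logits = model(enc["input_ids"], enc["attention_mask"],
                   torch.randn(1, 3, 224, 224))  # dummy image for shape check
    print(logits.shape)  # torch.Size([1, 3])

Each fusion level concatenates the current text and image projections with all previously fused vectors, mirroring the dense-connectivity notion of feature reuse mentioned in the abstract; the final representation concatenates all level-wise fusions before classification.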

Key words: multimodal feature fusion, hierarchical feature, mapping and fusion, sentiment classification, feature representation