Computer Engineering and Applications (计算机工程与应用), 2025, Vol. 61, Issue 23: 195-204. DOI: 10.3778/j.issn.1002-8331.2409-0116

• Pattern Recognition and Artificial Intelligence •


Integrating External Knowledge to Enhance Multi-Modal Named Entity Recognition

MA Yupeng, ZHANG Ming, LI Zhiqiang, GAO Ziling   

  1. Hubei Provincial Key Laboratory of Intelligent Robots, Wuhan Institute of Technology, Wuhan 430205, China
  2. School of Mechanical and Electrical Engineering, Wuhan Institute of Technology, Wuhan 430205, China
  • Online: 2025-12-01  Published: 2025-12-01



Abstract: Multi-modal named entity recognition (MNER) aims to exploit information from multiple modalities, such as text and images, to identify entities of predefined types in text. Although existing methods have made notable progress, they still face two challenges: (1) it is difficult to establish a unified representation that bridges the gap between different modalities; (2) it is difficult to achieve efficient semantic interaction between modalities. This paper therefore proposes a multi-modal named entity recognition model enhanced with external knowledge. Firstly, in the modal representation stage, the model introduces the contrastive language-image pre-training (CLIP) model and uses the prior cross-modal knowledge of text and images embedded in it to enhance the semantic representations of both modalities and narrow the modality gap. Secondly, in the modal fusion stage, a cross-modal cross-attention mechanism and a cross-modal gating mechanism are designed to fuse modal information, effectively filtering out noise in the image and further enhancing semantic interaction. Finally, a conditional random field (CRF) is used to recognize the named entities. The proposed method achieves F1 scores of 75.35% and 86.18% on the benchmark datasets Twitter2015 and Twitter2017, respectively, validating its effectiveness.
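The fusion stage described above — text-guided cross-attention over image features followed by a gate that suppresses visual noise — can be illustrated with a minimal numpy sketch. This is an assumption-laden toy, not the authors' implementation: the dimensions, randomly initialised projection matrices, and the residual gated-sum form are all hypothetical stand-ins for the paper's learned parameters.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_fusion(text, image, rng):
    """text: (n_tokens, d) token features; image: (n_regions, d) region features.
    Text tokens attend over image regions (cross-attention), then a sigmoid
    gate decides per token how much visual context to keep."""
    n, d = text.shape
    # Random projections stand in for the model's learned weight matrices.
    Wq, Wk, Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
    Q, K, V = text @ Wq, image @ Wk, image @ Wv
    attn = softmax(Q @ K.T / np.sqrt(d))          # (n_tokens, n_regions)
    visual = attn @ V                             # visual context per token
    # Gate computed from the concatenated text and attended visual features.
    Wg = rng.standard_normal((2 * d, d)) / np.sqrt(2 * d)
    gate = 1.0 / (1.0 + np.exp(-np.concatenate([text, visual], axis=1) @ Wg))
    return text + gate * visual                   # gated residual fusion

rng = np.random.default_rng(0)
fused = cross_modal_fusion(rng.standard_normal((5, 16)),
                           rng.standard_normal((3, 16)), rng)
print(fused.shape)  # (5, 16): one fused vector per text token
```

The gate is what lets the model fall back to the pure text representation when the attended visual context is uninformative, which matches the abstract's goal of excluding image noise before CRF decoding.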

Key words: multi-modal named entity recognition (MNER), contrastive language-image pre-training (CLIP), cross-modal cross-attention mechanism, cross-modal gating mechanism, conditional random field (CRF)