Computer Engineering and Applications ›› 2026, Vol. 62 ›› Issue (8): 231-240. DOI: 10.3778/j.issn.1002-8331.2503-0243

• Pattern Recognition and Artificial Intelligence •

Generative Named Entity Recognition Method Integrating Multimodal Information and Large Language Models

HU Huiyun1,2, GE Yang1, CUI Lingxiao1, TANG Lin1, QI Siyang2, KONG Junda2, XIAO Bo2+

  1. Dezhou Power Supply Company, State Grid Shandong Electric Power Company, Dezhou, Shandong 253000, China
  2. School of Artificial Intelligence, Beijing University of Posts and Telecommunications, Beijing 100876, China
  + Corresponding author. E-mail: xiaobo@bupt.edu.cn
  • Received: 2025-03-20 Revised: 2025-06-20 Online: 2026-04-15 Published: 2026-04-15
  • Supported by:
    Headquarters-Managed Science and Technology Project of State Grid Corporation of China (5700-202416236A-1-1-ZN).

Abstract: To address the complex model architectures and insufficient cross-modal information fusion of existing sequence-labeling-based multimodal named entity recognition methods, a generative multimodal named entity recognition method based on large language models is proposed. The method comprises a three-stage pipeline: modality-specific encoders first extract image and text features separately; a cross-modal attention mechanism then performs noise-robust feature fusion; finally, an instruction-tuned large language model generates structured entity outputs. Its novelty lies in two aspects: it effectively exploits the implicit semantic reasoning ability of large language models, and it builds high-level semantic representations with cross-modal consistency. Experiments on the Twitter-2015 and Twitter-2017 benchmark datasets show that the method achieves F1 scores of 62.7% and 66.8%, respectively, significantly outperforming existing generative methods.

Key words: named entity recognition (NER), large language models (LLMs), multimodal fusion, information extraction
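
The abstract reports F1 scores for entities that the model generates as structured text rather than as per-token labels. As a rough illustration of how such output might be scored, the sketch below parses a generated string into (text, type) pairs and computes a micro-averaged F1 over exact matches. The "TYPE: text; TYPE: text" output format, the function names `parse_entities` and `micro_f1`, and the exact-match criterion are all assumptions made for this example; the abstract does not specify the paper's output format or evaluation protocol.

```python
from typing import Iterable, Set, Tuple

Entity = Tuple[str, str]  # (entity text, entity type)


def parse_entities(generated: str) -> Set[Entity]:
    """Parse a hypothetical 'TYPE: text; TYPE: text' string into a set of entities.

    The format is an assumption for illustration, not the paper's actual output format.
    """
    entities: Set[Entity] = set()
    for chunk in generated.split(";"):
        if ":" not in chunk:
            continue  # skip empty or malformed fragments
        etype, text = chunk.split(":", 1)
        etype, text = etype.strip().upper(), text.strip()
        if etype and text:
            entities.add((text, etype))
    return entities


def micro_f1(predictions: Iterable[Set[Entity]],
             references: Iterable[Set[Entity]]) -> float:
    """Micro-averaged F1 over exact (text, type) matches, pooled across samples.

    Using sets means a repeated mention within one sample is counted once.
    """
    tp = n_pred = n_gold = 0
    for pred, gold in zip(predictions, references):
        tp += len(pred & gold)   # entities matching in both text and type
        n_pred += len(pred)
        n_gold += len(gold)
    if tp == 0:
        return 0.0               # also covers empty predictions or references
    precision = tp / n_pred
    recall = tp / n_gold
    return 2 * precision * recall / (precision + recall)
```

For instance, if the model generates "PER: Kevin Durant; ORG: Thunder" for one post and "LOC: Oklahoma" for another whose gold entity is "Oklahoma City", two of three predictions match two of three gold entities, so precision and recall are both 2/3. Exact matching penalizes the partial span, which is the usual strictness in NER evaluation, though a different protocol may apply in the paper.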