[1] GRISHMAN R, SUNDHEIM B. Message understanding conference-6: a brief history[C]//Proceedings of the 16th Conference on Computational Linguistics, 1996: 466.
[2] TJONG KIM SANG E F, DE MEULDER F. Introduction to the CoNLL-2003 shared task: language-independent named entity recognition[C]//Proceedings of the 7th Conference on Natural Language Learning, 2003: 142-147.
[3] LI L, XI X F, SHENG S L, et al. Research progress on Chinese named entity recognition based on deep learning[J]. Computer Engineering and Applications, 2023, 59(24): 46-69.
[4] LI J, SUN A X, HAN J L, et al. A survey on deep learning for named entity recognition[J]. IEEE Transactions on Knowledge and Data Engineering, 2022, 34(1): 50-70.
[5] ZHAO J G, QIAN Y R, WANG K, et al. Survey of Chinese named entity recognition research[J]. Computer Engineering and Applications, 2024, 60(1): 15-27.
[6] MOON S, NEVES L, CARVALHO V. Multimodal named entity recognition for short social media posts[J]. arXiv:1802.07862, 2018.
[7] XU B, HUANG S Z, SHA C F, et al. MAF: a general matching and alignment framework for multimodal named entity recognition[C]//Proceedings of the 15th ACM International Conference on Web Search and Data Mining. New York: ACM, 2022: 1215-1223.
[8] WANG X Y, GUI M, JIANG Y, et al. ITA: image-text alignments for multi-modal named entity recognition[J]. arXiv:2112.06482, 2021.
[9] WANG P, CHEN X H, SHANG Z Y, et al. Multimodal named entity recognition with bottleneck fusion and contrastive learning[J]. IEICE Transactions on Information and Systems, 2023, E106.D(4): 545-555.
[10] VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need[C]//Proceedings of the 31st International Conference on Neural Information Processing Systems, 2017: 6000-6010.
[11] YU J F, JIANG J, YANG L, et al. Improving multimodal named entity recognition via entity span detection with unified multimodal transformer[C]//Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Stroudsburg: ACL, 2020: 3342-3352.
[12] ZHANG D, WEI S Z, LI S S, et al. Multi-modal graph fusion for named entity recognition with targeted visual guidance[C]//Proceedings of the AAAI Conference on Artificial Intelligence, 2021: 14347-14355.
[13] RADFORD A, KIM J W, HALLACY C, et al. Learning transferable visual models from natural language supervision[C]//Proceedings of the International Conference on Machine Learning, 2021: 8748-8763.
[14] TSAI Y H, BAI S, LIANG P, et al. Multimodal Transformer for unaligned multimodal language sequences[C]//Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Stroudsburg: ACL, 2019: 6558-6569.
[15] LAFFERTY J, MCCALLUM A, PEREIRA F. Conditional random fields: probabilistic models for segmenting and labeling sequence data[C]//Proceedings of the 18th International Conference on Machine Learning, 2001: 282-289.
[16] COLLOBERT R, WESTON J, BOTTOU L, et al. Natural language processing (almost) from scratch[J]. Journal of Machine Learning Research, 2011, 12: 2493-2537.
[17] HUANG Z, XU W, YU K. Bidirectional LSTM-CRF models for sequence tagging[J]. arXiv:1508.01991, 2015.
[18] DEVLIN J, CHANG M W, LEE K, et al. BERT: pre-training of deep bidirectional Transformers for language understanding[J]. arXiv:1810.04805, 2018.
[19] CONNEAU A, KHANDELWAL K, GOYAL N, et al. Unsupervised cross-lingual representation learning at scale[J]. arXiv:1911.02116, 2019.
[20] LI X J, SUN G L, LIU X Y. ESPVR: entity spans position visual regions for multimodal named entity recognition[C]//Findings of the Association for Computational Linguistics: EMNLP 2023. Stroudsburg: ACL, 2023: 7785-7794.
[21] CHEN X, ZHANG N, LI L, et al. Good visual guidance makes a better extractor: hierarchical visual prefix for multimodal entity and relation extraction[J]. arXiv:2205.03521, 2022.
[22] ZHOU B H, ZHANG Y, SONG K H, et al. A span-based multimodal variational autoencoder for semi-supervised multimodal named entity recognition[C]//Proceedings of the Conference on Empirical Methods in Natural Language Processing. Stroudsburg: ACL, 2022: 6293-6302.
[23] LIU L P, WANG M L, ZHANG M Z, et al. UAMNer: uncertainty-aware multimodal named entity recognition in social media posts[J]. Applied Intelligence, 2022, 52(4): 4109-4125.
[24] SUN L, WANG J Q, SU Y D, et al. RIVA: a pre-trained tweet multimodal model based on text-image relation for multimodal NER[C]//Proceedings of the 28th International Conference on Computational Linguistics, 2020: 1852-1862.
[25] JIA M, SHEN L, SHEN X, et al. MNER-QG: an end-to-end MRC framework for multimodal named entity recognition with query grounding[C]//Proceedings of the AAAI Conference on Artificial Intelligence, 2023: 8032-8040.
[26] WANG J, YANG Y, LIU K Y, et al. M3S: scene graph driven multi-granularity multi-task learning for multi-modal NER[J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2023, 31: 111-120.
[27] GONG Y C, LV X Q, YUAN Z, et al. GNN-based multimodal named entity recognition[J]. The Computer Journal, 2024, 67(8): 2622-2632.
[28] ZHANG Z X, CHEN J Y, LIU X J, et al. ‘What’ and ‘where’ both matter: dual cross-modal graph convolutional networks for multimodal named entity recognition[J]. International Journal of Machine Learning and Cybernetics, 2024, 15(6): 2399-2409.
[29] WANG X W, TIAN J F, GUI M, et al. PromptMNER: prompt-based entity-related visual clue extraction and integration for multimodal named entity recognition[C]//Proceedings of the International Conference on Database Systems for Advanced Applications, 2022: 297-305.
[30] LI J, LI H, SUN D, et al. LLMs as bridges: reformulating grounded multimodal named entity recognition[J]. arXiv:2402.09989, 2024.
[31] ALAYRAC J B, DONAHUE J, LUC P, et al. Flamingo: a visual language model for few-shot learning[C]//Proceedings of the 36th International Conference on Neural Information Processing Systems, 2022: 23716-23736.
[32] LI J, LI D, XIONG C, et al. BLIP: bootstrapping language-image pre-training for unified vision-language understanding and generation[C]//Proceedings of the International Conference on Machine Learning, 2022: 12888-12900.
[33] LI J, LI D, SAVARESE S, et al. BLIP-2: bootstrapping language-image pre-training with frozen image encoders and large language models[C]//Proceedings of the International Conference on Machine Learning, 2023: 19730-19742.
[34] TOUVRON H, LAVRIL T, IZACARD G, et al. LLaMA: open and efficient foundation language models[J]. arXiv:2302.13971, 2023.
[35] TJONG KIM SANG E F, VEENSTRA J. Representing text chunks[C]//Proceedings of the 9th Conference of the European Chapter of the Association for Computational Linguistics, 1999: 173-179.
[36] HE K M, ZHANG X Y, REN S Q, et al. Deep residual learning for image recognition[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2016: 770-778.
[37] ZHANG Q, FU J L, LIU X Y, et al. Adaptive co-attention network for named entity recognition in tweets[C]//Proceedings of the AAAI Conference on Artificial Intelligence, 2018: 5674-5681.
[38] LU D, NEVES L, CARVALHO V, et al. Visual attention model for name tagging in multimodal social media[C]//Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics. Stroudsburg: ACL, 2018: 1990-1999.
[39] MA X, HOVY E. End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF[J]. arXiv:1603.01354, 2016.
[40] SOUZA F, NOGUEIRA R, LOTUFO R. BERTimbau: pretrained BERT models for Brazilian Portuguese[C]//Proceedings of the 9th Brazilian Conference on Intelligent Systems, 2020: 403-417.
[41] LIU P P, WANG G S, LI H, et al. Multi-granularity cross-modal representation learning for named entity recognition on social media[J]. Information Processing & Management, 2024, 61(1): 103546.