Computer Engineering and Applications, 2025, Vol. 61, Issue (17): 222-231. DOI: 10.3778/j.issn.1002-8331.2405-0169

• Pattern Recognition and Artificial Intelligence •

融合词汇增强和跨度方法的中医药命名实体识别

叶青,赖煊,程春雷,杨琴   

  1. 江西中医药大学 计算机学院,南昌 330004
  2. 江西中医药大学 中医人工智能重点研究室,南昌 330004
  • 出版日期:2025-09-01 发布日期:2025-09-01

Named Entity Recognition for Traditional Chinese Medicine with Lexical Enhancement and Span Method

YE Qing, LAI Xuan, CHENG Chunlei, YANG Qin   

  1. School of Computer Science, Jiangxi University of Chinese Medicine, Nanchang 330004, China
  2. Key Laboratory of Artificial Intelligence in Chinese Medicine, Jiangxi University of Chinese Medicine, Nanchang 330004, China
  • Online: 2025-09-01 Published: 2025-09-01

摘要: 中医药命名实体识别旨在从非结构化的中医药文本中识别出相应的实体及其类别,采用人工识别效率不高。然而,传统的中文命名实体识别模型缺少中医药文本中的特征信息且一般采用序列标注方式解码,无法解决中医药实体识别中突出存在的实体边界识别模糊和实体嵌套性错误等问题。为解决上述问题,提出融合词汇增强与跨度方法的中医药命名实体识别模型TCM-NER来提升实体识别性能。根据词汇匹配获得文本中的词汇信息并利用相对位置构建中医药文本词格结构;通过特征提取模块分别提取字、词汇和相对位置编码向量;采用FLAT(flat-lattice Transformer)模型进行特征整合,从而获得<字-词汇-跨度>混合特征,提高模型边界识别性能;将混合特征输入双仿射分类器预测实体及其类别。实验结果表明,TCM-NER模型在两个中医药数据集的Micro-F1值分别达到了70.53%和75.91%,证明了该模型在中医药实体识别中的实用价值。

关键词: 词汇增强, 跨度方法, 命名实体识别, 中医药(TCM), 双仿射分类器

Abstract: Named entity recognition (NER) for traditional Chinese medicine (TCM) aims to identify entities and their categories in unstructured TCM text, a task that is inefficient when performed manually. However, conventional Chinese NER models lack TCM-specific feature information and typically decode by sequence labeling, so they cannot address two prominent problems in TCM entity recognition: ambiguous entity boundaries and errors on nested entities. To solve these problems, a model named TCM-NER is proposed that combines lexical enhancement with a span method to improve entity recognition performance. First, lexical information is obtained from the text by lexicon matching, and relative positions are used to construct a flat-lattice structure for the TCM text. Second, a feature extraction module extracts character embeddings, word embeddings, and relative position encodings. Third, a FLAT (flat-lattice Transformer) model integrates these features into a <character-lexical-span> fusion feature, which improves boundary recognition. Finally, the fusion feature is fed into a biaffine classifier to predict entities and their categories. Experimental results show that TCM-NER achieves Micro-F1 scores of 70.53% and 75.91% on two TCM datasets, demonstrating its practical value for TCM entity recognition.

Key words: lexical enhancement, span method, named entity recognition, traditional Chinese medicine (TCM), biaffine classifier
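
The pipeline summarized in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It only shows (a) how lexicon matching can turn a sentence into a flat lattice of character and word tokens with head/tail indices, (b) the four relative-distance matrices that a flat-lattice Transformer uses as position information, and (c) a commonly used biaffine form for scoring a candidate span. The function names, the toy lexicon, the example sentence, and the exact biaffine formula are assumptions made for illustration; the abstract does not specify them, and learned embeddings and the Transformer layers are omitted.

```python
from typing import List, Set, Tuple

# (text, head, tail) with inclusive character indices; hypothetical type name.
Token = Tuple[str, int, int]


def build_flat_lattice(sentence: str, lexicon: Set[str],
                       max_word_len: int = 4) -> List[Token]:
    """Return all characters followed by every lexicon word found in the sentence."""
    tokens: List[Token] = [(ch, i, i) for i, ch in enumerate(sentence)]
    n = len(sentence)
    for start in range(n):
        for length in range(2, max_word_len + 1):
            end = start + length
            if end > n:
                break
            word = sentence[start:end]
            if word in lexicon:
                tokens.append((word, start, end - 1))
    return tokens


def relative_distances(tokens: List[Token]):
    """Head/tail distance matrices between every pair of lattice tokens."""
    heads = [h for _, h, _ in tokens]
    tails = [t for _, _, t in tokens]
    d_hh = [[hi - hj for hj in heads] for hi in heads]
    d_ht = [[hi - tj for tj in tails] for hi in heads]
    d_th = [[ti - hj for hj in heads] for ti in tails]
    d_tt = [[ti - tj for tj in tails] for ti in tails]
    return d_hh, d_ht, d_th, d_tt


def biaffine_score(h_start: List[float], h_end: List[float],
                   U: List[List[float]], w: List[float], b: float) -> float:
    """Score one span for one label: h_start^T U h_end + w . [h_start; h_end] + b."""
    bilinear = sum(h_start[p] * U[p][q] * h_end[q]
                   for p in range(len(h_start))
                   for q in range(len(h_end)))
    linear = sum(wk * xk for wk, xk in zip(w, h_start + h_end))
    return bilinear + linear + b


# Toy example: two lexicon words matched in a four-character sentence.
lattice = build_flat_lattice("人参补气", {"人参", "补气"})
d_hh, d_ht, d_th, d_tt = relative_distances(lattice)
```

In this toy example the lattice holds the four characters plus the matched words 人参 (positions 0 to 1) and 补气 (positions 2 to 3). Because every span is scored independently, a span-based decoder of this kind can in principle assign labels to overlapping or nested spans, which sequence labeling cannot do directly.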