Computer Engineering and Applications ›› 2024, Vol. 60 ›› Issue (11): 147-155. DOI: 10.3778/j.issn.1002-8331.2302-0321

• Pattern Recognition and Artificial Intelligence •

Medical Named Entity Recognition Incorporating Word Information and Graph Attention

ZHAO Zhenzhen, DONG Yanru, LIU Jing, ZHANG Junzhong, CAO Hui

  1. Shandong University of Traditional Chinese Medicine, Jinan 250000, China
  • Online: 2024-06-01  Published: 2024-05-31


Abstract: Chinese clinical text is rich in medical-record information, and named entity recognition on electronic medical records helps build computer-aided diagnosis systems; it is of great significance to the development of the medical field and also benefits downstream tasks such as relation extraction and knowledge-graph construction. However, Chinese electronic medical records are hard to segment into words, dense with medical terminology, and full of special expressions, which easily leads to faulty text-feature representations. This paper therefore proposes a medical named entity recognition model that fuses enhanced word information with graph attention, improving performance by strengthening both local and global features. Because embedding single character vectors for Chinese entity recognition easily loses the word-level information and semantics of the text, the model embeds into each character vector the word vectors highly associated with that character, enriching the text representation while avoiding word-segmentation errors. The embedding layer also incorporates MedBert, a model pretrained on medical knowledge that generates feature vectors dynamically according to context, which helps resolve polysemy and specialized vocabulary in electronic medical records. In addition, a graph-attention module added to the encoding layer strengthens the model's ability to learn contextual relations and the special grammar of medical text. The model achieves F1 scores of 86.38% and 84.76% on the cEHRNER and cMedQANER datasets, respectively, showing better robustness than the compared models.
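The character–word fusion step the abstract describes (embedding the word vectors highly associated with a character into that character's vector) can be sketched as below. The mean-pooling scheme, dimensions, and function name are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def fuse_char_word(char_vec, matched_word_vecs):
    """Augment a character embedding with the embeddings of lexicon
    words that contain this character (illustrative scheme:
    mean-pool the matched word vectors, then concatenate)."""
    if matched_word_vecs:
        word_part = np.mean(matched_word_vecs, axis=0)
    else:
        word_part = np.zeros_like(char_vec)  # no lexicon match
    return np.concatenate([char_vec, word_part])

# Toy example: a 50-dim character vector and two 50-dim matched word vectors.
rng = np.random.default_rng(0)
c = rng.normal(size=50)
words = [rng.normal(size=50), rng.normal(size=50)]
fused = fuse_char_word(c, words)
print(fused.shape)  # (100,)
```

Concatenation (rather than summation) keeps the character and word signals separable for the downstream encoder; the zero-vector fallback keeps dimensions consistent when no lexicon word matches.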

Key words: graph attention networks, word embedding, named entity recognition, BERT model


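The graph-attention module mentioned in the abstract can be illustrated with a minimal single-head GAT-style layer over a token graph. The adjacency structure, dimensions, and LeakyReLU slope below are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def gat_layer(H, A, W, a, slope=0.2):
    """Single-head graph-attention layer (GAT-style).
    H: (n, d_in) node features; A: (n, n) 0/1 adjacency with self-loops;
    W: (d_in, d_out) projection; a: (2*d_out,) attention vector."""
    Z = H @ W                                    # project node features
    d = Z.shape[1]
    # Attention logits e_ij = LeakyReLU(a^T [z_i || z_j]).
    src = Z @ a[:d]                              # contribution of z_i
    dst = Z @ a[d:]                              # contribution of z_j
    e = src[:, None] + dst[None, :]
    e = np.where(e > 0, e, slope * e)            # LeakyReLU
    e = np.where(A > 0, e, -1e9)                 # mask non-edges
    att = np.exp(e - e.max(axis=1, keepdims=True))
    att /= att.sum(axis=1, keepdims=True)        # row-wise softmax
    return att @ Z                               # aggregate neighbours

# Toy graph: 4 tokens in a chain with self-loops, projected from 8 to 4 dims.
rng = np.random.default_rng(1)
H = rng.normal(size=(4, 8))
A = np.eye(4) + np.eye(4, k=1) + np.eye(4, k=-1)
out = gat_layer(H, A, rng.normal(size=(8, 4)), rng.normal(size=8))
print(out.shape)  # (4, 4)
```

Masking non-edges with a large negative value before the softmax restricts each token to attend only to its graph neighbours, which is how a graph-attention encoder can capture the contextual relations described in the abstract.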