Review of Research on Named Entity Recognition Technologies for Electronic Medical Records

doi:10.3778/j.issn.1002-8331.2204-0272

Abstract

Electronic medical records (EMR) are a product of the rapid development of medical information technology and are currently stored as unstructured text. Using natural language processing (NLP) techniques to extract large numbers of medical entities from this unstructured text helps medical staff consult records more efficiently, and the recognized entities also support subsequent research such as relation extraction and knowledge graph construction. This paper introduces several commonly used datasets, corpus annotation standards, and evaluation metrics. It then describes named entity recognition methods for electronic medical records in detail from four aspects: early traditional methods, deep learning methods, pre-trained models, and handling of the small-sample problem, and compares the strengths and limitations of each model. Finally, it discusses the shortcomings of current research and outlines directions for future development.

Key words: electronic medical records (EMR), natural language processing (NLP), named entity recognition, deep learning

WU Zhiyan, JIN Wei, YUE Lu, SHENG Hui. Review of Research on Named Entity Recognition Technologies for Electronic Medical Records[J]. Computer Engineering and Applications, 2022, 58(21): 13-29.
