Computer Engineering and Applications ›› 2021, Vol. 57 ›› Issue (18): 156-162. DOI: 10.3778/j.issn.1002-8331.2005-0355

• Pattern Recognition and Artificial Intelligence •

Chinese Named Entity Recognition Based on XLnet Language Model

YAO Guibin (姚贵斌), ZHANG Qigui (张起贵)

  1. School of Information and Computer, Taiyuan University of Technology, Jinzhong, Shanxi 030600, China
  • Online: 2021-09-15 Published: 2021-09-13

Abstract:

How a language model is built directly affects how much semantic information can be extracted from within a sentence, and the semantic representation of characters is the key to improving the recognition rate of Chinese named entities. Traditional Chinese named entity recognition algorithms do not fully exploit the hidden information inside a sentence. To address this, the paper uses an LSTM to extract features from character vectors pretrained on a large-scale corpus, and passes a word-vector prediction matrix into the character-feature extraction stage, where matrix operations fuse the two into word-vector features. A CNN then extracts spatial information between words, which is combined with the word-vector features and fed into the language model XLnet (Generalized autoregressive pretraining for language understanding); a BiGRU-CRF layer finally outputs the optimal tag sequence. The resulting network framework is named CAW-XLnet-BiGRU-CRF. In a comparison with other language models, the experimental results show that the framework addresses the insufficient mining of hidden intra-sentence information and reaches an F1 value of 95.73% on the January 1998 People's Daily data set, so it can be applied well to Chinese named entity recognition tasks.

Key words: named entity recognition, word vector, XLnet, language model
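
The abstract states that the BiGRU-CRF stage outputs the optimal tag sequence but gives no decoding details. The sketch below illustrates the standard way a CRF layer is usually decoded, Viterbi dynamic programming, and assumes (without confirmation from the paper) that the framework does the same. The tag set, the scores, and the names `viterbi_decode`, `TAGS`, `START`, `TRANSITIONS` and `EMISSIONS` are hypothetical and chosen only for illustration; in a real model the emission scores would come from the BiGRU and the transition scores would be learned.

```python
from typing import List, Sequence


def viterbi_decode(
    emissions: Sequence[Sequence[float]],
    transitions: Sequence[Sequence[float]],
    start: Sequence[float],
) -> List[int]:
    """Return the highest-scoring tag index sequence.

    emissions[t][j]   : score of tag j at position t (hypothetical encoder output)
    transitions[i][j] : score of moving from tag i to tag j
    start[j]          : score of starting the sequence with tag j
    """
    n_tags = len(transitions)
    # score[j] = best score of any path that ends in tag j at the current position
    score = [start[j] + emissions[0][j] for j in range(n_tags)]
    backpointers: List[List[int]] = []

    for t in range(1, len(emissions)):
        new_score, pointers = [], []
        for j in range(n_tags):
            # Best previous tag for reaching tag j at position t
            best_prev = max(range(n_tags), key=lambda i: score[i] + transitions[i][j])
            new_score.append(score[best_prev] + transitions[best_prev][j] + emissions[t][j])
            pointers.append(best_prev)
        score = new_score
        backpointers.append(pointers)

    # Trace the best path backwards from the best final tag
    path = [max(range(n_tags), key=lambda j: score[j])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


# Hypothetical three-tag scheme and made-up scores (not taken from the paper).
TAGS = ["O", "B-PER", "I-PER"]
IMPOSSIBLE = -1e4  # large penalty: I-PER may not start a sentence or follow O
START = [0.0, 0.0, IMPOSSIBLE]
TRANSITIONS = [
    [0.0, 0.0, IMPOSSIBLE],  # from O
    [0.0, 0.0, 1.0],         # from B-PER
    [0.0, 0.0, 1.0],         # from I-PER
]
EMISSIONS = [
    [1.0, 0.8, 0.0],  # position 0: per-position argmax would pick O
    [0.2, 0.1, 1.5],  # position 1: per-position argmax would pick I-PER
    [2.0, 0.0, 0.1],  # position 2: per-position argmax would pick O
]

best = viterbi_decode(EMISSIONS, TRANSITIONS, START)
print([TAGS[i] for i in best])  # ['B-PER', 'I-PER', 'O']
```

Picking the best tag at each position independently would give O, I-PER, O here, which is not a valid label sequence; scoring whole paths with transition constraints is what lets a CRF layer return a consistent sequence instead.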