Computer Engineering and Applications ›› 2007, Vol. 43 ›› Issue (33): 1-3.

• Doctoral Forum •

Chinese personal name recognition based on multilevel threshold

YU Zu-bo (余祖波)¹, GAO Qing-shi (高庆狮)¹,², MA Jian-jun (马建军)¹

  1. Department of Computer Science and Engineering, Dalian University of Technology, Dalian, Liaoning 116023, China
  2. Institute of Intelligence, Linguistics and Computer Science, University of Science and Technology Beijing, Beijing 100083, China
  • Online: 2007-11-21 Published: 2007-11-21
  • Contact: YU Zu-bo

Abstract: Based on statistics gathered from a large-scale sample base of personal names, this paper studies the regularities in the characters used for surnames and given names across the various kinds of Chinese personal names. Through statistical analysis of a large-scale corpus, it also obtains, for each surname character, the probability that the character serves as an actual surname in real text, together with its contextual patterns. For Han Chinese personal names, the concept of a multilevel surname threshold is proposed; for the names of Chinese ethnic minorities and for transliterated names, the concept of a multilevel threshold on the first character of the name is proposed. These thresholds are determined with the 3σ rule. Experimental results show that the multilevel-threshold model is effective for recognizing Chinese personal names.

Key words: natural language processing, unknown word recognition, Chinese personal name recognition, multilevel threshold, 3σ rule
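
The abstract does not spell out how the probabilities, levels, or thresholds are computed, so the following Python sketch is only an illustration under stated assumptions, not the authors' method. It assumes that a surname character's probability is its relative frequency as a surname in the corpus, that surname characters are split into three hypothetical levels by that probability, and that each level's threshold is a lower cut-off at the mean minus three standard deviations of the scores of known genuine names. All counts, scores, level boundaries, and function names are invented for the example.

```python
from statistics import mean, pstdev


def surname_probability(as_surname: int, total: int) -> float:
    """Share of a character's corpus occurrences in which it is a real surname."""
    return as_surname / total if total else 0.0


def three_sigma_threshold(scores: list[float]) -> float:
    """Lower cut-off at mean - 3 * (population) standard deviation."""
    return mean(scores) - 3 * pstdev(scores)


# Toy counts, not from the paper: (occurrences as surname, total occurrences).
counts = {"王": (900, 1000), "高": (300, 1000), "于": (20, 1000)}
probs = {ch: surname_probability(a, t) for ch, (a, t) in counts.items()}


def level_of(p: float) -> str:
    """Hypothetical three-way split of surname characters by probability."""
    if p >= 0.5:
        return "high"
    if p >= 0.1:
        return "mid"
    return "low"


# Toy scores of known genuine names, grouped by the level of their surname.
training_scores = {
    "high": [0.30, 0.32, 0.34],
    "mid": [0.50, 0.52, 0.54],
    "low": [0.70, 0.72, 0.74],
}
thresholds = {lvl: three_sigma_threshold(s) for lvl, s in training_scores.items()}


def passes(surname: str, score: float) -> bool:
    """Accept a candidate if its score reaches the threshold of its surname's level."""
    return score >= thresholds[level_of(probs.get(surname, 0.0))]
```

With this toy data, a candidate that begins with a rarely-surname character such as 于 must clear a higher cut-off than one that begins with 王, which is one plausible reading of why a single global threshold might be replaced by several.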