Study of personal name recognition

doi:10.3778/j.issn.1002-8331.2008.21.044

Computer Engineering and Applications, 2008, Vol. 44, Issue (21): 157-161. DOI: 10.3778/j.issn.1002-8331.2008.21.044

ZHANG Su-xiang¹, ZHANG Su-xian², WANG Xiao-jie³

1. Department of Electronic and Communication Engineering of North China Electric Power University, Baoding, Hebei 071003, China
2. Health Vocational and Technical Colleges of Hebei University, Baoding, Hebei 071000, China
3. School of Information Engineering of Beijing University of Posts and Telecommunications, Beijing 100876, China

Received: 2008-04-30  Revised: 2008-06-18  Online: 2008-07-21  Published: 2008-07-21
Contact: ZHANG Su-xiang

Abstract

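Abstract (translated from the Chinese): To address the difficulties of Chinese personal name recognition, this paper proposes a recognition method based on the maximum entropy algorithm that combines multiple knowledge sources and multiple models, taking full account of both the internal (fine-grained) features of personal names and their contextual information. The main contributions are: probabilistic information is incorporated into the maximum entropy model, which greatly improves precision and recall for personal names; the classification model is refined by dividing personal name recognition into Chinese name recognition, foreign transliterated name recognition, and single-character name recognition; and a dynamic priority method is proposed to prevent a foreign transliterated name from being partially recognized as one or more Chinese names. The test data are the People's Daily corpus of January 1998 and the SIGHAN (2006) named entity test corpora. On People's Daily (1998-01), recall is 90.06% and precision is 89.27%; on the SIGHAN (MSRA) corpus, recall is 95.39% and precision is 96.71%; on the SIGHAN (LDC) corpus, recall is 87.56% and precision is 91.04%. The experimental results show that the proposed personal name recognition method is highly effective.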

Abstract: A new approach to personal name recognition is proposed that combines multiple knowledge sources and multiple models and considers both the internal features of personal names and their context information. This paper proposes a Maximum Entropy (ME) model based on probabilistic features for personal name recognition. Probabilistic feature functions are used instead of binary feature functions, which is one of several differences between this model and most previous ME-based models. We also explore confidence functions. We use sub-models to model Chinese personal names, foreign names, and single-character names respectively. A dynamic priority method is used to prevent a foreign personal name from being split into a Chinese personal name and a remainder. Experimental results show that the ME model combining these elements brings significant improvements: recall is 90.06% and precision is 89.27% on People's Daily (1998/01), recall is 95.39% and precision is 96.71% on the SIGHAN MSRA corpus, and recall is 87.56% and precision is 91.04% on the SIGHAN LDC corpus.

Key words: maximum entropy, probability feature, confidence function, evaluation
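
The contrast between binary and probabilistic feature functions mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the label set (`B-PER`, `O`), the `SURNAME_PROB` table and its values, the feature names, and the weight are all hypothetical and chosen only for illustration. The sketch assumes the standard conditional maximum entropy form p(y | x) = exp(Σᵢ wᵢ·fᵢ(x, y)) / Z(x), and shows that a real-valued feature lets the model separate a strong surname cue from a weak one, which a 0/1 feature cannot do.

```python
import math
from typing import Callable, Dict, Sequence

Context = Dict[str, str]
Feature = Callable[[Context, str], float]

# Hypothetical statistics: how often a character begins a person name in some
# training corpus. The numbers are invented for illustration only.
SURNAME_PROB: Dict[str, float] = {"张": 0.90, "王": 0.85, "高": 0.40}


def binary_surname(context: Context, label: str) -> float:
    """Binary feature: returns 1.0 whenever the character is a listed surname."""
    return 1.0 if label == "B-PER" and context["char"] in SURNAME_PROB else 0.0


def prob_surname(context: Context, label: str) -> float:
    """Probabilistic feature: returns the corpus probability instead of 1.0."""
    return SURNAME_PROB.get(context["char"], 0.0) if label == "B-PER" else 0.0


def maxent_probs(
    context: Context,
    labels: Sequence[str],
    features: Sequence[Feature],
    weights: Sequence[float],
) -> Dict[str, float]:
    """Conditional maximum entropy distribution over labels for one context."""
    scores = [
        sum(w * f(context, y) for f, w in zip(features, weights)) for y in labels
    ]
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    z = sum(exps)  # normalization constant Z(x)
    return {y: e / z for y, e in zip(labels, exps)}


LABELS = ["B-PER", "O"]  # assumed tag set: start of a person name vs. other

# With the probabilistic feature, a strong cue and a weak cue get different scores.
strong = maxent_probs({"char": "张"}, LABELS, [prob_surname], [2.0])
weak = maxent_probs({"char": "高"}, LABELS, [prob_surname], [2.0])

# With the binary feature, both characters look identical to the model.
bin_strong = maxent_probs({"char": "张"}, LABELS, [binary_surname], [2.0])
bin_weak = maxent_probs({"char": "高"}, LABELS, [binary_surname], [2.0])
```

In a real system the weights would be estimated from annotated data rather than fixed by hand, and many features over the name's internal characters and surrounding context would be combined.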

ZHANG Su-xiang, ZHANG Su-xian, WANG Xiao-jie. Study of personal name recognition[J]. Computer Engineering and Applications, 2008, 44(21): 157-161.
