Computer Engineering and Applications (计算机工程与应用) ›› 2014, Vol. 50 ›› Issue (6): 106-111.

• Database, Data Mining, Machine Learning •

Research on cross-document Chinese personal name disambiguation based on hierarchical clustering

张菲菲1,李宗海2,周晓辉1,李晓戈1,2   

  1. 西安邮电大学,西安 710121
  2. 济南中林信息科技有限公司,济南 250100
  • Publication date: 2014-03-15 Release date: 2015-05-12

Cross-document Chinese personal name entity disambiguation based on hierarchical clustering

ZHANG Feifei1, LI Zonghai2, ZHOU Xiaohui1, LI Xiaoge1,2   

  1. Xi’an University of Posts & Telecommunications, Xi’an 710121, China
  2. Jinan Zhonglin Information Technology Co., Ltd, Jinan 250100, China
  • Online: 2014-03-15 Published: 2015-05-12

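Abstract (translated from Chinese): Personal name disambiguation has become an important and pressing problem in natural language processing and information extraction applications. A Chinese natural language processing and information extraction system is used to recognize named entities and entity relations and to generate entity information objects (Entity Profiles, EPs). Using the personal-information features, entity relations, and contextual information contained in each EP, entity disambiguation is performed with agglomerative hierarchical clustering on the Hadoop platform. The web-wide news corpus compiled by Harbin Institute of Technology serves as the training and test data for name disambiguation. The study focuses on feature selection for Chinese personal name disambiguation and on the determination and validation of parameters, achieving F-measures of 91.33% on the training set and 88.73% on the test set, which indicates that the proposed method is feasible.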

Keywords (translated from Chinese): personal name disambiguation, information extraction, similarity, hierarchical clustering

Abstract: Cross-document entity disambiguation is the problem of determining whether mentions in different documents refer to the same entity or to distinct entities. This paper describes a Chinese information extraction (IE) system that combines document-level and corpus-level IE in a pipelined, multi-level modular approach to named entity and Entity Profile extraction. It introduces novel features based on document-level entity profiles and studies how feature selection, parameter selection, and parameter validation influence the results. Disambiguation is performed by agglomerative hierarchical clustering on Hadoop. Experiments on the web-wide news corpus from Harbin Institute of Technology yield an F-measure of 91.33% on the training set and 88.73% on the test set.

Key words: entity disambiguation, information extraction, similarity, hierarchical clustering
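
As an illustration of the clustering step named in the abstract, the following is a minimal single-machine sketch of agglomerative hierarchical clustering over entity profiles. It is not the authors' implementation: representing each profile as a set of feature strings, the Jaccard similarity, the average-linkage rule, the stopping threshold of 0.3, and the names `jaccard`, `average_link`, and `cluster_mentions` are all assumptions made for this example, and the Hadoop distribution of the computation is not modeled.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity between two feature sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def average_link(c1, c2, profiles):
    """Average pairwise similarity between the members of two clusters."""
    total = sum(jaccard(profiles[i], profiles[j]) for i in c1 for j in c2)
    return total / (len(c1) * len(c2))


def cluster_mentions(profiles, threshold=0.3):
    """Merge the most similar pair of clusters until no pair reaches `threshold`.

    `profiles` is a list of feature sets, one per mention of the same name.
    Returns a list of clusters, each a list of mention indices.
    The threshold value is a hypothetical parameter, not one from the paper.
    """
    clusters = [[i] for i in range(len(profiles))]  # start with singletons
    while len(clusters) > 1:
        best_sim, best_pair = -1.0, None
        for x, y in combinations(range(len(clusters)), 2):
            sim = average_link(clusters[x], clusters[y], profiles)
            if sim > best_sim:
                best_sim, best_pair = sim, (x, y)
        if best_sim < threshold:
            break  # remaining clusters are treated as distinct persons
        x, y = best_pair  # x < y, so deleting index y leaves x valid
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return clusters


# Four mentions of one ambiguous name with made-up profile features.
profiles = [
    {"professor", "Xi'an", "computer science"},
    {"professor", "computer science", "university"},
    {"footballer", "Jinan", "league"},
    {"footballer", "league", "coach"},
]
result = cluster_mentions(profiles)
```

With these toy profiles, mentions 0 and 1 share two of four features, as do mentions 2 and 3, so each pair merges and the two resulting clusters (similarity 0) stay separate. Raising the threshold makes the clustering more conservative, which is the kind of parameter choice the abstract says was tuned and validated.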