Computer Engineering and Applications ›› 2014, Vol. 50 ›› Issue (6): 106-111.
ZHANG Feifei1, LI Zonghai2, ZHOU Xiaohui1, LI Xiaoge1,2
张菲菲1,李宗海2,周晓辉1,李晓戈1,2
Abstract: Cross-document entity disambiguation is the problem of determining whether mentions in different documents refer to the same entity or to distinct entities. This paper describes a Chinese information extraction (IE) system that combines document-level and corpus-level IE, using a pipelined, multi-level modular approach to named entity and Entity Profile extraction. It introduces novel features based on document-level entity profiles and studies how feature selection, parameter selection, and parameter validation influence the results. Disambiguation is performed by agglomerative hierarchical clustering on Hadoop. In experiments on the whole-web news corpus from Harbin Institute of Technology, the F-measure is 91.33% on the training set and 88.73% on the test set.
Key words: entity disambiguation, information extraction, similarity, hierarchical clustering
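Abstract (translated from the Chinese): Personal name disambiguation has become an important problem in urgent need of a solution in natural language processing and information extraction applications. A Chinese natural language processing and information extraction system is used to recognize named entities and entity relations and to generate Entity Profiles (EP). Using the personal information features, entity relations, and contextual information contained in the EPs, the entity disambiguation problem is solved with agglomerative hierarchical clustering on the Hadoop platform. The whole-web news corpus compiled by Harbin Institute of Technology serves as training and test data for name disambiguation. The study focuses on feature selection for Chinese personal name disambiguation and on the determination and validation of parameters, and achieves F-measures of 91.33% and 88.73% on the training and test sets respectively, indicating that the proposed method is feasible.
Keywords (translated from the Chinese): personal name disambiguation, information extraction, similarity, hierarchical clustering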
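The clustering step named in the abstract can be illustrated with a minimal, single-machine sketch. It is not the authors' implementation: the abstract does not specify the similarity measure, the linkage criterion, or the stopping rule, so Jaccard similarity over feature sets, average linkage, and a fixed similarity threshold are all assumptions made here for illustration. The example profiles are hypothetical, and the Hadoop distribution described in the paper is not modeled.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity between two feature sets (an assumed stand-in measure)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def cluster_similarity(c1, c2, profiles):
    """Average linkage (assumed): mean pairwise similarity across two clusters."""
    total = sum(jaccard(profiles[i], profiles[j]) for i in c1 for j in c2)
    return total / (len(c1) * len(c2))


def agglomerative_cluster(profiles, threshold):
    """Merge the most similar pair of clusters until no pair reaches `threshold`."""
    # Start with every mention in its own cluster (clusters hold profile indices).
    clusters = [[i] for i in range(len(profiles))]
    while len(clusters) > 1:
        best_sim, best_pair = -1.0, None
        for a, b in combinations(range(len(clusters)), 2):
            sim = cluster_similarity(clusters[a], clusters[b], profiles)
            if sim > best_sim:
                best_sim, best_pair = sim, (a, b)
        if best_sim < threshold:
            break  # remaining clusters are treated as distinct entities
        a, b = best_pair  # a < b, so deleting index b leaves index a valid
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


# Hypothetical feature sets for four mentions sharing one personal name.
profiles = [
    {"professor", "Xi'an", "computer science"},
    {"professor", "computer science", "university"},
    {"footballer", "Beijing", "striker"},
    {"footballer", "striker", "league"},
]
clusters = agglomerative_cluster(profiles, threshold=0.3)
print(clusters)
```

In this sketch the threshold plays the role of the tunable parameter: a higher value yields more, smaller clusters (more distinct persons), and a lower value merges more mentions into the same person.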
ZHANG Feifei1, LI Zonghai2, ZHOU Xiaohui1, LI Xiaoge1,2. Cross-document Chinese personal name entity disambiguation based on hierarchical clustering[J]. Computer Engineering and Applications, 2014, 50(6): 106-111.
张菲菲1,李宗海2,周晓辉1,李晓戈1,2. 基于层次聚类的跨文本中文人名消歧研究[J]. 计算机工程与应用, 2014, 50(6): 106-111.
URL: http://cea.ceaj.org/EN/
http://cea.ceaj.org/EN/Y2014/V50/I6/106