Computer Engineering and Applications, 2012, Vol. 48, Issue (27): 136-141.

• Database, Signal and Information Processing •

Matching Chinese entity names with multiple features

GONG Jun (巩军)

  1. Mobile Post-doctoral Station, School of Electronics Engineering and Computer Science, Peking University, Beijing 100871, China
  2. Postdoctoral Centre, Shenhua Group, Beijing 100011, China
  • Online: 2012-09-21  Published: 2012-09-24

Abstract: Accurate entity name matching is widely used in information system integration. In Chinese text, however, name variations and clerical errors make exact string matching unreliable, so matching methods that tolerate these variations and errors are needed. The similarity of two Chinese entity names is computed at three levels (character, word, and semantic), and the three scores are combined linearly so that each level contributes its own strengths. To exploit additional matching features, two machine learning methods are introduced: the first learns an optimized ranking and a best cut point from training data; the second uses a Support Vector Machine to decide whether two names refer to the same entity. Experiments on a real dataset of Chinese entity names show that these methods and features effectively improve matching performance.

Key words: string similarity, name disambiguation, name matching, machine learning
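
The linear combination and the learned cut point described in the abstract can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration, not the paper's actual method: the character-level measure (`difflib` ratio), the word-level measure (Jaccard overlap of pre-segmented words), the weights, the choice of F1 as the training criterion, and all function names are hypothetical. The semantic score is passed in as a plain number because the abstract does not say how it is computed, and the Support Vector Machine variant is not shown.

```python
from difflib import SequenceMatcher
from typing import Sequence, Tuple


def char_similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1] (difflib ratio, an assumed stand-in)."""
    return SequenceMatcher(None, a, b).ratio()


def word_similarity(words_a: Sequence[str], words_b: Sequence[str]) -> float:
    """Word-level Jaccard overlap of two pre-segmented names (assumed measure)."""
    set_a, set_b = set(words_a), set(words_b)
    if not set_a and not set_b:
        return 1.0  # two empty names are treated as identical
    return len(set_a & set_b) / len(set_a | set_b)


def combined_similarity(
    char_sim: float,
    word_sim: float,
    semantic_sim: float,
    weights: Tuple[float, float, float] = (0.5, 0.25, 0.25),  # illustrative weights
) -> float:
    """Linear combination of the three per-level similarity scores."""
    w_char, w_word, w_sem = weights
    return w_char * char_sim + w_word * word_sim + w_sem * semantic_sim


def best_cut_point(scores: Sequence[float], labels: Sequence[bool]) -> Tuple[float, float]:
    """Return the (threshold, F1) that maximizes F1 on labeled training pairs.

    A pair is predicted to be a match when its score is >= threshold.
    """
    best_threshold, best_f1 = 0.0, -1.0
    for threshold in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and not y)
        fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y)
        f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
        if f1 > best_f1:
            best_threshold, best_f1 = threshold, f1
    return best_threshold, best_f1
```

In use, each candidate name pair would be scored with `combined_similarity`, and `best_cut_point` would be run once on labeled pairs to pick the decision threshold applied to unseen pairs.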