Computer Engineering and Applications ›› 2019, Vol. 55 ›› Issue (19): 87-95. DOI: 10.3778/j.issn.1002-8331.1807-0153


Research on Entity Matching Based on Multiple Heterogeneous Data

WANG Lingyang, CHEN Qinkuang, SHOU Lidan, CHEN Ke   

  1. College of Computer Science and Technology, Zhejiang University, Hangzhou 310000, China
  2. Key Laboratory of Big Data Intelligent Computing of Zhejiang Province, Zhejiang University, Hangzhou 310027, China
  • Online: 2019-10-01  Published: 2019-09-30

多源异构数据的实体匹配方法研究

王凌阳,陈钦况,寿黎但,陈珂   

  1. 浙江大学 计算机科学与技术学院,杭州 310000
  2. 浙江大学 大数据智能计算重点实验室,杭州 310027

Abstract: In recent years, many solutions have been proposed for entity matching over multi-source heterogeneous data. However, most of them target entity matching under semantic frameworks such as RDFS or OWL and therefore lack generality. In addition, when more than two data sources are involved, most current methods reduce the task to multiple pairwise matching problems between two sources. This approach has high computational complexity and does not analyze the entities from a global, multi-source perspective. To address these issues, this paper proposes an entity matching method that uses the names, attributes, and context information commonly present in entities to construct multiple indexes, which shrinks the computation space while generating high-quality candidate sets. The paper also defines a method for calculating entity similarity, which effectively determines whether an entity pair matches. Based on the edge weights and mutual exclusion relations between entities, it further proposes an optimization algorithm based on graph partitioning that groups equivalent entities into the same set. Experiments are conducted on real-world data for the brand and person categories in the business domain, crawled from the Internet, and the results show that the method achieves good performance.

Key words: entity matching, knowledge base, multi-source heterogeneous data, graph partitioning
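
The three stages named in the abstract (index-based candidate generation, similarity scoring, and graph partitioning under mutual exclusion) can be illustrated with a minimal sketch. This is not the paper's algorithm: the record layout, the Jaccard-based similarity, the weights, the threshold, the greedy merge strategy, and the rule that two entities from the same source are mutually exclusive are all assumptions made for illustration, and every function name is hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def build_candidates(entities):
    """Blocking step: index entities by name tokens and attribute values.

    Only cross-source pairs that share at least one index key become
    candidates, so most of the quadratic pair space is never scored.
    """
    index = defaultdict(set)
    for eid, ent in entities.items():
        for tok in ent["name"].lower().split():
            index[("name", tok)].add(eid)
        for key, val in ent["attrs"].items():
            index[("attr", key, val)].add(eid)
    pairs = set()
    for ids in index.values():
        for a, b in combinations(sorted(ids), 2):
            if entities[a]["source"] != entities[b]["source"]:
                pairs.add((a, b))
    return pairs


def jaccard(x, y):
    """Jaccard overlap of two collections (0.0 when both are empty)."""
    x, y = set(x), set(y)
    return len(x & y) / len(x | y) if x | y else 0.0


def similarity(e1, e2, w_name=0.5, w_attr=0.3, w_ctx=0.2):
    """Weighted mix of name, attribute, and context overlap (assumed weights)."""
    name = jaccard(e1["name"].lower().split(), e2["name"].lower().split())
    attr = jaccard(e1["attrs"].items(), e2["attrs"].items())
    ctx = jaccard(e1["context"], e2["context"])
    return w_name * name + w_attr * attr + w_ctx * ctx


def partition(entities, threshold=0.5):
    """Greedy partitioning: merge clusters along the heaviest edges first.

    Assumed mutual exclusion rule: a cluster may hold at most one entity
    per source, so a merge that would violate this is skipped.
    """
    edges = sorted(
        ((similarity(entities[a], entities[b]), a, b)
         for a, b in build_candidates(entities)),
        reverse=True,
    )
    cluster_of = {eid: eid for eid in entities}
    members = {eid: {eid} for eid in entities}
    for weight, a, b in edges:
        if weight < threshold:
            break  # remaining edges are all weaker
        ca, cb = cluster_of[a], cluster_of[b]
        if ca == cb:
            continue
        sources_a = {entities[m]["source"] for m in members[ca]}
        sources_b = {entities[m]["source"] for m in members[cb]}
        if sources_a & sources_b:
            continue  # merge would put two same-source entities together
        for m in members[cb]:
            cluster_of[m] = ca
        members[ca] |= members.pop(cb)
    return sorted(sorted(m) for m in members.values())


# Hypothetical records from three sources, for illustration only.
EXAMPLE = {
    "a1": {"source": "A", "name": "Apple Inc", "attrs": {"country": "US"},
           "context": ["iphone", "mac"]},
    "b1": {"source": "B", "name": "Apple", "attrs": {"country": "US"},
           "context": ["iphone", "ipad"]},
    "b2": {"source": "B", "name": "Apple Records", "attrs": {"country": "UK"},
           "context": ["beatles"]},
    "c1": {"source": "C", "name": "Apple Inc", "attrs": {"country": "US"},
           "context": ["mac"]},
}

if __name__ == "__main__":
    print(partition(EXAMPLE))
```

Processing all sources in one graph, rather than running a separate matcher for each pair of sources, is what lets the exclusion constraint be enforced at the cluster level in this sketch.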

摘要: 近年来,针对多源异构数据的实体匹配问题,已经有诸多学者提出不同的解决方法。然而,这些方法几乎都集中在RDFS或OWL等语义框架下进行实体匹配,不具有通用性。此外,针对多数据源实体匹配问题,目前主流解决方式是将其转换为多组两两数据源的实体匹配问题,该种方式直接进行两两匹配的计算复杂度过高,且没有从多数据源全局的角度分析问题。从这些问题出发,提出了一种的实体匹配方法,利用了实体中普遍存在的名称、属性和上下文信息,构建多种索引,缩减计算空间同时生成高质量的候选集;还定义了度量实体相似度的计算方法,有效地判别了实体对是否匹配。并根据实体间边的权重以及互斥关系,提出一种基于图划分的优化算法,划分多个等价实体构成的集合。从互联网中抓取商业领域下品牌和人物类别的真实数据进行实验测试,实验结果表明该方法取得了良好的效果。

关键词: 实体匹配, 知识库, 多源异构数据, 图划分