Research of Entity Matching Based on Multiple Heterogeneous Data

doi:10.3778/j.issn.1002-8331.1807-0153

Abstract

Abstract: In recent years, many solutions have been proposed for entity matching over multi-source heterogeneous data. However, most of them focus on entity matching under semantic frameworks such as RDFS or OWL and therefore lack generality. In addition, when entities must be matched across multiple data sources, most current methods reduce the task to a series of pairwise matching problems between two sources. This approach has high computational complexity and does not analyze the problem from a global, multi-source perspective. To address these issues, this paper proposes an entity matching method that uses the names, attributes, and context information commonly present in entities to build multiple indexes, which shrinks the comparison space and yields high-quality candidate sets. It also defines a similarity measure for entities that effectively determines whether an entity pair matches. Based on the edge weights and mutual exclusion relations between entities, it further proposes an optimization algorithm based on graph partitioning that groups equivalent entities into the same set. Experiments on real-world data for the brand and person categories in the business domain show that the method achieves good results.

Keywords: entity matching, knowledge base, multi-source heterogeneous data, graph partitioning

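Abstract (translated from the Chinese): In recent years, many scholars have proposed solutions to the entity matching problem for multi-source heterogeneous data. However, almost all of these methods perform entity matching under semantic frameworks such as RDFS or OWL and are therefore not general. Moreover, the mainstream way to handle entity matching across multiple data sources is to convert it into several pairwise matching problems between two sources; direct pairwise matching has excessive computational complexity and does not analyze the problem from a global, multi-source perspective. Starting from these problems, an entity matching method is proposed that uses the names, attributes, and context information commonly present in entities to build multiple indexes, reducing the computation space while generating high-quality candidate sets. A method for measuring entity similarity is also defined, which effectively determines whether an entity pair matches. Based on the edge weights and mutual exclusion relations between entities, an optimization algorithm based on graph partitioning is proposed to partition the entities into sets of equivalent entities. Experiments are conducted on real data for the brand and person categories in the business domain, crawled from the Internet, and the results show that the method performs well.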

Keywords (translated from the Chinese): entity matching, knowledge base, multi-source heterogeneous data, graph partitioning
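
The abstract names three stages: index-based candidate generation, similarity scoring, and a graph-partitioning step that respects mutual exclusion. The Python sketch below illustrates how such a pipeline could fit together; it is not the authors' algorithm. Everything in it is an assumption made for illustration: the record fields (`id`, `source`, `name`, `attrs`), the toy data, the Jaccard-plus-attribute similarity, the 0.6 threshold, the rule that two entities from the same source are mutually exclusive, and the greedy merge that stands in for the paper's optimization algorithm. Context information is left out for brevity.

```python
from collections import defaultdict
from itertools import combinations

# Toy records from three hypothetical sources (all fields are assumptions).
ENTITIES = [
    {"id": "a1", "source": "A", "name": "Acme Corp", "attrs": {"country": "US"}},
    {"id": "b1", "source": "B", "name": "ACME Corporation", "attrs": {"country": "US"}},
    {"id": "b2", "source": "B", "name": "Globex", "attrs": {"country": "DE"}},
    {"id": "c1", "source": "C", "name": "Acme", "attrs": {"country": "US"}},
    {"id": "c2", "source": "C", "name": "Acme Foods", "attrs": {"country": "FR"}},
]


def index_keys(entity):
    """Yield index keys built from name tokens and attribute values."""
    for token in entity["name"].lower().split():
        yield ("name", token)
    for key, value in entity["attrs"].items():
        yield ("attr", key, value)


def candidate_pairs(entities):
    """Return only the pairs that share at least one index key."""
    buckets = defaultdict(set)
    for e in entities:
        for key in index_keys(e):
            buckets[key].add(e["id"])
    pairs = set()
    for ids in buckets.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


def similarity(a, b):
    """Illustrative score: mean of name-token Jaccard and attribute agreement."""
    ta, tb = set(a["name"].lower().split()), set(b["name"].lower().split())
    name_sim = len(ta & tb) / len(ta | tb) if ta | tb else 0.0
    shared = set(a["attrs"]) & set(b["attrs"])
    agree = sum(a["attrs"][k] == b["attrs"][k] for k in shared)
    attr_sim = agree / len(shared) if shared else 0.0
    return 0.5 * name_sim + 0.5 * attr_sim


def partition(entities, threshold=0.6):
    """Greedy stand-in for graph partitioning: merge along the heaviest edges,
    but never put two entities from the same source into one set."""
    by_id = {e["id"]: e for e in entities}
    cluster_of = {e["id"]: {e["id"]} for e in entities}
    edges = sorted(
        ((similarity(by_id[x], by_id[y]), x, y) for x, y in candidate_pairs(entities)),
        key=lambda t: (-t[0], t[1], t[2]),
    )
    for weight, x, y in edges:
        cx, cy = cluster_of[x], cluster_of[y]
        if weight < threshold or cx is cy:
            continue
        sources_x = {by_id[i]["source"] for i in cx}
        sources_y = {by_id[i]["source"] for i in cy}
        if sources_x & sources_y:
            continue  # assumed mutual exclusion: same source, different entity
        merged = cx | cy
        for i in merged:
            cluster_of[i] = merged
    unique = {frozenset(c) for c in cluster_of.values()}
    return sorted(sorted(c) for c in unique)


print(partition(ENTITIES))
```

On this toy data the indexes cut the 10 possible pairs down to 6 candidates, and the three Acme records from different sources end up in one set while "Acme Foods" and "Globex" remain on their own. A global optimization over the weighted graph, as the abstract describes, could resolve conflicts that this greedy pass decides purely by edge order.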

WANG Lingyang, CHEN Qinkuang, SHOU Lidan, CHEN Ke. Reserch of Entity Matching Based on Multiple Heterogenous Data[J]. Computer Engineering and Applications, 2019, 55(19): 87-95.

王凌阳，陈钦况，寿黎但，陈珂. 多源异构数据的实体匹配方法研究[J]. 计算机工程与应用, 2019, 55(19): 87-95.

