Computer Engineering and Applications ›› 2017, Vol. 53 ›› Issue (21): 72-76. DOI: 10.3778/j.issn.1002-8331.1605-0266

• Big Data and Cloud Computing •

Record pairs partition for entity resolution based on neighborhood rough set

ZHOU Xing¹, DIAO Xingchun², CAO Jianjun²

  1. School of Command Information System, PLA University of Science and Technology, Nanjing 210007, China
  2. Nanjing Telecommunication Technology Institute, Nanjing 210007, China
  • Online: 2017-11-01 Published: 2017-11-15

Abstract: Existing entity resolution approaches differ in accuracy and efficiency. This paper separates easy-to-resolve (normal) record pairs from hard-to-resolve (ambiguous) ones, so that a different resolution approach can subsequently be applied to each group. For the record pairs to be partitioned, a variable precision neighborhood rough set is used to compute the lower and upper approximations of the similar record pairs and of the dissimilar record pairs separately. This yields the lower and upper approximations of all record pairs and the corresponding boundary region. Record pairs in the boundary region are regarded as ambiguous, and the rest as normal. The effect of the inclusion degree threshold and the distance threshold of the variable precision neighborhood rough set on the partition is analyzed. Experiments compare the accuracy obtained on the normal, ambiguous, and original record pairs when classifying with a similarity threshold and with KNN, demonstrating the effectiveness of the partition.

Key words: entity resolution, record pair partition, rough set
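
The partition described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation and rests on several assumptions that the abstract does not state: each record pair is represented by a vector of attribute similarities, labels (1 for similar, 0 for dissimilar) are available for the pairs being partitioned, neighborhoods use Euclidean distance, and the approximations follow one common formulation of the variable precision neighborhood rough set (lower approximation when the inclusion degree is at least `k`, upper approximation when it is at least `1 - k`). The function name `partition_record_pairs`, the parameters `delta` and `k`, and the toy data are all hypothetical.

```python
from math import dist  # Euclidean distance, Python 3.8+


def partition_record_pairs(sim_vectors, labels, delta=0.2, k=0.8):
    """Split record pairs into (normal, ambiguous) lists of indices.

    sim_vectors: one attribute-similarity vector per record pair (assumed input).
    labels: 1 for similar pairs, 0 for dissimilar pairs (assumed to be known).
    delta: distance threshold that defines each pair's neighborhood.
    k: inclusion degree threshold, expected in the range 0.5 <= k <= 1.
    """
    n = len(sim_vectors)
    normal, ambiguous = [], []
    for i in range(n):
        # Neighborhood of pair i: every pair within distance delta (i included).
        neigh = [j for j in range(n)
                 if dist(sim_vectors[i], sim_vectors[j]) <= delta]
        in_lower = in_upper = False
        for cls in (0, 1):
            # Inclusion degree: fraction of the neighborhood belonging to cls.
            inclusion = sum(labels[j] == cls for j in neigh) / len(neigh)
            in_lower = in_lower or inclusion >= k
            in_upper = in_upper or inclusion >= 1 - k
        # Boundary region = upper approximation minus lower approximation.
        if in_upper and not in_lower:
            ambiguous.append(i)
        else:
            normal.append(i)
    return normal, ambiguous


# Toy data: two clean clusters and two close pairs with conflicting labels.
pairs = [(0.9, 0.9), (0.95, 0.9), (0.9, 0.95),
         (0.1, 0.1), (0.15, 0.1), (0.1, 0.15),
         (0.5, 0.5), (0.52, 0.5)]
labels = [1, 1, 1, 0, 0, 0, 1, 0]

normal, ambiguous = partition_record_pairs(pairs, labels, delta=0.2, k=0.8)
print(normal, ambiguous)
```

On this toy data the last two pairs share a neighborhood with mixed labels, so neither class reaches the inclusion threshold and both pairs fall into the boundary region. Lowering `k` or shrinking `delta` reduces the boundary region, which is the kind of threshold sensitivity the abstract says the paper analyzes.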