Computer Engineering and Applications ›› 2018, Vol. 54 ›› Issue (13): 88-92. DOI: 10.3778/j.issn.1002-8331.1706-0274

• Big Data and Cloud Computing •

A Jaccard similarity data space transformation method for mining organization aliases

尚玉玲1,曹建军2,李红梅1,刘  艺1   

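  1. College of Command Information Systems, PLA University of Science and Technology, Nanjing 210007
  2. The 63rd Research Institute, National University of Defense Technology, Nanjing 210007
  • Publication date: 2018-07-01 Release date: 2018-07-17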

Jaccard similarity based data space transform for organization alias mining

SHANG Yuling1, CAO Jianjun2, LI Hongmei1, LIU Yi1   

  1. College of Command Information Systems, PLA University of Science and Technology, Nanjing 210007, China
  2. The 63rd Institute, National University of Defense Technology, Nanjing 210007, China
  • Online: 2018-07-01 Published: 2018-07-17

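Abstract: To address the problem that a single organization entity corresponds to multiple organization names, an organization alias mining method based on Jaccard similarity data space transformation is proposed. An organization-author bipartite graph model is built from the affiliation relationships between organizations and authors; Jaccard similarity is used to measure the similarity between the author-name sets of two organization names; using the similarity matrix between organizations, the set-valued data are converted into numerical data; and organization aliases are mined effectively by computing the cosine similarity between the similarity vectors of the organization names. Finally, comparative experiments on real data verify the superiority of the method.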

Key words: entity resolution, organization alias, data space transformation, Jaccard similarity, cosine similarity, relational data

Abstract: To solve the problem that a single organization entity may have multiple names, a Jaccard Similarity based Data Space Transform for Organization Alias Mining (JS-DST-OAM) method is proposed. Based on the affiliation relationship between organizations and authors, an organization-author bipartite graph is built; Jaccard similarity is used to measure the similarity of two organization names through their author sets; based on the organization-organization similarity matrix, set-valued data are transformed into numerical data; and the cosine similarity of each pair of organization names is computed from their similarity vectors, which achieves the mining of organization aliases. Finally, comparative experiments on real data are used to verify the superiority of the method.

Key words: entity resolution, organization alias, data space transform, Jaccard similarity, cosine similarity, relationship data
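
The pipeline outlined in the abstract (author sets per organization name, a Jaccard similarity matrix, then cosine similarity between matrix rows) can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the toy organization names, and the author identifiers are hypothetical, and details such as how the paper treats the matrix diagonal or selects a decision threshold are not stated in the abstract, so they are assumed or left out here.

```python
from math import sqrt


def jaccard(a, b):
    """Jaccard similarity |a ∩ b| / |a ∪ b| of two sets (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cosine(u, v):
    """Cosine similarity of two equal-length numeric vectors (0.0 for a zero vector)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = sqrt(sum(x * x for x in u))
    norm_v = sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def alias_scores(org_authors):
    """Score every pair of organization names.

    org_authors maps an organization name to its set of author names,
    i.e. one side of the organization-author bipartite graph.
    """
    names = sorted(org_authors)
    # Data space transform: each name's author SET becomes a numeric VECTOR
    # holding its Jaccard similarity to every organization name.
    # Assumption: the diagonal entry (similarity to itself) is kept.
    sim_rows = {
        n: [jaccard(org_authors[n], org_authors[m]) for m in names]
        for n in names
    }
    scores = {}
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            scores[(a, b)] = cosine(sim_rows[a], sim_rows[b])
    return scores


# Hypothetical toy data: two spellings of one organization plus an unrelated one.
org_authors = {
    "Org A": {"a1", "a2", "a3"},
    "Org A (alt)": {"a1", "a2", "a4"},
    "Org B": {"b1", "b2"},
}

for pair, score in alias_scores(org_authors).items():
    print(pair, round(score, 3))
```

In this toy input the two spellings that share authors score far higher than either does against the unrelated organization, so a pair with a high score would be a candidate alias pair. The cut-off for declaring an alias would have to be chosen empirically.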