计算机工程与应用 ›› 2019, Vol. 55 ›› Issue (1): 76-83.DOI: 10.3778/j.issn.1002-8331.1710-0230

• 大数据与云计算 • 上一篇    下一篇

TSCAN:利用并行策略改进的图结构聚类算法

陈亚中1,李振军1,李荣华1,毛  睿1,乔少杰2   

  1. 1.深圳大学 计算机与软件学院,广东 深圳 518060
    2.成都信息工程大学,成都 610103
  • 出版日期:2019-01-01 发布日期:2019-01-07

Improved Graph Structure Clustering Algorithm by Using Parallel Strategy

CEHN Yazhong1, LI Zhenjun1, LI Ronghua1, MAO Rui1, QIAO Shaojie2   

  1. 1.College of Computer Science & Software Engineering, Shenzhen University, Shenzhen, Guandong 518060, China
    2.Chengdu University of Information Technology, Chengdu 610103, China
  • Online:2019-01-01 Published:2019-01-07

摘要: 近年来,图数据聚类在学术界引起了广泛的关注,许多优秀的聚类方法,如模块度优化算法、谱聚类,以及基于密度的聚类算法在图数据上取得了很好的效果。SCAN是一种著名的基于密度的图聚类算法,该算法不仅能够找出图中的聚类,而且还能够发现不同聚类间的Hub节点,以及图中的离群点。然而,该算法存在两方面的局限性:首先,在大规模图数据上,该算法需要耗费大量的时间用于计算图中每条边的结构相似性;另一方面,该算法存在两个参数[ε]和[μ],并且对这两个参数比较敏感。为了解决其局限性,提出了一种基于OpenMP的并行算法来求解节点相似性,并且提出了两种有效的负载均衡策略;其次,提出一种基于三角形的新型图结构聚类算法TSCAN。该模型能够有效降低算法对参数的敏感性,而且还能够发现重叠以及更稠密的社区。在多个大规模数据集上实验发现,基于多核的并行算法能够达到近乎线性的加速比,而且TSCAN算法对参数不敏感,能有效发现重叠社区。

关键词: 社区探测, 结构聚类算法, 重叠社区, OpenMP, 并行算法

Abstract: Recently, graph clustering has been attracted much attention in the research community. There are a number of clustering methods, such as modularity optimization algorithm, spectral clustering algorithm, as well as density-based clustering algorithm, that are proved to be useful for graph data. Among those algorithms, the SCAN algorithm is a well-known density-based algorithm for graph data. SCAN is not only able to find clusters, but it also identifies hub nodes and outliers. The SCAN algorithm, however, has two limitations. Firstly, it is very costly to compute the structural similarity for each edge in a massive graph. Secondly, SCAN is sensitive to its parameters [ε] and [μ]. To overcome these two limitations, this paper proposes an OpenMP-based parallel algorithm with two carefully-designed load-balancing techniques to compute the similarities efficiently, and a novel triangle-based graph structural clustering algorithm, called TSCAN is proposed. The striking feature of the TSCAN algorithm is that it is not sensitive to the parameters, and it is also capable of finding overlapping and dense communities. Finally, this paper conducts extensive experiments over several real-world datasets to evaluate the proposed algorithms. The results indicate that parallel implementation can achieve near-linear speedup, and the TSCAN algorithm is robust to its parameters and can identify overlapping communities.

Key words: community detection, SCAN algorithm, overlapping community, openMP, parallel algorithm