Computer Engineering and Applications, 2019, Vol. 55, Issue (23): 216-221. DOI: 10.3778/j.issn.1002-8331.1808-0247

Parallel Distributed Web Access Patterns Two-Layer Clustering

JIA Xiaoli, WU Rui, WU Siying   

  1. School of Mathematics and Computer Science, Shanxi Normal University, Linfen, Shanxi 041004, China
  • Online: 2019-12-01 Published: 2019-12-11

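Parallel Distributed Two-Layer Clustering of Web Access Patterns

JIA Xiaoli, WU Rui, WU Siying

  1. School of Mathematics and Computer Science, Shanxi Normal University, Linfen, Shanxi 041004, China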

Abstract: Web log mining analyzes user access patterns to infer users' level of interest. Most current web log mining is based on access frequency, and the information it yields is of limited value. The clustering technique proposed in this paper is instead based on access time. First, a fuzzy vector represents each user access pattern, recording whether the user visited a page and how long the user spent browsing it. The users' access sequences are then analyzed with different clustering methods. A two-layer clustering technique combining fuzzy rough k-means with the angle cosine is proposed, which reduces sensitivity to the initial cluster centers. A series of experiments demonstrates the feasibility of the method, and the results of the different clustering methods are verified and compared using the Davies-Bouldin index. Because the algorithm remains inefficient on large data sets, MapReduce is used to parallelize the two-layer clustering, improving its efficiency.

Key words: web mining, fuzzy rough clustering, web access patterns, angle cosine, parallel

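Abstract: Web log mining analyzes user access patterns to determine users' degree of interest. At present, most web log mining is frequency-based, and the information it mines is of little value. The proposed clustering technique is instead based on access time: a fuzzy vector represents each user's browsing pattern, recording whether the user browsed a page and how long the user stayed on it. The users' access sequences are then clustered with different clustering methods. Combining fuzzy rough k-means with the angle cosine yields a two-layer clustering technique that reduces sensitivity to the initial cluster centers, and a series of experiments demonstrates the feasibility of the method. The experiments also use the Davies-Bouldin index to verify and compare the results of the different clustering methods. Because the algorithm is still inefficient when the data volume is large, MapReduce is used to parallelize the two-layer clustering, which improves clustering efficiency.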

Key words: web mining, fuzzy rough clustering, web access patterns, angle cosine, parallel
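
The abstract names three building blocks: a time-based vector per user, the angle cosine as a similarity measure, and the Davies-Bouldin index for comparing clusterings. The following is a minimal sketch of those pieces only, not the paper's implementation. It does not cover the fuzzy rough k-means layer or the MapReduce parallelization. The function names, the share-of-total-time normalization in `access_vector`, and the Euclidean distance in `davies_bouldin` are assumptions made for illustration; the abstract does not specify them.

```python
import math


def access_vector(durations, pages):
    """Build a time-based access vector for one user.

    `durations` maps page -> browsing time in seconds (hypothetical input).
    Assumption: each entry is the page's share of the user's total time,
    so 0.0 means the page was not visited.
    """
    total = sum(durations.get(p, 0.0) for p in pages)
    if total == 0:
        return [0.0] * len(pages)
    return [durations.get(p, 0.0) / total for p in pages]


def angle_cosine(u, v):
    """Cosine of the angle between two access vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def davies_bouldin(clusters):
    """Davies-Bouldin index for a list of clusters (each a list of points).

    Lower values indicate more compact, better separated clusters.
    Requires at least two non-empty clusters with distinct centroids.
    """
    centroids = [[sum(col) / len(c) for col in zip(*c)] for c in clusters]
    # Scatter: mean Euclidean distance from each point to its cluster centroid.
    scatter = [
        sum(math.dist(p, m) for p in c) / len(c)
        for c, m in zip(clusters, centroids)
    ]
    k = len(clusters)
    worst = [
        max(
            (scatter[i] + scatter[j]) / math.dist(centroids[i], centroids[j])
            for j in range(k)
            if j != i
        )
        for i in range(k)
    ]
    return sum(worst) / k


pages = ["a", "b", "c"]
u = access_vector({"a": 30, "b": 30}, pages)
v = access_vector({"a": 10, "b": 10}, pages)
w = access_vector({"c": 5}, pages)
print(angle_cosine(u, v), angle_cosine(u, w))
```

Under the normalization assumed here, two users who split their time across pages in the same proportions get a cosine of 1 regardless of total session length, which is one reason a time-based representation can separate users that a pure visit-frequency count would treat alike.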