Computer Engineering and Applications ›› 2016, Vol. 52 ›› Issue (5): 61-64.

• Big Data and Cloud Computing •

Research on Algorithms for Mining Web Users’ Access Patterns

武  健   

  1. Department of Computer Engineering, Beijing Information Technology College, Beijing 100018
  • Publication date: 2016-03-01  Release date: 2016-03-17

Methods for data mining of Internet users accessing and browsing pattern

WU Jian   

  1. Department of Computer Engineering, Beijing Information Technology College, Beijing 100018, China
  • Online: 2016-03-01  Published: 2016-03-17

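Abstract (Chinese version): College and university websites are attracting growing attention from prospective students and their parents. To analyze and understand users’ access patterns and how their access hotspots change over time, a hybrid clustering algorithm is designed that combines the hidden Markov model with a hierarchical clustering strategy. The time series data are mapped into a likelihood space based on the hidden Markov model, where the likelihood is characterized by the symmetric KL (Kullback-Leibler) distance. A symmetric KL transition matrix is constructed, and hierarchical clustering is then used to cluster the user access patterns. Applying the method to the web logs of visits by prospective students and their parents to our college’s official website yields three user access patterns, which demonstrates the feasibility and effectiveness of the method.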

Keywords (Chinese version): log data, data mining, hidden Markov model, clustering

Abstract: Prospective students and their parents visit the official websites of colleges and universities more and more frequently, so understanding these users’ access purposes and browsing behaviors is useful for improving such websites. This paper combines the hidden Markov model with hierarchical clustering to mine dynamic web log data. The original data are transformed into a probabilistic space using an extension of the hidden Markov model and the symmetric Kullback-Leibler (SKL) distance. Hierarchical clustering is then applied to the SKL confusion matrix to cluster the time series data. The method is verified on two months of dynamic log data recording Internet users’ accessing and browsing behaviors during the period when college candidates and their parents were looking for a suitable university. The results show that there are two patterns of user behavior, which indicates that the method is feasible and effective.

Key words: log data, data mining, hidden Markov model, clustering
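
The pipeline summarized in the abstract (HMM likelihoods, an SKL distance matrix, hierarchical clustering) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the author's implementation: the abstract gives no formulas, so the distance below is one common symmetrized log-likelihood form, average linkage is an assumed choice, and the names `log_likelihood`, `skl_matrix`, `average_linkage`, `SEQS` and `MODELS` are hypothetical. HMM training (for instance with Baum-Welch) is omitted, and the model parameters are hand-set stand-ins for trained models.

```python
import math

def log_likelihood(obs, start, trans, emit):
    """Scaled forward algorithm: return log P(obs | model) for a discrete HMM."""
    n = len(start)
    alpha = [start[s] * emit[s][obs[0]] for s in range(n)]
    total = 0.0
    for t in range(len(obs)):
        if t > 0:
            alpha = [sum(alpha[r] * trans[r][s] for r in range(n)) * emit[s][obs[t]]
                     for s in range(n)]
        scale = sum(alpha)                 # equals P(obs[t] | obs[:t])
        total += math.log(scale)
        alpha = [a / scale for a in alpha]
    return total

def skl_matrix(seqs, models):
    """Symmetrized KL-style distance between sequences (assumed form).

    L[m][s] is the per-symbol log-likelihood of sequence s under the model
    attached to sequence m. With hand-set rather than trained models, small
    negative entries are possible.
    """
    n = len(seqs)
    L = [[log_likelihood(seqs[s], *models[m]) / len(seqs[s]) for s in range(n)]
         for m in range(n)]
    return [[0.5 * ((L[i][i] - L[j][i]) + (L[j][j] - L[i][j])) for j in range(n)]
            for i in range(n)]

def average_linkage(dist, k):
    """Agglomerative clustering with average linkage, stopping at k clusters."""
    clusters = [[i] for i in range(len(dist))]

    def gap(a, b):
        return sum(dist[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > k:
        _, a, b = min((gap(clusters[a], clusters[b]), a, b)
                      for a in range(len(clusters))
                      for b in range(a + 1, len(clusters)))
        clusters[a] = clusters[a] + clusters[b]   # merge the closest pair
        del clusters[b]
    return sorted(sorted(c) for c in clusters)

# Hypothetical toy data: symbols 0 and 1 stand for two page categories.
SEQS = [
    [0, 0, 0, 1, 0, 0, 0, 0, 1, 0],
    [0, 0, 1, 0, 0, 0, 0, 0, 0, 0],
    [1, 1, 1, 0, 1, 1, 1, 1, 0, 1],
    [1, 1, 0, 1, 1, 1, 1, 1, 1, 1],
]
START = [0.6, 0.4]
TRANS = [[0.7, 0.3], [0.4, 0.6]]
# One (start, trans, emit) triple per sequence; only the emissions differ.
MODELS = [
    (START, TRANS, [[0.90, 0.10], [0.80, 0.20]]),
    (START, TRANS, [[0.85, 0.15], [0.80, 0.20]]),
    (START, TRANS, [[0.10, 0.90], [0.20, 0.80]]),
    (START, TRANS, [[0.15, 0.85], [0.20, 0.80]]),
]

D = skl_matrix(SEQS, MODELS)
print(average_linkage(D, k=2))   # [[0, 1], [2, 3]]
```

In practice each model would be fitted to its own access sequence before the distance matrix is built, and the number of clusters would be chosen from the resulting dendrogram rather than fixed in advance.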