Computer Engineering and Applications, 2008, Vol. 44, Issue 23: 160-162. DOI: 10.3778/j.issn.1002-8331.2008.23.049

• Database, Signal and Information Processing •

Web user clustering based on Probabilistic Latent Semantic Analysis

YU Hui, JING Hai-feng

  1. Institute of Computer & Communication Engineering, China University of Petroleum, Dongying, Shandong 257061, China
  • Received: 2007-10-15 Revised: 2008-01-28 Online: 2008-08-11 Published: 2008-08-11
  • Contact: YU Hui

Abstract: Knowledge of Web user clusters can help improve the efficiency of information search and support personalized services. First, a session-page matrix is constructed by analyzing a large volume of log records. Then, based on information theory, both local and global weight contributions are taken into account when computing the weights in the session-page matrix. Using probabilistic latent semantic analysis (PLSA), the conditional probability relating latent variable Z to page P is transformed into the conditional probability relating latent variable Z to session S, and the transformed results serve as the basis for similarity calculation in the cluster analysis. The distance-based k-medoids algorithm is adopted to further improve clustering accuracy. Experimental results demonstrate both the effectiveness and the limitations of the algorithm.

Key words: Web log, preprocessing, Web user, Probabilistic Latent Semantic Analysis (PLSA), clustering
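
The pipeline outlined in the abstract (weighted session-page matrix, PLSA, k-medoids) might be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation. The abstract does not give the weighting formula, so log-entropy weighting is swapped in as one common information-theoretic scheme with a local and a global term. Euclidean distance between the P(z|s) vectors and the simple alternating variant of k-medoids are likewise assumptions. All function names and the toy visit counts are hypothetical.

```python
import numpy as np

def log_entropy_weights(counts):
    """Assumed scheme: local weight log(1 + count) times a global entropy weight per page."""
    counts = np.asarray(counts, dtype=float)
    n_sessions = counts.shape[0]  # must be at least 2
    local = np.log1p(counts)
    col_sums = counts.sum(axis=0, keepdims=True)
    p = np.divide(counts, col_sums, out=np.zeros_like(counts), where=col_sums > 0)
    safe = np.where(p > 0, p, 1.0)  # log(1) = 0, so empty cells contribute nothing
    global_w = 1.0 + (p * np.log(safe)).sum(axis=0) / np.log(n_sessions)
    return local * global_w

def plsa(weights, n_topics, n_iter=100, seed=0):
    """Fit PLSA by EM and return P(z|s), shape (n_sessions, n_topics)."""
    rng = np.random.default_rng(seed)
    n_s, n_p = weights.shape
    p_z_s = rng.random((n_s, n_topics))
    p_z_s /= p_z_s.sum(axis=1, keepdims=True)
    p_p_z = rng.random((n_topics, n_p))
    p_p_z /= p_p_z.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        # E-step: P(z|s,p) is proportional to P(z|s) * P(p|z)
        joint = p_z_s[:, :, None] * p_p_z[None, :, :]      # (session, topic, page)
        post = joint / (joint.sum(axis=1, keepdims=True) + 1e-12)
        # M-step: re-estimate both distributions from expected weighted counts
        expected = weights[:, None, :] * post
        p_p_z = expected.sum(axis=0)
        p_p_z /= p_p_z.sum(axis=1, keepdims=True) + 1e-12
        p_z_s = expected.sum(axis=2)
        p_z_s /= p_z_s.sum(axis=1, keepdims=True) + 1e-12
    return p_z_s

def k_medoids(dist, k, n_iter=100, seed=0):
    """Alternating k-medoids on a precomputed distance matrix."""
    rng = np.random.default_rng(seed)
    medoids = rng.choice(dist.shape[0], size=k, replace=False)
    for _ in range(n_iter):
        labels = np.argmin(dist[:, medoids], axis=1)
        new_medoids = medoids.copy()
        for c in range(k):
            members = np.flatnonzero(labels == c)
            if members.size:
                # new medoid: the member with the smallest total distance to its cluster
                costs = dist[np.ix_(members, members)].sum(axis=1)
                new_medoids[c] = members[np.argmin(costs)]
        if np.array_equal(new_medoids, medoids):
            break
        medoids = new_medoids
    return np.argmin(dist[:, medoids], axis=1), medoids

# Hypothetical toy data: rows are sessions, columns are pages, entries are visit counts.
counts = np.array([
    [3, 2, 0, 0],
    [2, 4, 0, 0],
    [4, 1, 1, 0],
    [0, 0, 3, 2],
    [0, 1, 2, 4],
    [0, 0, 5, 3],
])
weights = log_entropy_weights(counts)
p_z_s = plsa(weights, n_topics=2)
# Assumed similarity basis: Euclidean distance between sessions in P(z|s) space.
dist = np.linalg.norm(p_z_s[:, None, :] - p_z_s[None, :, :], axis=2)
labels, medoids = k_medoids(dist, k=2)
print(labels, medoids)
```

Clustering on P(z|s) rather than on raw page vectors means sessions are compared in the low-dimensional latent space, which is the role the abstract assigns to the transformed probabilities. Both EM and k-medoids depend on their random initialization, so results on real logs would typically be checked across several seeds.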