Computer Engineering and Applications, 2010, Vol. 46, Issue (34): 111-114. DOI: 10.3778/j.issn.1002-8331.2010.34.034

• Database, Signal and Information Processing •

Integrated clustering algorithm based on hybrid of SOM and improved PSO for Web document

SONG Jian-jie (宋剑杰)1, WANG Wei (王伟)2

  1. Department of Electronics Technology and Information, Science and Technology College of Hunan, Changsha 410004, China
  2. School of Information Science and Engineering, Central South University, Changsha 410083, China
  • Received: 2010-04-20 Revised: 2010-06-30 Online: 2010-12-01 Published: 2010-12-01
  • Contact: SONG Jian-jie

Abstract: With the explosive growth of Web information on the Internet, current search engines fail to meet users' needs in many respects. Grouping similar Web documents into clusters reduces the search space, speeds up retrieval, and improves precision. This paper proposes an integrated clustering algorithm for Web documents that combines a Self-Organizing Map (SOM) for coarse clustering with an improved Particle Swarm Optimization (PSO) algorithm for fine clustering. First, each Web document is represented by its feature terms and their weights, using the vector space model. Second, the SOM algorithm performs coarse clustering on the document feature set, yielding a set of output weights. The improved PSO algorithm is then initialized with these weights and refines the coarse result as it evolves, producing the final Web document clusters. Simulation results show that the algorithm effectively improves the precision and recall of document retrieval and has practical value.
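
The three stages named in the abstract (vector space model, SOM coarse clustering, PSO refinement seeded with the SOM weights) can be illustrated with a minimal Python sketch. Everything below is an assumption made for illustration: the abstract does not give the authors' term weighting, SOM topology, fitness function, or the details of their PSO improvement. The sketch therefore substitutes smoothed TF-IDF weights, a 1-D SOM, a mean nearest-centroid distance as fitness, and a standard inertia-weight global-best PSO. All function names and parameter values are hypothetical.

```python
import math
from collections import Counter

import numpy as np


def tfidf_vectors(docs):
    """Vector space model: one L2-normalised TF-IDF row per document."""
    tokenised = [d.lower().split() for d in docs]
    vocab = sorted({t for doc in tokenised for t in doc})
    index = {t: i for i, t in enumerate(vocab)}
    df = Counter(t for doc in tokenised for t in set(doc))
    X = np.zeros((len(docs), len(vocab)))
    for r, doc in enumerate(tokenised):
        for t, tf in Counter(doc).items():
            # Smoothed IDF (an assumption) keeps every weight positive.
            X[r, index[t]] = tf * math.log(1.0 + len(docs) / df[t])
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    return X / np.where(norms == 0, 1.0, norms)


def som_coarse(X, k, epochs=50, lr0=0.5, rng=None):
    """Coarse clustering with a 1-D SOM of k units (requires k <= len(X))."""
    if rng is None:
        rng = np.random.default_rng(0)
    W = X[rng.choice(len(X), size=k, replace=False)].copy()
    grid = np.arange(k)
    sigma0 = max(k / 2.0, 1.0)
    for e in range(epochs):
        frac = e / epochs
        lr = lr0 * (1.0 - frac)                # decaying learning rate
        sigma = sigma0 * (1.0 - frac) + 1e-3   # shrinking neighbourhood
        for x in X[rng.permutation(len(X))]:
            bmu = np.argmin(np.linalg.norm(W - x, axis=1))
            h = np.exp(-((grid - bmu) ** 2) / (2.0 * sigma ** 2))
            W += lr * h[:, None] * (x - W)
    return W


def quantisation_error(centroids, X):
    """Assumed fitness: mean distance from each document to its nearest centroid."""
    d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return d.min(axis=1).mean()


def pso_fine(X, W0, n_particles=10, iters=60, w=0.7, c1=1.5, c2=1.5, rng=None):
    """Standard global-best PSO; each particle is a full set of k centroids."""
    if rng is None:
        rng = np.random.default_rng(0)
    # Swarm is initialised around the SOM output weights.
    pos = W0[None] + 0.05 * rng.standard_normal((n_particles,) + W0.shape)
    pos[0] = W0  # keep the unperturbed SOM solution in the swarm
    vel = np.zeros_like(pos)
    pbest = pos.copy()
    pbest_f = np.array([quantisation_error(p, X) for p in pos])
    g = pbest_f.argmin()
    gbest, gbest_f = pbest[g].copy(), pbest_f[g]
    for _ in range(iters):
        r1, r2 = rng.random(pos.shape), rng.random(pos.shape)
        vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)
        pos = pos + vel
        f = np.array([quantisation_error(p, X) for p in pos])
        better = f < pbest_f
        pbest[better] = pos[better]
        pbest_f[better] = f[better]
        g = pbest_f.argmin()
        if pbest_f[g] < gbest_f:
            gbest, gbest_f = pbest[g].copy(), pbest_f[g]
    return gbest, gbest_f


def cluster(docs, k):
    """Run the full pipeline; returns labels plus coarse and refined errors."""
    X = tfidf_vectors(docs)
    W = som_coarse(X, k)
    C, fine_err = pso_fine(X, W)
    labels = np.linalg.norm(X[:, None] - C[None], axis=2).argmin(axis=1)
    return labels, quantisation_error(W, X), fine_err
```

Because the unperturbed SOM weights are kept as one particle, the refined error returned by `pso_fine` can never exceed the coarse SOM error under this fitness. A call such as `labels, coarse, fine = cluster(docs, 2)` makes that comparison directly. This says nothing about the precision and recall gains reported in the paper, which depend on the authors' own data and PSO variant.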
