Fast hybrid clustering for Web documents

doi:10.3778/j.issn.1002-8331.2010.22.005

Computer Engineering and Applications, 2010, Vol. 46, Issue (22): 12-15. DOI: 10.3778/j.issn.1002-8331.2010.22.005

• Doctoral Forum •

YANG Rui-long¹, ZHU Qing-sheng¹, XIE Hong-tao¹,²

1. College of Computer Science, Chongqing University, Chongqing 400044, China
2. Logistical Engineering University, Chongqing 400016, China

Received: 2010-04-02 Revised: 2010-05-28 Online: 2010-08-01 Published: 2010-08-01
Contact: YANG Rui-long

Abstract

A fast hybrid algorithm for clustering Web documents is proposed, which uses the suffix tree clustering (STC) algorithm to optimize the initial center values of the K-means algorithm. First, the Web document set is clustered with STC, and the initial center values are extracted from the resulting clusters. Second, each internal node of the suffix tree is mapped to a feature term in an M-dimensional vector space model (VSM), and the weight of each feature term is computed using TF-IDF extended to phrases. Finally, K-means generates the final clustering. Evaluation experiments indicate that the hybrid algorithm clusters documents more effectively than ordinary K-means and STC, while remaining as fast as those algorithms.

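Abstract (translated from the Chinese): A fast hybrid clustering method, STK-means, is proposed, which uses the suffix tree clustering algorithm to optimize the initial values of K-means document clustering. The method first builds a suffix tree model of the document set, then uses suffix tree clustering to identify initial clusters and extract the initial center values for K-means. Next, the nodes of the suffix tree model are mapped to feature terms in an M-dimensional vector space model, and the phrase-based feature values of each document vector are computed with the TF-IDF scheme. Finally, K-means produces the clustering result. Experimental results show that the method outperforms both the traditional K-means algorithm and the suffix tree clustering algorithm while retaining their fast clustering speed.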
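
The three steps in the abstract (find initial clusters from shared phrases, weight phrases with TF-IDF, run K-means from the extracted centers) can be illustrated with a minimal Python sketch. It is not the authors' implementation, and several parts are assumptions made for brevity: word n-grams shared by at least two documents stand in for the internal nodes of a suffix tree, the base-cluster score (cluster size times phrase length) and the greedy choice of disjoint clusters replace STC's scoring and merging steps, and cosine similarity is used for assignment. All function names are hypothetical.

```python
import math
from collections import Counter, defaultdict


def phrases(doc, max_len=3):
    """Yield word n-grams of 1..max_len words (assumed stand-in for suffix-tree node labels)."""
    words = doc.lower().split()
    for n in range(1, max_len + 1):
        for i in range(len(words) - n + 1):
            yield " ".join(words[i:i + n])


def base_clusters(docs, max_len=3):
    """Map each phrase shared by at least two documents to the indices of those documents."""
    members = defaultdict(set)
    for i, doc in enumerate(docs):
        for p in phrases(doc, max_len):
            members[p].add(i)
    return {p: ids for p, ids in members.items() if len(ids) >= 2}


def tfidf_vectors(docs, clusters, max_len=3):
    """Phrase-based TF-IDF: one sparse {phrase: weight} dict per document."""
    n = len(docs)
    vectors = []
    for doc in docs:
        tf = Counter(p for p in phrases(doc, max_len) if p in clusters)
        vectors.append({p: f * math.log(n / len(clusters[p])) for p, f in tf.items()})
    return vectors


def centroid(vectors):
    """Component-wise mean of sparse vectors."""
    total = defaultdict(float)
    for v in vectors:
        for p, w in v.items():
            total[p] += w
    return {p: w / len(vectors) for p, w in total.items()}


def cosine(a, b):
    """Cosine similarity of two sparse vectors (0.0 if either is all zeros)."""
    dot = sum(w * b.get(p, 0.0) for p, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def seed_centers(clusters, vectors, k):
    """Greedily take up to k disjoint base clusters, ranked by size * phrase length (assumed score)."""
    ranked = sorted(clusters.items(),
                    key=lambda kv: (-len(kv[1]) * len(kv[0].split()), kv[0]))
    seeds, covered = [], set()
    for _, ids in ranked:
        if len(seeds) < k and not ids & covered:
            seeds.append(centroid([vectors[i] for i in ids]))
            covered |= ids
    return seeds


def kmeans(vectors, centers, max_iter=20):
    """Ordinary K-means loop with cosine similarity, started from the given centers."""
    labels = None
    for _ in range(max_iter):
        new_labels = [max(range(len(centers)), key=lambda c: cosine(v, centers[c]))
                      for v in vectors]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        for c in range(len(centers)):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centers[c] = centroid(members)
    return labels


# Toy corpus (invented for illustration only).
docs = [
    "suffix tree clustering groups web search results",
    "suffix tree clustering uses shared phrases",
    "k-means needs good initial cluster centers",
    "k-means converges from initial cluster centers",
]
clusters = base_clusters(docs)
vectors = tfidf_vectors(docs, clusters)
labels = kmeans(vectors, seed_centers(clusters, vectors, k=2))
print(labels)
```

Because the seeds come from groups of documents that already share phrases, K-means starts near a sensible partition instead of a random one, which is the motivation the abstract gives for the hybrid. The sketch assumes the corpus yields at least one shared phrase; with no base clusters there are no seeds to start from.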

CLC Number: TP391

YANG Rui-long, ZHU Qing-sheng, XIE Hong-tao. Fast hybrid clustering for Web documents[J]. Computer Engineering and Applications, 2010, 46(22): 12-15.

