Computer Engineering and Applications ›› 2015, Vol. 51 ›› Issue (6): 114-119.


New text clustering algorithm based on CF tree and KNN graph partition

YANG Xiaofu, QI Jiandong, JI Pengfei, ZHU Wenfei   

  1. School of Information, Beijing Forestry University, Beijing 100083, China
  • Online:2015-03-15 Published:2015-03-13

A text clustering algorithm combining a CF tree with KNN graph partitioning

仰孝富,齐建东,吉鹏飞,朱文飞   

  1. 北京林业大学 信息学院,北京 100083

Abstract: To improve the quality of text clustering and to remedy the weaknesses of traditional clustering algorithms in parameter setting and stability, a new text clustering algorithm, TCBIBK (a Text Clustering algorithm Based on Improved BIRCH and K-nearest neighbor), is presented. TCBIBK uses the BIRCH clustering algorithm as its prototype. During clustering, besides analyzing the distance between text objects and clusters, TCBIBK also analyzes the distance between pairs of clusters, actively merges or splits clusters, and sets a dynamic threshold. Combined with the KNN classification algorithm, TCBIBK improves stability while preserving good clustering efficiency. When applied to text clustering, TCBIBK can improve the clustering results. The results of comparative experiments show that the algorithm greatly improves the validity and stability of text clustering.

Key words: text clustering, vector space model, Balanced Iterative Reducing and Clustering using Hierarchies(BIRCH), K-nearest neighbor
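
The abstract names BIRCH as the prototype but gives no implementation details, so the following is only a minimal sketch of the standard BIRCH building block it refers to: the clustering feature (CF), a summary (point count, linear sum, sum of squared norms) that can be merged by simple addition. The helper names `insert_point` and `merge_close_clusters`, the flat list used in place of a CF tree, and the fixed `threshold` are illustrative assumptions; TCBIBK's dynamic threshold, splitting step, and KNN stage are not shown and are not reconstructed here.

```python
import math
from dataclasses import dataclass


@dataclass
class CF:
    """Clustering feature: point count, linear sum, and sum of squared norms."""
    n: int
    ls: list
    ss: float

    @classmethod
    def from_point(cls, p):
        return cls(1, list(p), sum(x * x for x in p))

    def merged(self, other):
        # CF additivity: merging two clusters just adds their summaries.
        return CF(self.n + other.n,
                  [a + b for a, b in zip(self.ls, other.ls)],
                  self.ss + other.ss)

    def centroid(self):
        return [x / self.n for x in self.ls]

    def radius(self):
        # Root-mean-square distance of the members to the centroid.
        c2 = sum(x * x for x in self.centroid())
        return math.sqrt(max(self.ss / self.n - c2, 0.0))


def centroid_distance(a, b):
    """Euclidean distance between the centroids of two clusters."""
    return math.dist(a.centroid(), b.centroid())


def insert_point(clusters, point, threshold):
    """Absorb the point into the nearest cluster if the merged radius stays
    within the threshold; otherwise start a new cluster (hypothetical helper)."""
    new = CF.from_point(point)
    if clusters:
        i = min(range(len(clusters)),
                key=lambda k: centroid_distance(clusters[k], new))
        candidate = clusters[i].merged(new)
        if candidate.radius() <= threshold:
            clusters[i] = candidate
            return
    clusters.append(new)


def merge_close_clusters(clusters, threshold):
    """Repeatedly merge any two clusters whose centroids lie within the
    threshold (an assumed stand-in for a cluster-to-cluster distance check)."""
    merged = True
    while merged:
        merged = False
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                if centroid_distance(clusters[i], clusters[j]) <= threshold:
                    clusters[i] = clusters[i].merged(clusters.pop(j))
                    merged = True
                    break
            if merged:
                break
```

For text data, each point would be a document vector from the vector space model (for example TF-IDF weights). Because a CF stores only sums, both the point-to-cluster test and the cluster-to-cluster test run without revisiting the original documents.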

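Abstract: To improve text clustering quality and to address the shortcomings of traditional clustering algorithms in areas such as parameter setting and stability, a new text clustering algorithm, TCBIBK (a Text Clustering algorithm Based on Improved BIRCH and K-nearest neighbor), is proposed. The algorithm takes the BIRCH clustering algorithm as its prototype. During clustering, in addition to evaluating the distance between a text object and a cluster, it also evaluates the distance between clusters, actively merges or splits clusters, and sets a dynamic threshold. It is further combined with the KNN classification algorithm to improve clustering stability while maintaining good clustering efficiency. Applied to text clustering, TCBIBK can improve the clustering results. Comparative experiments show that both the validity and the stability of the clustering are considerably improved.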

Key words: text clustering, vector space model, BIRCH (a traditional and highly efficient hierarchical clustering algorithm), K-nearest neighbor