Computer Engineering and Applications ›› 2017, Vol. 53 ›› Issue (11): 172-177. DOI: 10.3778/j.issn.1002-8331.1604-0206

CRS-KNN text classification algorithm based on Canopy and rough set

YAO Binxiu1, NI Jiancheng2, YU Pingping1, CAO Bo1, LI Linlin1   

  1. College of Information Science and Engineering, Qufu Normal University, Rizhao, Shandong 276800, China
  2. College of Software, Qufu Normal University, Qufu, Shandong 273100, China
  • Online: 2017-06-01 Published: 2017-06-13

一种基于Canopy和粗糙集的CRS-KNN文本分类算法

姚彬修1,倪建成2,于苹苹1,曹  博1,李淋淋1   

  1. 曲阜师范大学 信息科学与工程学院,山东 日照 276800
  2. 曲阜师范大学 软件学院,山东 曲阜 273100

Abstract: To address the problem that the classification efficiency of the KNN algorithm gradually decreases as the training set size and feature dimension increase, this paper proposes CRS-KNN (Canopy Rough Set-KNN), a text classification algorithm based on Canopy clustering and rough set theory. The algorithm first clusters the text data to be processed with Canopy. Each resulting cluster is then partitioned into upper and lower approximations using rough set theory. Data in the lower approximation region need no further classification, whereas data in the boundary region, obtained as the difference between the upper and lower approximations, are assigned their final class by the KNN algorithm. Experimental results show that the proposed algorithm reduces the amount of data that KNN must process and improves classification efficiency. In addition, its accuracy, recall, and F1 value are somewhat higher than those of the traditional KNN algorithm and of a clustering-based improved KNN text classification algorithm.

Key words: Canopy clustering, rough set, k-Nearest Neighbor (KNN) algorithm, text classification
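
The pipeline described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not say how the upper and lower approximations are computed, so the sketch assumes that a sample falling inside exactly one canopy lies in that canopy's lower approximation and takes the canopy's majority label, while a sample covered by several canopies lies in the boundary region and is passed to KNN over the training samples of those canopies. The thresholds t1 and t2, all function names, the deterministic choice of canopy centers, and the toy 2-D points standing in for text feature vectors are hypothetical.

```python
import math
from collections import Counter


def canopy_centers(points, t1, t2):
    """Pick canopy centers; points within the tight threshold t2 of a
    center are dropped from the candidate list (standard Canopy idea).
    Centers are taken in input order here only to keep the sketch
    deterministic; Canopy usually picks them at random."""
    if not t1 > t2:
        raise ValueError("Canopy requires t1 > t2")
    candidates = list(points)
    centers = []
    while candidates:
        center = candidates.pop(0)
        centers.append(center)
        candidates = [p for p in candidates if math.dist(p, center) > t2]
    return centers


def build_canopies(train, t1, t2):
    """Return (center, member_indices) pairs; members lie within the
    loose threshold t1 of the center, so canopies may overlap."""
    centers = canopy_centers([x for x, _ in train], t1, t2)
    return [
        (c, [i for i, (x, _) in enumerate(train) if math.dist(x, c) <= t1])
        for c in centers
    ]


def knn(query, samples, k):
    """Plain majority-vote KNN with Euclidean distance."""
    nearest = sorted(samples, key=lambda s: math.dist(query, s[0]))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def classify(query, canopies, train, t1, k=3):
    """ASSUMPTION: one covering canopy -> lower approximation, no KNN;
    several covering canopies -> boundary region, KNN on their members."""
    hits = [members for c, members in canopies if math.dist(query, c) <= t1]
    if len(hits) == 1:
        labels = Counter(train[i][1] for i in hits[0])
        return labels.most_common(1)[0][0], "lower"
    if not hits:
        # Outside every canopy: fall back to KNN on the full training set.
        return knn(query, train, k), "outside"
    pool = sorted({i for members in hits for i in members})
    return knn(query, [train[i] for i in pool], k), "boundary"


# Toy data: two small groups of labelled 2-D points.
train = [
    ((0.0, 0.0), "a"), ((0.5, 0.0), "a"), ((0.0, 0.5), "a"),
    ((4.0, 0.0), "b"), ((4.5, 0.0), "b"), ((4.0, 0.5), "b"),
]
T1, T2 = 3.0, 1.5
canopies = build_canopies(train, T1, T2)

print(classify((0.2, 0.2), canopies, train, T1))  # ('a', 'lower')
print(classify((2.2, 0.0), canopies, train, T1))  # ('b', 'boundary')
```

In this sketch only the second query triggers any distance computation against training samples, which is the source of the efficiency gain the abstract claims: KNN runs only for boundary-region data, and only against the samples of the canopies involved.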

摘要: 针对KNN算法的分类效率随着训练集规模和特征维数的增加而逐渐降低的问题,提出了一种基于Canopy和粗糙集的CRS-KNN(Canopy Rough Set-KNN)文本分类算法。算法首先将待处理的文本数据通过Canopy进行聚类,然后对得到的每个类簇运用粗糙集理论进行上、下近似分割,对于分割得到的下近似区域无需再进行分类,而通过上、下近似作差所得的边界区域数据需要通过KNN算法确定其最终的类别。实验结果表明,该算法降低了KNN算法的数据计算规模,提高了分类效率。同时与传统的KNN算法和基于聚类改进的KNN文本分类算法相比,准确率、召回率和F1值都得到了一定的提高。

关键词: Canopy聚类, 粗糙集, K-最近邻(KNN)算法, 文本分类