Computer Engineering and Applications ›› 2008, Vol. 44 ›› Issue (12): 139-141.

• Database, Signal and Information Processing •

Improved CwkNN classification algorithm for un-balanced data

LIAO Zhi-fang 1, CHEN Yu-zhou 1, FAN Xiao-ping 1, QU Zhi-hua 1,2

  1. School of Information Science and Engineering, Central South University, Changsha 410075, China
  2. Department of Electrical and Computer Engineering, University of Central Florida, Orlando, FL 32816, USA
  • Received: 2007-10-23 Revised: 2007-12-13 Online: 2008-04-21 Published: 2008-04-21
  • Contact: LIAO Zhi-fang

面向非平衡混合数据的改进计数最近邻分类算法

廖志芳1,陈宇宙1,樊晓平1,瞿志华1,2   

  1. 中南大学 信息科学与工程学院,长沙 410075
  2. 美国中佛罗里达大学 电子与计算机工程系,奥兰多 FL 32816
  • 通讯作者: 廖志芳

Abstract: Unbalanced data are datasets in which the number of samples differs across classes, sometimes by a large margin. Such sample sets may also mix data types, such as ordinal and nominal attributes, and both properties should be taken into account when processing them. Although CwkNN handles mixed data well, it does not handle class imbalance well. Building on CwkNN, this paper proposes two classification algorithms, based on Overall Density and K-Local Density respectively, that increase the weight of minority-class samples in order to improve classification accuracy. Experiments show that the Overall Density algorithm achieves nearly the same classification accuracy as CwkNN, while the K-Local Density algorithm improves accuracy to some extent.

Key words: Counting-based weighted kNN algorithm (CwkNN), un-balanced data, overall density, K-local density
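
The abstract does not give the definitions of Overall Density or K-Local Density, so the following Python sketch illustrates only the general idea it describes: giving minority-class samples a larger weight in a nearest-neighbour vote over mixed-type data. The inverse-class-frequency weight, the simple mixed distance, and the names `mixed_distance` and `weighted_knn_predict` are all assumptions made for illustration; they are not the paper's formulas and not CwkNN itself.

```python
from collections import Counter


def mixed_distance(a, b, nominal_idx):
    """Hypothetical mixed-type distance (an assumption, not CwkNN's measure):
    0/1 mismatch for nominal attributes, absolute difference for the others."""
    total = 0.0
    for i, (x, y) in enumerate(zip(a, b)):
        if i in nominal_idx:
            total += 0.0 if x == y else 1.0
        else:
            total += abs(x - y)
    return total


def weighted_knn_predict(train_X, train_y, query, k, nominal_idx=frozenset()):
    """kNN vote in which each neighbour's vote is scaled by n / class_size.

    The inverse class frequency is an assumed stand-in for a density-based
    weight; it makes votes from rarer classes count for more.
    """
    n = len(train_y)
    class_counts = Counter(train_y)
    class_weight = {c: n / count for c, count in class_counts.items()}

    # Indices of the k training samples closest to the query.
    nearest = sorted(
        range(n),
        key=lambda i: mixed_distance(train_X[i], query, nominal_idx),
    )[:k]

    # Accumulate weighted votes per class and return the heaviest class.
    votes = Counter()
    for i in nearest:
        votes[train_y[i]] += class_weight[train_y[i]]
    return max(votes, key=votes.get)


# Toy data: 8 majority-class samples ("A") and 2 minority-class samples ("B").
# Each sample has one numeric attribute and one nominal attribute (index 1).
train_X = [(1.0, "x"), (1.1, "x"), (1.2, "x"), (2.0, "x"),
           (3.0, "x"), (4.0, "x"), (5.0, "x"), (6.0, "x"),
           (0.9, "y"), (1.05, "y")]
train_y = ["A"] * 8 + ["B"] * 2

prediction = weighted_knn_predict(train_X, train_y, (1.0, "y"), k=5,
                                  nominal_idx={1})
```

In this toy set, the five nearest neighbours of the query are two "B" samples and three "A" samples, so an unweighted majority vote would pick "A"; with the assumed weights (1.25 per "A" vote, 5.0 per "B" vote) the minority class wins.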

摘要: 非平衡混合数据是指数据集中类别不同的样本在数量上存在着较大的差别;同时样本数据集中的数据是非单一的数据类型,即它包含多种类型,如数值型和文本型数据。在对混合型数据的分类算法中,计数最近邻分类算法(CwkNN)可以有效地对混合型数据进行分类,但该算法对数据的非平衡性处理效果不是太理想。在CwkNN的基础之上结合数据的非平衡性特点提出了基于全局密度和K-密度的分类算法来提高少数类样本的权重,从而提高数据的分类精确度。实验结果表明,全局密度分类算法和CwkNN算法的分类精度相当,K-局部密度分类算法在一定程度上提高了分类的精度。

关键词: 计数最近邻分类算法, 非平衡数据, 全局密度, K-密度