计算机工程与应用 ›› 2011, Vol. 47 ›› Issue (11): 123-127.

• 数据库、信号与信息处理 • 上一篇    下一篇

基于聚类准则函数的改进K-means算法

张雪凤1,张桂珍2,刘 鹏1,2   

  1. 1.上海财经大学 信息管理与工程学院,上海 200433
    2.上海财经大学 继续教育学院,上海 200080
  • 收稿日期:1900-01-01 修回日期:1900-01-01 出版日期:2011-04-11 发布日期:2011-04-11

Improved K-means algorithm based on clustering criterion function

ZHANG Xuefeng1,ZHANG Guizhen2,LIU Peng1,2   

  1. 1.School of Information Management and Engineering,Shanghai University of Finance & Economics,Shanghai 200433,China
    2.School of Continuing Education,Shanghai University of Finance & Economics,Shanghai 200080,China
  • Received:1900-01-01 Revised:1900-01-01 Online:2011-04-11 Published:2011-04-11

摘要: K-means算法所使用的聚类准则函数是将数据集中各个簇的误差平方值直接相加而得到的,不能有效处理簇的密度不均且大小差异较大的数据集。为此,将K-means算法的聚类准则函数定义为加权的簇内标准差之和,权重为簇内数据对象数占总数目的比例。同时,调整了传统K-means算法将数据对象重新分配给簇的方法,采用一个数据对象到中心点的加权距离代替传统K-means算法中的距离,将数据对象分配给使加权距离最小的中心点所在的簇。实验结果表明,针对模拟数据集的聚类,改进K-means算法可以明显减少大而稀的簇中数据对象被错误地分配到相邻的小而密簇的可能性,改善了聚类的质量;针对UCI数据集的聚类,改进算法使得各个簇更为紧凑,从而验证了改进K-means算法的有效性。

关键词: K-means算法, 簇, 聚类准则函数

Abstract: The criterion function used in K-means algorithm is the sum of the squared error,which may not work well for dataset containing clusters with different sizes and densities.In this study,the criterion function is improved by being defined as the sum of the weighted standard deviation,and the weight is the ratio of the number of points in each cluster to the whole points.The way each point being assigned to the centroid in the K-means algorithm is also modified:Instead of being assigned to the closest centroid,each point is assigned to the centroid which has minimum weighted distance.Experiments on simulation datasets show that the improved K-means algorithm significantly enhances the clustering quality by reducing the probability of misclassifying the points of big sparse clusters to its neighboring compact clusters.Experiments on UCI datasets show that the improved algorithm can obtain more compact cluster.Therefore,the improved K-means algorithm is effective.

Key words: K-means algorithm, cluster, clustering criterion function