Fusion Clustering Algorithm Based on Multi-Prototypes Using Density Peaks

doi:10.3778/j.issn.1002-8331.2010-0407

Abstract:

The classical K-Means algorithm cannot effectively cluster non-spherical datasets, and it requires the number of clusters to be specified in advance. The SMCL (Self-adaptive Multiprototype-based Competitive Learning) algorithm is an improvement on K-Means that introduces a Multi-Prototypes mechanism into the K-Means framework: a set of preselected samples serve as the centers of initial sub-clusters, and sub-clusters whose prototypes lie close to one another are merged into a single cluster. Building on SMCL, this paper proposes the DP-SMCL (Density Peak-SMCL) algorithm. DP-SMCL uses the density peak clustering algorithm to determine the initial set of prototypes, then merges nearby prototype-centered sub-clusters using a 1-D Gaussian mixture probability density model, which evaluates the similarity between two sub-clusters, to obtain the final clustering result. Experimental results show that DP-SMCL can be applied to non-spherical datasets, determines the number of clusters automatically, and produces more accurate clustering results than classical algorithms such as K-Means and DBSCAN (Density-Based Spatial Clustering of Applications with Noise). Compared with SMCL, DP-SMCL selects the initial prototypes quickly and significantly improves both clustering accuracy and execution efficiency.

Key words: K-Means, Multi-Prototypes, clustering, 1-D Gaussian mixture probability density model, non-spherical dataset
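
The abstract states that DP-SMCL takes its initial prototypes from density peak clustering but gives no implementation details. The following is a minimal sketch of the general density-peak heuristic only, not the authors' procedure. The function name `density_peak_centers`, the cutoff distance `d_c`, the simple counting kernel for density, and the fixed `n_centers` argument are all assumptions made for illustration.

```python
import math

def density_peak_centers(points, d_c, n_centers):
    """Pick candidate initial prototypes with the density-peak heuristic.

    rho[i]   = number of other points closer than the cutoff d_c
    delta[i] = distance to the nearest point that ranks higher in density
               (for the top-ranked point: its largest distance to any point)
    The indices with the largest rho * delta are returned.
    All parameter names here are illustrative assumptions.
    """
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]
    rho = [sum(1 for j in range(n) if j != i and dist[i][j] < d_c)
           for i in range(n)]
    # Rank by decreasing density; ties are broken by index for determinism.
    order = sorted(range(n), key=lambda i: (-rho[i], i))
    delta = [0.0] * n
    delta[order[0]] = max(dist[order[0]])
    for rank in range(1, n):
        i = order[rank]
        delta[i] = min(dist[i][j] for j in order[:rank])
    gamma = [rho[i] * delta[i] for i in range(n)]
    return sorted(range(n), key=lambda i: (-gamma[i], i))[:n_centers]

# Two well-separated groups of four points each.
points = [(0, 0), (0, 1), (1, 0), (0.4, 0.4),
          (10, 10), (10, 11), (11, 10), (10.4, 10.4)]
centers = density_peak_centers(points, d_c=1.2, n_centers=2)
print(centers)  # → [0, 4]
```

A point scores highly only if it is both locally dense and far from any denser point, which is why one index from each group is selected here. In a multi-prototype setting one would presumably request more prototypes than the expected number of clusters and leave the merging step to decide the final count.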

MEI Jie (梅婕), WEI Yuanyuan (魏圆圆), XU Taosheng (许桃胜). Fusion Clustering Algorithm Based on Multi-Prototypes Using Density Peaks[J]. Computer Engineering and Applications (计算机工程与应用), 2021, 57(22): 78-85.
