Cauchy Kernel-Based Density Peaks Clustering Algorithm for Categorical Data

doi:10.3778/j.issn.1002-8331.2201-0440

Abstract

Abstract: The density peaks clustering algorithm struggles to produce good clustering results on categorical data. This article analyzes the causes in detail: the overlap problem in distance calculation and the aggregation problem in density calculation. To address both problems, it proposes a density peaks clustering algorithm for categorical data (Cauchy kernel-based density peaks clustering for categorical data, CDPCD). The article first points out that the ordinal feature (the order relationship between attribute values of categorical data) is rarely considered in existing distance measures for categorical data, and then proposes a weighted ordered distance measure based on probability distribution to alleviate the overlap problem. Building on the shared-nearest-neighbor density peaks clustering algorithm, it re-evaluates data density values using the Cauchy kernel function and improves both the density calculation and the secondary assignment procedure, which increases the diversity of density values and reduces the impact of the aggregation problem. Experimental results on several real datasets show that CDPCD achieves better clustering results than traditional partition-based and density-based clustering algorithms.

Key words: categorical data, ordinal feature, density peaks clustering, Cauchy kernel function, data mining
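
The aggregation problem named in the abstract can be illustrated with a minimal sketch. It is not the CDPCD formulation: plain Hamming distance stands in for the article's weighted ordered distance, the Cauchy kernel is taken in a common textbook form 1 / (1 + (d / sigma)^2), and the toy data, the cutoff `dc`, and the bandwidth `sigma` are hypothetical values chosen only for illustration. The sketch compares a cutoff-count density with a Cauchy-kernel density and counts how many distinct density values each one yields.

```python
import numpy as np

def hamming_distance_matrix(X):
    """Pairwise Hamming distance: number of attributes on which two rows differ."""
    X = np.asarray(X)
    return (X[:, None, :] != X[None, :, :]).sum(axis=2).astype(float)

def cutoff_density(D, dc):
    """Cutoff-style density: number of other points at distance below dc."""
    return (D < dc).sum(axis=1) - 1  # subtract 1 to exclude the point itself

def cauchy_density(D, sigma=1.0):
    """Assumed Cauchy-kernel density: sum over other points of 1 / (1 + (d / sigma)^2)."""
    K = 1.0 / (1.0 + (D / sigma) ** 2)
    np.fill_diagonal(K, 0.0)  # a point does not contribute to its own density
    return K.sum(axis=1)

# Hypothetical toy data: six objects, three categorical attributes coded as integers.
X = np.array([
    [0, 0, 0],
    [0, 0, 1],
    [0, 1, 1],
    [1, 1, 1],
    [2, 2, 2],
    [2, 2, 0],
])

D = hamming_distance_matrix(X)
rho_cut = cutoff_density(D, dc=2)
rho_cauchy = cauchy_density(D, sigma=1.0)

# Count distinct density values; rounding guards against floating-point noise.
n_distinct_cut = len(set(rho_cut.tolist()))
n_distinct_cauchy = len(set(np.round(rho_cauchy, 6).tolist()))
print("cutoff densities:", rho_cut, "distinct:", n_distinct_cut)
print("Cauchy densities:", np.round(rho_cauchy, 3), "distinct:", n_distinct_cauchy)
```

On this toy input the kernel density separates more objects than the cutoff count does, because every pairwise distance contributes a graded weight instead of a 0/1 vote. Ties still remain, since Hamming distance takes only a few distinct values, which is the overlap problem that a finer distance measure is meant to ease.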

SHENG Jinchao, DU Mingjing, LI Yurui, SUN Jiarui. Cauchy Kernel-Based Density Peaks Clustering Algorithm for Categorical Data[J]. Computer Engineering and Applications, 2022, 58(18): 162-171.
