Computer Engineering and Applications, 2022, Vol. 58, Issue 18: 162-171. DOI: 10.3778/j.issn.1002-8331.2201-0440

• Pattern Recognition and Artificial Intelligence •

Cauchy Kernel-Based Density Peaks Clustering Algorithm for Categorical Data

SHENG Jinchao, DU Mingjing, LI Yurui, SUN Jiarui   

  1. School of Computer Science and Technology, Jiangsu Normal University, Xuzhou, Jiangsu 221100, China
  • Online: 2022-09-15 Published: 2022-09-15

结合柯西核的分类型数据密度峰值聚类算法

盛锦超,杜明晶,李宇蕊,孙嘉睿   

  1. 江苏师范大学 计算机科学与技术学院,江苏 徐州 221100

Abstract: The density peaks clustering algorithm struggles to produce good clustering results on categorical data. This article analyzes the causes in detail and identifies two: the overlap problem in distance calculation and the aggregation problem in density calculation. To address both, it proposes a Cauchy kernel-based density peaks clustering algorithm for categorical data, referred to as CDPCD. The article first points out that the ordinal feature (the order relationship between attribute values of categorical data) is rarely considered in existing distance measures for categorical data, and then proposes a weighted ordered distance measure based on probability distribution to alleviate the overlap problem. Building on the shared nearest neighbor density peaks clustering algorithm, CDPCD re-evaluates data density values with a Cauchy kernel function and improves both the density calculation and the secondary assignment step, which enhances density diversity and reduces the impact of the aggregation problem. Experimental results on several real datasets show that CDPCD achieves better clustering results than traditional partition-based and density-based clustering algorithms.

Key words: categorical data, ordinal feature, density peak clustering, Cauchy kernel function, data mining
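
The two problems named in the abstract can be illustrated with a minimal sketch. This is not the CDPCD algorithm: the abstract gives no formulas, so the sketch assumes plain Hamming distance as the baseline categorical distance, the classic cutoff count as the baseline density, and the common Cauchy kernel form 1 / (1 + (d / dc)^2). The toy records, the value of `dc`, and all function names are hypothetical. It reads the overlap problem as many pairs sharing the same distance and the aggregation problem as many points sharing the same density, which is one plausible interpretation of the abstract.

```python
from itertools import combinations

# Toy categorical records (hypothetical data, not taken from the article).
data = [
    ("red", "small", "round"),
    ("red", "small", "square"),
    ("red", "large", "round"),
    ("blue", "small", "round"),
    ("blue", "large", "square"),
    ("green", "large", "square"),
]


def hamming(a, b):
    """Number of attributes on which two records disagree."""
    return sum(x != y for x, y in zip(a, b))


def cutoff_density(i, dc):
    """Baseline density: count of other points strictly closer than dc."""
    return sum(
        1
        for j in range(len(data))
        if j != i and hamming(data[i], data[j]) < dc
    )


def cauchy_density(i, dc):
    """Assumed Cauchy-kernel density: sum of 1 / (1 + (d / dc)^2) over other points."""
    return sum(
        1.0 / (1.0 + (hamming(data[i], data[j]) / dc) ** 2)
        for j in range(len(data))
        if j != i
    )


dc = 2  # assumed cutoff distance / kernel bandwidth

# Overlap: with m attributes, Hamming distance can take at most m + 1 values.
distances = [hamming(a, b) for a, b in combinations(data, 2)]

# Aggregation: integer neighbor counts collapse onto a few density levels.
cutoff = [cutoff_density(i, dc) for i in range(len(data))]

# The kernel weights every neighbor by distance, so densities spread out more.
# Rounding removes floating-point noise from different summation orders.
cauchy = [round(cauchy_density(i, dc), 6) for i in range(len(data))]

print("distinct pairwise distances:", sorted(set(distances)))
print("distinct cutoff densities:  ", len(set(cutoff)))
print("distinct Cauchy densities:  ", len(set(cauchy)))
```

On this toy data the 15 pairwise distances take only three distinct values, and the kernel-based density separates more points than the cutoff count does. The kernel still operates on the tied Hamming distances here, which is why the article pairs the density change with a new distance measure.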

摘要: 密度峰值聚类算法在处理分类型数据时难以产生较好的聚类效果。针对该现象,详细分析了其产生的原因:距离计算的重叠问题和密度计算的聚集问题。同时为了解决上述问题,提出了一种面向分类型数据的密度峰值聚类算法(Cauchy kernel-based density peaks clustering for categorical data,CDPCD)。算法首先指出分类型数据距离度量过程中有序特性(分类型数据属性值之间的顺序关系)鲜有考虑的现状,进而提出一种基于概率分布的加权有序距离度量来缓解重叠问题。通过结合柯西核函数,在共享最近邻密度峰值聚类算法基础上重新评估数据密度值,改进了密度计算和二次分配方式,增强了密度多样性,降低了聚集问题带来的影响。多个真实数据集上的实验结果表明,相较于传统的基于划分和密度的聚类算法,CDPCD都取得了更好的聚类结果。

关键词: 分类型数据, 有序特性, 密度峰值聚类, 柯西核函数, 数据挖掘