Computer Engineering and Applications (计算机工程与应用) ›› 2020, Vol. 56 ›› Issue (12): 47-53. DOI: 10.3778/j.issn.1002-8331.1905-0357

• Theory, Research and Development •

Density Peak Improvement Algorithm for Clustering Hybrid Data

TAN Yang (谭阳), TANG Dequan (唐德权), CAO Shoufu (曹守富)

  1. College of Mathematics and Statistics, Hunan Normal University, Changsha 410081, China
  2. Department of Network Technology, Hunan Radio and Television University, Changsha 410004, China
  3. Department of Information Technology, Hunan Police Academy, Changsha 410138, China
  • Publication date: 2020-06-15 Release date: 2020-06-09

Abstract:

When clustering mixed-type data, similarity is usually evaluated separately for each category of sample attribute. However, measuring the attributes in separate subspaces breaks the original unity of a sample's attributes and introduces inconsistent measurement bias into the similarity evaluation of individual samples. To address this problem, a new clustering algorithm is proposed that encodes sample attributes in binary and then applies a single unified measure, the Hamming difference, to the attribute codes. By measuring similarity for mixed data within one framework, the new algorithm avoids splitting the sample attributes. On this basis, it assigns different weights to attributes according to their properties and uses the weighted measure to evaluate the similarity between individual samples. Experimental results show that the new algorithm clusters mixed data effectively and, compared with other existing clustering algorithms, achieves better clustering accuracy and stability.

Key words: clustering, hybrid data, density peak, attribute coding, Hamming metric
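
The idea summarized in the abstract, binary-coding every attribute and comparing the codes with a weighted Hamming difference, can be illustrated with a minimal Python sketch. The abstract does not specify the encoding scheme or how the weights are chosen, so everything here is an assumption for illustration: the function names, the one-hot code for categorical attributes, the thermometer (unary) code for numeric attributes, and the example weights. The density peak step itself is not shown.

```python
from typing import List, Sequence


def encode_categorical(value: str, categories: Sequence[str]) -> List[int]:
    """One-hot code (assumed scheme): a single 1 at the position of `value`."""
    return [1 if value == c else 0 for c in categories]


def encode_numeric(value: float, low: float, high: float, bits: int) -> List[int]:
    """Thermometer code (assumed scheme): larger values set more leading bits,
    so the Hamming difference between two codes grows with the numeric gap."""
    if high <= low:
        raise ValueError("high must be greater than low")
    clipped = min(max(value, low), high)
    level = round((clipped - low) / (high - low) * bits)
    return [1] * level + [0] * (bits - level)


def weighted_hamming(
    a: Sequence[Sequence[int]],
    b: Sequence[Sequence[int]],
    weights: Sequence[float],
) -> float:
    """Sum of per-attribute Hamming differences, each scaled by its weight."""
    total = 0.0
    for code_a, code_b, w in zip(a, b, weights):
        diff = sum(x != y for x, y in zip(code_a, code_b))
        total += w * diff
    return total


# Hypothetical data: one categorical attribute (colour) and one numeric
# attribute (age on a 0-100 scale, coded with 4 bits).
COLOURS = ["red", "green", "blue"]
WEIGHTS = [0.5, 0.25]  # illustrative weights, not taken from the paper

sample_a = [encode_categorical("red", COLOURS), encode_numeric(25, 0, 100, 4)]
sample_b = [encode_categorical("blue", COLOURS), encode_numeric(75, 0, 100, 4)]

distance = weighted_hamming(sample_a, sample_b, WEIGHTS)
print(distance)
```

Because both attribute types end up as bit strings, a single distance function covers the whole sample, which is the unified-measurement property the abstract emphasizes. The weights here also compensate for the different code lengths of the two attributes.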