Computer Engineering and Applications (计算机工程与应用), 2015, Vol. 51, Issue 8: 128-133.

• Databases, Data Mining, Machine Learning •

Algorithm for clustering of high-dimensional data mixed with numeric and categorical attributes

SUN Haojun (孙浩军), SHAN Guanghui (闪光辉), GAO Yulong (高玉龙), YUAN Ting (袁婷)

  1. College of Engineering, Shantou University, Shantou, Guangdong 515063, China
  • Published: 2015-04-15 Online: 2015-04-29

Abstract: Many real-world datasets contain both numeric and categorical attributes. k-prototype, which is based on k-means and k-modes, is one of the classic algorithms for clustering such data. After reviewing existing clustering methods for mixed-attribute data, this paper proposes a new algorithm for clustering mixed data. The algorithm improves how prototypes are selected and introduces a new similarity measure for mixed data. Building on this measure, it also proposes a way of assigning data objects to prototypes that differs from k-prototype, and it performs the clustering with an idea similar to agglomerative clustering in hierarchical clustering. Tests on four real mixed datasets show that, compared with traditional algorithms, the proposed algorithm improves both the accuracy and the stability of clustering.

Key words: clustering, mixed data, similarity measure, hierarchical clustering
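
For context, the baseline that the abstract compares against can be sketched as follows. This is a minimal illustration of the classic k-prototype dissimilarity (squared Euclidean distance on numeric attributes plus a weighted count of categorical mismatches) and its nearest-prototype assignment. It is not the similarity measure or assignment rule proposed in this paper, which the abstract does not specify. The function names, the record layout, and the weight `gamma` are assumptions made for illustration only.

```python
def mixed_dissimilarity(x_num, x_cat, p_num, p_cat, gamma=1.0):
    """Classic k-prototype cost between a record and a prototype.

    x_num, p_num: numeric attribute values (sequences of floats).
    x_cat, p_cat: categorical attribute values (sequences of labels).
    gamma: assumed weight balancing the categorical part against the numeric part.
    """
    # Numeric part: squared Euclidean distance, as in k-means.
    numeric_part = sum((a - b) ** 2 for a, b in zip(x_num, p_num))
    # Categorical part: number of mismatched attributes, as in k-modes.
    categorical_part = sum(a != b for a, b in zip(x_cat, p_cat))
    return numeric_part + gamma * categorical_part


def nearest_prototype(record, prototypes, gamma=1.0):
    """Return the index of the prototype with the smallest mixed cost."""
    x_num, x_cat = record
    costs = [
        mixed_dissimilarity(x_num, x_cat, p_num, p_cat, gamma)
        for p_num, p_cat in prototypes
    ]
    return min(range(len(costs)), key=costs.__getitem__)


# Hypothetical toy data: each item is (numeric values, categorical values).
prototypes = [
    ([0.0, 0.0], ("red", "large")),
    ([1.0, 2.0], ("blue", "small")),
]
record = ([1.0, 2.0], ("red", "small"))
print(nearest_prototype(record, prototypes, gamma=0.5))
```

The choice of `gamma` controls how strongly categorical mismatches count relative to numeric distance, which is one reason a different similarity measure for mixed data, such as the one this paper proposes, can change the clustering result.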