Algorithm for clustering of high-dimensional data mixed with numeric and categorical attributes

Abstract

Many real-world datasets contain both numeric and categorical attributes, and k-prototype is one of the most important algorithms designed for clustering this type of data. Building on a study of existing clustering algorithms for mixed data, this paper proposes a new algorithm for clustering mixed data. It modifies the way prototypes are searched for and designs a new similarity measure between data objects. It also proposes a method, different from that of k-prototype, for allocating data objects to prototypes, and then clusters using an idea similar to agglomerative clustering in hierarchical clustering. Tests on four real mixed datasets show that, compared with other algorithms, the proposed algorithm improves clustering accuracy and is highly stable.

Key words: clustering, mixed data, similarity measure, hierarchical clustering

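Abstract (translated from the Chinese): In many applications, datasets have mixed numeric and categorical features. k-prototype, a clustering method based on k-means and k-mode, is one of the classic methods for clustering such data. After studying existing clustering methods for mixed-attribute data, a new algorithm for clustering mixed data is introduced. It improves how prototypes are selected and proposes a new similarity measure for mixed data. Based on this measure, it further proposes a way of assigning data to prototypes that differs from k-prototype, and it clusters using an idea similar to agglomerative clustering in hierarchical clustering. Tests on four real mixed datasets show that, compared with traditional algorithms, the algorithm improves clustering accuracy and stability.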

Key words (translated from the Chinese): clustering, mixed data, similarity computation, hierarchical clustering
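
For orientation, the k-prototype baseline named in the abstract is commonly described as combining a k-means-style squared Euclidean distance on numeric attributes with a k-modes-style mismatch count on categorical attributes. The sketch below illustrates that baseline only. It is not the new similarity measure or allocation method proposed in this paper, which the abstract does not specify. The function names and the weight `gamma` are illustrative assumptions.

```python
from typing import Sequence


def k_prototypes_dissimilarity(
    x_num: Sequence[float],
    x_cat: Sequence[str],
    proto_num: Sequence[float],
    proto_cat: Sequence[str],
    gamma: float = 1.0,
) -> float:
    """Baseline mixed dissimilarity between an object and a prototype.

    Numeric part: squared Euclidean distance (as in k-means).
    Categorical part: number of mismatched attributes (as in k-modes).
    gamma (assumed name) weights the categorical part against the numeric part.
    """
    if len(x_num) != len(proto_num) or len(x_cat) != len(proto_cat):
        raise ValueError("object and prototype must have the same attributes")
    numeric_part = sum((a - b) ** 2 for a, b in zip(x_num, proto_num))
    categorical_part = sum(1 for a, b in zip(x_cat, proto_cat) if a != b)
    return numeric_part + gamma * categorical_part


def nearest_prototype(x_num, x_cat, prototypes, gamma: float = 1.0) -> int:
    """Return the index of the prototype with the smallest dissimilarity.

    `prototypes` is a list of (numeric_values, categorical_values) pairs.
    """
    distances = [
        k_prototypes_dissimilarity(x_num, x_cat, p_num, p_cat, gamma)
        for p_num, p_cat in prototypes
    ]
    return distances.index(min(distances))
```

In this baseline, each object is simply assigned to its nearest prototype. The abstract states that the proposed algorithm replaces both the similarity measure and this allocation step.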

SUN Haojun, SHAN Guanghui, GAO Yulong, YUAN Ting. Algorithm for clustering of high-dimensional data mixed with numeric and categorical attributes[J]. Computer Engineering and Applications, 2015, 51(8): 128-133.

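SUN Haojun, SHAN Guanghui, GAO Yulong, YUAN Ting. 一种高维混合属性数据聚类算法 (A clustering algorithm for high-dimensional mixed-attribute data) [J]. 计算机工程与应用 (Computer Engineering and Applications), 2015, 51(8): 128-133.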
