Computer Engineering and Applications, 2019, Vol. 55, Issue (1): 168-173. DOI: 10.3778/j.issn.1002-8331.1709-0211


Clustering Algorithm for Mixed Categorical Data

LIN Qiang, TANG Jiashan   

  1. College of Science, Nanjing University of Posts and Telecommunications, Nanjing 210023, China
  • Online: 2019-01-01  Published: 2019-01-07

Abstract: The traditional K-modes algorithm uses simple matching to compute the distance between different values of the same attribute, and it gives every attribute the same weight when computing the distance between samples. To address these limitations, this paper proposes an improved clustering algorithm that is better suited to mixed categorical data. The algorithm takes into account the order relation among attribute values in ordered (ordinal) categorical data, the similarity between different attribute values in unordered (nominal) categorical data, and the relationships between attributes. It applies different distance measures to ordered and unordered categorical data and assigns attribute weights by average entropy. Experimental results show that the improved algorithm achieves better clustering than the K-modes algorithm and improved versions of it on both artificial and real data sets.

Key words: clustering algorithm, mixed categorical data, distance metric, K-modes algorithm
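
To make the limitation described in the abstract concrete, the sketch below implements the standard K-modes simple matching distance and contrasts it with a rank-based distance for ordered values. Only the simple matching distance is taken from the K-modes baseline. The functions `ordinal_distance` and `attribute_entropy` are hypothetical illustrations: the abstract does not give the paper's distance formulas or its average-entropy weighting, so these are assumptions and should not be read as the authors' method.

```python
from collections import Counter
from math import log2


def simple_matching_distance(x, y):
    """K-modes baseline: count the attributes on which two records differ.

    Every mismatch costs 1 and every attribute has the same weight.
    """
    if len(x) != len(y):
        raise ValueError("records must have the same number of attributes")
    return sum(a != b for a, b in zip(x, y))


def ordinal_distance(a, b, levels):
    """Hypothetical ordered-attribute distance (an assumption, not the paper's).

    Returns the rank gap between a and b, normalised to [0, 1].
    """
    if len(levels) < 2:
        return 0.0
    return abs(levels.index(a) - levels.index(b)) / (len(levels) - 1)


def attribute_entropy(column):
    """Plain Shannon entropy (in bits) of one categorical column.

    Shown only as an ingredient an entropy-based weight could use;
    the paper's average-entropy weighting is not specified in the abstract.
    """
    n = len(column)
    return -sum((c / n) * log2(c / n) for c in Counter(column).values())


levels = ["low", "medium", "high"]

# Simple matching cannot tell "low vs high" from "low vs medium".
print(simple_matching_distance(("low", "red"), ("high", "red")))    # 1
print(simple_matching_distance(("low", "red"), ("medium", "red")))  # 1

# A rank-based distance separates the two cases.
print(ordinal_distance("low", "high", levels))    # 1.0
print(ordinal_distance("low", "medium", levels))  # 0.5

# Entropy of a column with two equally frequent values.
print(attribute_entropy(["a", "a", "b", "b"]))    # 1.0
```

The first two calls return the same distance even though "high" is further from "low" than "medium" is, which is the loss of order information that motivates treating ordered and unordered attributes differently.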
