Computer Engineering and Applications ›› 2017, Vol. 53 ›› Issue (23): 102-107.DOI: 10.3778/j.issn.1002-8331.1606-0143

Previous Articles     Next Articles

k-modes algorithm based on structural similarity

HUANG Yuanhua1, XIE Feng1, HAO Zhifeng1,2,3, CAI Ruichu2   

  1. 1.School of Applied Mathematics, Guangdong University of Technology, Guangzhou 510520, China
    2.School of Computer Science, Guangdong University of Technology, Guangzhou 510006, China
    3.School of Mathematics and Big Data, Foshan University, Foshan, Guangdong 528000, China
  • Online:2017-12-01 Published:2017-12-14

基于结构相似性的k-modes算法

黄苑华1,谢  峰1,郝志峰1,2,3,蔡瑞初2   

  1. 1.广东工业大学 应用数学学院,广州 510520
    2.广东工业大学 计算机学院,广州 510006
    3.佛山科技技术学院 数学与大数据学院,广东 佛山 528000

Abstract: Clustering is one of the important technology in data mining, which is based on similar principles to classify data. However, categorical data clustering is an important and difficult issue among many learning algorithms. The traditional k-modes algorithm uses a simple 0-1 matching method to define dissimilarity between two attribute values, does not take the distribution of the entire data set into account, which results in inaccurate measurement differences. Aiming at this problem, a k-modes algorithm based on structure similarity is proposed. The algorithm not only considers the attribute values of their own similarities and differences, but also considers the structure of them in other attributes. The simulation results from two aspects of cluster identification and accuracy show that the k-modes algorithm based on structure similarity is more effective in scalability and accuracy.

Key words: cluster analysis, categorical data, dissimilarity measure, structural similarity, k-modes algorithm

摘要: 聚类是数据挖掘中重要的技术之一,它是按照相似原则将数据进行分类。然而分类型数据的聚类是学习算法中重要而又棘手的问题。传统的k-modes算法采用简单的0-1匹配方法定义两个属性值之间的相异度,没有将整个数据集的分布考虑进来,导致差异性度量不够准确。针对这个问题,提出基于结构相似性的k-modes算法。该算法不仅考虑属性值它们本身的异同,而且考虑了它们在其他属性下所处的结构。从集群识别和准确率两个方面进行仿真实验,表明基于结构相似性的k-modes算法在伸缩性和准确率方面更有效。

关键词: 聚类分析, 分类型数据, 相异度度量, 结构相似性, k-modes算法