Computer Engineering and Applications, 2020, Vol. 56, Issue (6): 186-193. DOI: 10.3778/j.issn.1002-8331.1901-0423

Improved Distance Formula of [K]-modes Clustering Algorithm for Mixed Categorical Attribute Data

YUAN Fang, YANG Youlong   

  1. School of Mathematics and Statistics, Xidian University, Xi’an 710126, China
  • Online: 2020-03-15   Published: 2020-03-13

针对混合型分类数据改进的[K]-modes算法距离公式

袁方,杨有龙   

  1. 西安电子科技大学 数学与统计学院,西安 710126

Abstract:

The traditional [K]-modes algorithm is widely used for clustering categorical data, but it does not distinguish between ordinal categorical attributes and unordered (nominal) categorical attributes. This paper distinguishes the two attribute types, proposes a new distance formula, and optimizes the algorithm flow. The distance values of the unordered attributes are used to determine a reasonable range for the distance between two adjacent values of an ordinal attribute. A distance formula for ordinal attributes is then constructed from the order relation among their values. When the distance between a data point and a centroid is computed, the proportion of each attribute value within the cluster is introduced as a parameter. The new formula describes distances between ordinal values well and balances the difference between the distance formulas for the two attribute types. Experiments on real UCI data sets show that the proposed algorithm and distance formula are more effective than the original [K]-modes algorithm and its improved variants.
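
The abstract does not state the formula itself, so the following is only a minimal sketch of the general idea it describes: a point-to-cluster distance that treats ordinal and unordered attributes differently and uses the proportion of each attribute value within the cluster. Everything here is an assumption made for illustration, not the authors' formula: the function names, the choice of 1 minus the value's proportion for unordered attributes, and the proportion-weighted rank gap (scaled to [0, 1]) for ordinal attributes.

```python
from collections import Counter


def value_proportions(cluster, j):
    """Share of each value of attribute j among the rows of a cluster."""
    counts = Counter(row[j] for row in cluster)
    n = len(cluster)
    return {v: c / n for v, c in counts.items()}


def nominal_distance(x, props):
    # Assumed form: 0 when every cluster member has value x, 1 when none does.
    return 1.0 - props.get(x, 0.0)


def ordinal_distance(x, props, order):
    # Assumed form: proportion-weighted rank gap, scaled to [0, 1] so that it
    # is comparable in size with the nominal distance above.
    rank = {v: i for i, v in enumerate(order)}
    span = max(len(order) - 1, 1)  # avoid division by zero for one-level scales
    return sum(p * abs(rank[x] - rank[v]) for v, p in props.items()) / span


def point_to_cluster_distance(point, cluster, ordinal_orders):
    """Sum the per-attribute distances between a point and a cluster.

    ordinal_orders maps an attribute index to its values listed in order;
    attributes missing from the mapping are treated as unordered.
    """
    total = 0.0
    for j, x in enumerate(point):
        props = value_proportions(cluster, j)
        if j in ordinal_orders:
            total += ordinal_distance(x, props, ordinal_orders[j])
        else:
            total += nominal_distance(x, props)
    return total


# Hypothetical toy cluster: attribute 0 is unordered, attribute 1 is ordinal.
cluster = [("red", "low"), ("red", "mid"), ("blue", "low"), ("red", "high")]
orders = {1: ["low", "mid", "high"]}
print(point_to_cluster_distance(("red", "low"), cluster, orders))    # 0.625
print(point_to_cluster_distance(("blue", "high"), cluster, orders))  # 1.375
```

In this sketch a point whose ordinal value lies far from most of the cluster's values receives a larger distance than one that is only one level away, which a plain match/mismatch count cannot express.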

Key words: [K]-modes algorithm, ordinal attribute, mixed-type data, distance formula for mixed-type data

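Abstract (translated from the Chinese original):

The traditional [K]-modes algorithm is widely applied to clustering categorical attributes, but it does not distinguish ordinal categorical attributes from unordered ones. On the basis of this distinction, a new distance formula is proposed and the algorithm flow is optimized. A reasonable range for the distance between adjacent values of an ordinal attribute is determined from the distance values of the unordered attributes. The order relation implied by an ordinal attribute is used to construct its distance formula. When the distance between a sample point and a centroid is computed, the proportion of each attribute value within the cluster is introduced as an important parameter of the overall distance formula. In sum, the new distance formula characterizes distances for ordinal attributes well and balances the difference between the distance formulas of the two attribute types. Experimental results show that the proposed algorithm and distance formula perform markedly better on real UCI data sets than the original [K]-modes algorithm and its improved variants.

Key words (translated from the Chinese original): [K]-modes algorithm, ordinal categorical attribute, mixed-type data, distance formula for mixed-type data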