Improved K-means Algorithm Based on Distance and Weight

doi:10.3778/j.issn.1002-8331.2009-0103

Abstract

The K-means clustering algorithm is simple, efficient, and widely used. However, the traditional algorithm selects its initial cluster centers at random, which makes it prone to falling into local optima, and it requires the value of K to be set manually. To obtain the most appropriate initial cluster centers, an improved K-means algorithm based on distance and sample weight is proposed. The algorithm uses a dimension-weighted Euclidean distance to measure the distance between sample points. After the density and weight of every sample are calculated, the point with the highest density is taken as the first initial cluster center, and all samples within its cluster are removed from the data set. The next initial cluster center is then found from the previous cluster center and the weights of the remaining sample points, using the introduced parameter [τi]. This process is repeated until the data set is empty, at which point [k] initial cluster centers have been obtained automatically. Experiments are carried out on the UCI data set. Compared with the classical K-means, WK-means, ZK-means, and DCK-means algorithms, the improved K-means algorithm based on distance and weight achieves a better clustering effect.

Key words: data mining, K-means algorithm, initial cluster center, weighted Euclidean distance, weight product
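
The seeding procedure outlined in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the formulas for the dimension weights, the sample density, the sample weight, or the parameter [τi], so everything below is an assumption made for illustration only: dimensions are weighted by normalized inverse variance, density is the number of samples within a hypothetical `radius`, and the next center maximizes density times distance to the previous center as a stand-in for the paper's weight and [τi] rule. The function names are likewise hypothetical.

```python
import math
from typing import List, Sequence

Point = Sequence[float]


def dimension_weights(data: List[Point]) -> List[float]:
    """ASSUMPTION: weight each dimension by normalized inverse variance."""
    n, dims = len(data), len(data[0])
    inv = []
    for d in range(dims):
        mean = sum(p[d] for p in data) / n
        var = sum((p[d] - mean) ** 2 for p in data) / n
        inv.append(1.0 / var if var > 0 else 0.0)
    total = sum(inv)
    # Fall back to uniform weights if every dimension is constant.
    return [v / total for v in inv] if total > 0 else [1.0 / dims] * dims


def weighted_distance(a: Point, b: Point, w: Sequence[float]) -> float:
    """Dimension-weighted Euclidean distance."""
    return math.sqrt(sum(wi * (ai - bi) ** 2 for ai, bi, wi in zip(a, b, w)))


def select_initial_centers(data: List[Point], radius: float) -> List[Point]:
    """Pick initial centers until every sample has been assigned to one.

    ASSUMPTIONS: density = number of samples within `radius`; the score
    for later centers (density * distance to the previous center) is a
    stand-in, not the weight and tau_i rule defined in the paper.
    """
    w = dimension_weights(data)
    remaining = list(range(len(data)))
    density = {
        i: sum(1 for j in remaining
               if weighted_distance(data[i], data[j], w) <= radius)
        for i in remaining
    }
    centers: List[Point] = []
    last = None
    while remaining:
        if last is None:
            # First center: the densest sample.
            best = max(remaining, key=lambda i: density[i])
        else:
            # Later centers: dense samples far from the previous center.
            best = max(
                remaining,
                key=lambda i: density[i]
                * weighted_distance(data[i], data[last], w),
            )
        centers.append(data[best])
        # Remove the new center and every sample within its radius.
        remaining = [
            i for i in remaining
            if weighted_distance(data[i], data[best], w) > radius
        ]
        last = best
    return centers


samples = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers = select_initial_centers(samples, radius=2.0)
```

Because the loop runs until no samples remain, the number of centers falls out of the procedure rather than being supplied by the user, which is the property the abstract emphasizes. In this sketch that number depends on the assumed `radius`; the resulting centers would then be passed to ordinary K-means iterations as the starting point.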


WANG Zilong, LI Jin, SONG Yafei. Improved K-means Algorithm Based on Distance and Weight[J]. Computer Engineering and Applications, 2020, 56(23): 87-94.

王子龙，李进，宋亚飞. 基于距离和权重改进的K-means算法[J]. 计算机工程与应用, 2020, 56(23): 87-94.

