Fast clustering by synchronization on large sample

Abstract

The existing synchronization clustering algorithm Sync has high time complexity, which limits its use on large samples. This paper therefore proposes a new algorithm, Fast Clustering by Synchronization on Large Sample (FCSLS). FCSLS first compresses the large dataset with a sampling method based on kernel density estimation (KDE). It then runs synchronization clustering on the compressed set, using the Davies-Bouldin index to find the optimal number of clusters automatically. Finally, it clusters the remaining objects of the large dataset to obtain the final result. Experiments on synthetic datasets and real-world UCI datasets show that FCSLS detects clusters of arbitrary shape, density, and size on large datasets without a preset number of clusters. Compared with fast compression methods based on the reduced set density estimator (RSDE) and the center-constrained minimum enclosing ball (CCMEB), FCSLS greatly shortens the running time of synchronization clustering without loss of clustering accuracy.

Keywords: kernel density estimation (KDE), sampling, synchronization, large sample, clustering
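
The three stages named in the abstract (compress, synchronize, assign the rest) can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation, and several details are assumptions because the abstract does not specify them: KDE-based sampling is read here as drawing from a Gaussian KDE of the data (pick a point uniformly, add noise of bandwidth `h`); the synchronization step uses a Kuramoto-style update over an `eps`-neighbourhood; the Davies-Bouldin index is minimized over a hypothetical grid of `eps` values; and each remaining object takes the label of its nearest sampled point. All function and parameter names (`kde_sample`, `sync_cluster`, `davies_bouldin`, `fcsls_sketch`, `eps_grid`) are illustrative.

```python
import math
import random

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def kde_sample(data, m, h, rng):
    # Assumption: draw from the Gaussian KDE of the data, i.e. pick a point
    # uniformly at random and perturb it with N(0, h^2) noise per coordinate.
    return [[x + rng.gauss(0.0, h) for x in rng.choice(data)] for _ in range(m)]

def sync_cluster(points, eps, max_iter=50, tol=1e-4):
    # Kuramoto-style update: every point moves toward its eps-neighbours
    # until no coordinate moves by more than tol (or max_iter is reached).
    pos = [list(p) for p in points]
    dims = range(len(pos[0]))
    for _ in range(max_iter):
        new_pos, shift = [], 0.0
        for p in pos:
            nbrs = [q for q in pos if dist(p, q) <= eps]  # includes p itself
            step = [sum(math.sin(q[d] - p[d]) for q in nbrs) / len(nbrs) for d in dims]
            new_pos.append([p[d] + step[d] for d in dims])
            shift = max(shift, max(abs(s) for s in step))
        pos = new_pos
        if shift < tol:
            break
    # Points that collapsed onto (nearly) the same location share a label.
    reps, labels = [], []
    for p in pos:
        for k, r in enumerate(reps):
            if dist(p, r) <= eps / 2:
                labels.append(k)
                break
        else:
            reps.append(p)
            labels.append(len(reps) - 1)
    return labels

def davies_bouldin(points, labels):
    # Lower is better: mean over clusters of the worst (s_i + s_j) / d_ij ratio.
    # Requires at least two clusters with distinct centroids.
    cents, scat = [], []
    for k in sorted(set(labels)):
        members = [p for p, l in zip(points, labels) if l == k]
        c = [sum(col) / len(members) for col in zip(*members)]
        cents.append(c)
        scat.append(sum(dist(p, c) for p in members) / len(members))
    n = len(cents)
    return sum(max((scat[i] + scat[j]) / dist(cents[i], cents[j])
                   for j in range(n) if j != i) for i in range(n)) / n

def fcsls_sketch(data, m, h, eps_grid, seed=0):
    rng = random.Random(seed)
    sample = kde_sample(data, m, h, rng)            # stage 1: compress
    best = None
    for eps in eps_grid:                            # stage 2: sync + model selection
        labels = sync_cluster(sample, eps)
        k = len(set(labels))
        if 2 <= k < len(sample):                    # skip degenerate clusterings
            score = davies_bouldin(sample, labels)
            if best is None or score < best[0]:
                best = (score, labels)
    if best is None:
        raise ValueError("no eps in eps_grid produced a usable clustering")
    _, labels = best
    # Stage 3 (assumption): label each object by its nearest sampled point.
    return [labels[min(range(m), key=lambda i: dist(x, sample[i]))] for x in data]
```

The expensive pairwise synchronization runs only on the `m` sampled points, so its quadratic cost no longer depends on the full dataset size; the final assignment is a single nearest-neighbour pass over the data. The coordinates are assumed to be scaled so that neighbouring points differ by well under π/2, where the sine coupling still pulls points together.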

QIAO Ying, WANG Shitong. Fast clustering by synchronization on large sample[J]. Computer Engineering and Applications, 2016, 52(23): 159-166.
