Fast clustering by synchronization on large sample

Computer Engineering and Applications, 2016, Vol. 52, Issue 23: 159-166.

QIAO Ying, WANG Shitong

College of Digital Media, Jiangnan University, Wuxi, Jiangsu 214122, China

Online: 2016-12-01  Published: 2016-12-20

Abstract

Abstract: The existing synchronization clustering algorithm Sync has high time complexity, which severely limits its use on large-sample datasets. To address this, a fast synchronization clustering algorithm for large samples, Fast Clustering by Synchronization on Large Sample (FCSLS), is proposed. FCSLS first compresses the large dataset with a sampling method based on kernel density estimation (KDE). It then runs synchronization clustering on the compressed set, using the Davies-Bouldin index to select the best number of clusters automatically. Finally, it clusters the remaining large-scale data to obtain the final clustering result. Experiments on synthetic datasets and real-world UCI datasets show that FCSLS can find clusters of arbitrary shape, density, and size in large datasets without a preset number of clusters. Compared with fast compression methods based on reduced set density estimation (RSDE) and the center-constrained minimum enclosing ball (CCMEB), FCSLS greatly shortens the running time of synchronization clustering without loss of clustering accuracy.

Key words: Kernel Density Estimate (KDE), sampling, synchronization, large sample, clustering
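
The abstract states that the Davies-Bouldin index is used to pick the best number of clusters. The following is a minimal sketch of that selection step only, not the authors' implementation: it uses the standard definition of the index (mean, over clusters, of the worst-case ratio of summed within-cluster scatter to centroid distance; lower is better). The function names `centroid`, `davies_bouldin`, and `pick_best` and the toy data are hypothetical, and the sketch assumes Euclidean distance and distinct cluster centroids.

```python
import math


def centroid(points):
    """Component-wise mean of a non-empty list of equal-length tuples."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))


def davies_bouldin(points, labels):
    """Davies-Bouldin index of a labeling (lower is better).

    Assumption: cluster centroids are pairwise distinct; otherwise the
    centroid distance is zero and the ratio below is undefined.
    """
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    if len(clusters) < 2:
        raise ValueError("the index needs at least two clusters")

    centers = {lab: centroid(members) for lab, members in clusters.items()}
    # Scatter s_i: mean Euclidean distance from a cluster's points to its centroid.
    scatter = {
        lab: sum(math.dist(p, centers[lab]) for p in members) / len(members)
        for lab, members in clusters.items()
    }

    total = 0.0
    for i in clusters:
        # Worst-case similarity between cluster i and any other cluster j.
        total += max(
            (scatter[i] + scatter[j]) / math.dist(centers[i], centers[j])
            for j in clusters
            if j != i
        )
    return total / len(clusters)


def pick_best(points, candidate_labelings):
    """Return the candidate labeling with the smallest Davies-Bouldin index."""
    return min(candidate_labelings, key=lambda labels: davies_bouldin(points, labels))


# Hypothetical toy data: two well-separated pairs of points.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good = [0, 0, 1, 1]  # groups nearby points together
bad = [0, 1, 0, 1]   # groups distant points together
best = pick_best(points, [bad, good])
```

On this toy data the labeling `good` scores lower than `bad`, so `pick_best` returns `good`. In a pipeline like the one the abstract outlines, the candidate labelings would presumably come from synchronization clustering runs on the compressed set; that connection is an assumption here, since the abstract does not give the details.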

QIAO Ying, WANG Shitong. Fast clustering by synchronization on large sample[J]. Computer Engineering and Applications, 2016, 52(23): 159-166.
