计算机工程与应用 ›› 2011, Vol. 47 ›› Issue (8): 138-142.

• 数据库、信号与信息处理 • 上一篇    下一篇

改进的共享最近邻聚类算法

李 霞,蒋盛益   

  1. 广东外语外贸大学 思科信息学院,广州 510006
  • 收稿日期:1900-01-01 修回日期:1900-01-01 出版日期:2011-03-11 发布日期:2011-03-11

Improved shared nearest neighbor clustering algorithm

LI Xia,JIANG Shengyi   

  1. Cisco School of Informatics,Guangdong University of Foreign Studies,Guangzhou 510006,China
  • Received:1900-01-01 Revised:1900-01-01 Online:2011-03-11 Published:2011-03-11

摘要: 聚类是一种无监督的机器学习方法,其任务是发现数据中的自然簇。共享最近邻聚类算法(SNN)在处理大小不同、形状不同以及密度不同的数据集上具有很好的聚类效果,但该算法还存在以下不足:(1)时间复杂度为O(n2),不适合处理大规模数据集;(2)没有明确给出参数阈值的简单指导性操作方法;(3)只能处理数值型属性数据集。对共享最近邻算法进行改进,使其能够处理混合属性数据集,并给出参数阈值的简单选择方法,改进后算法运行时间与数据集大小成近似线性关系,适用于大规模高维数据集。在真实数据集和人造数据集上的实验结果表明,提出的改进算法是有效可行的。

关键词: 共享最近邻聚类算法, 一趟聚类算法, 大规模数据集

Abstract: Clustering is a method of unsupervised learning in machine learning,the typical task of which is to discovery “natural” clusters present in the data.The shared nearest neighbor algorithm is one of the most efficient clustering algorithm which can handle datasets of different sizes,shapes and densities.But there are still some shortages about the algorithm.SNN can’t handle large dataset because of its high complexity.There are no definite methods about threshold of the algorithm.SNN can not process databases with mixture attributes.This paper improves the SNN algorithm to process the data with categorical attributes,gives a simple definite method to select threshold of the algorithm.The time complexity of the improved algorithm is nearly linear with the size of dataset and can be used to large dataset.The experimental results on real datasets and synthetic datasets show that the improved algorithm is effective and practicable.

Key words: shared nearest neighbor clustering algorithm, one-pass clustering algorithm, large dataset