Computer Engineering and Applications ›› 2012, Vol. 48 ›› Issue (9): 101-103.

• Database, Signal and Information Processing •

Detection of top-n global outliers in datasets based on hierarchical clustering

LIANG Binmei (梁斌梅)

  1. College of Mathematics and Information Science, Guangxi University, Nanning 530004, China
  2. College of Computer Science, Sichuan University, Chengdu 610065, China
  • Online: 2012-03-21 Published: 2012-04-11

Abstract: Outliers in a dataset can make data mining results inaccurate or even wrong. Existing outlier detection algorithms still fall short in generality, effectiveness, user-friendliness, and performance on large, high-dimensional datasets. To address this, an effective global outlier detection method is proposed. The method performs agglomerative hierarchical clustering, uses the resulting clustering tree and the distance matrix to judge visually how isolated each data point is, and from that determines the number of outliers. Outlying points are then removed from the clustering tree top-down, without supervision. Simulation experiments on several datasets show that the method effectively identifies the top-n global outliers, i.e. the n most isolated points, works on datasets of different shapes, is efficient and user-friendly, and is suitable for outlier detection in large, high-dimensional datasets.

Key words: outlier detection, hierarchical clustering, data mining
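
The abstract does not spell out the algorithm, so the following is only a minimal sketch of the general idea, not the paper's implementation. It assumes single-linkage merging, Euclidean distance, and a simple top-down rule in which the smaller branch at each split is taken as outlying; the function names `linkage`, `build_tree`, and `top_n_outliers` are hypothetical. The number of outliers n is passed in directly, whereas the paper determines it by visual inspection of the clustering tree and distance matrix.

```python
from itertools import combinations
from math import dist


def linkage(points, a, b):
    """Single-linkage distance between two clusters given as index lists (assumed linkage)."""
    return min(dist(points[i], points[j]) for i in a for j in b)


def build_tree(points):
    """Naive agglomerative clustering, for illustration only (not efficient)."""
    # Each node is (members, children); children is None for a leaf.
    nodes = [([i], None) for i in range(len(points))]
    while len(nodes) > 1:
        # Merge the two closest clusters.
        x, y = min(combinations(nodes, 2),
                   key=lambda pair: linkage(points, pair[0][0], pair[1][0]))
        nodes = [n for n in nodes if n is not x and n is not y]
        nodes.append((x[0] + y[0], (x, y)))
    return nodes[0]


def top_n_outliers(points, n):
    """Walk the tree from the root, peeling off the smaller branch at each split."""
    if not points:
        return []
    _, children = build_tree(points)
    outliers = []
    while children is not None and len(outliers) < n:
        small, large = sorted(children, key=lambda node: len(node[0]))
        if len(outliers) + len(small[0]) > n:
            break  # the next branch is too large to count as outliers
        outliers.extend(small[0])
        _, children = large
    return outliers


# Five clustered points plus two isolated ones (indices 5 and 6).
points = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5), (10, 10), (-8, 7)]
print(top_n_outliers(points, 2))
```

Because the most isolated points join the tree last under single linkage, they sit nearest the root, which is why a top-down walk reaches them first. A practical version would use an optimized clustering library rather than this naive loop.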