Computer Engineering and Applications ›› 2013, Vol. 49 ›› Issue (24): 122-129.


Random projection algorithm for outlier mining technology research

LI Qiao, ZHOU Yinglian, HUANG Sheng, MA Xiang   

  1. School of Information Science and Engineering, Hunan International Economics University, Changsha 410205, China
  • Online: 2013-12-15  Published: 2013-12-11

对随机投影算法的离群数据挖掘技术研究

李  桥,周莹莲,黄  胜,马  翔   

  1. 湖南涉外经济学院 信息科学与工程学院,长沙 410205

Abstract: Outlier mining in d-dimensional point sets is an active research area in data mining. Existing approaches based on distance or nearest neighbors perform poorly on high-dimensional data. To address this problem, this paper applies the angle-based outlier factor to high-dimensional outlier mining and proposes a novel random projection-based technique that estimates the angle-based outlier factor of all data points in time near-linear in the size of the data. The approach is also suited to parallel environments, where it can achieve parallel speedup. A theoretical analysis of the approximation quality is given to guarantee the reliability of the algorithm. Experiments on synthetic and real-world data sets show that the approach is efficient and scalable to very large high-dimensional data sets.

Key words: outlier data mining, angle, random projection algorithm, near-linear time, reliability, efficiency
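
To make the notion of an angle-based outlier factor concrete, the following is a minimal brute-force sketch, not the paper's algorithm. It assumes a simplified, unweighted definition: the score of a point is the variance of the angles it forms with every pair of other points, and a small variance suggests an outlier, since the rest of the data lies within a narrow cone of directions. The function name `angle_variance_scores` is hypothetical. This exact computation is cubic in the number of points; the random projection technique described in the abstract is what estimates such factors in near-linear time, and it is not reproduced here.

```python
import math
from itertools import combinations


def angle_variance_scores(points):
    """For each point, return the variance of the angles it forms with
    all pairs of other points (simplified, unweighted; an assumption)."""
    scores = []
    for i, p in enumerate(points):
        others = [q for j, q in enumerate(points) if j != i]
        angles = []
        for a, b in combinations(others, 2):
            # Difference vectors from p to the two other points.
            u = [x - y for x, y in zip(a, p)]
            v = [x - y for x, y in zip(b, p)]
            norm_u = math.sqrt(sum(x * x for x in u))
            norm_v = math.sqrt(sum(x * x for x in v))
            if norm_u == 0.0 or norm_v == 0.0:
                continue  # a duplicate of p: the angle is undefined
            cos = sum(x * y for x, y in zip(u, v)) / (norm_u * norm_v)
            cos = max(-1.0, min(1.0, cos))  # guard against rounding error
            angles.append(math.acos(cos))
        if not angles:
            scores.append(0.0)
            continue
        mean = sum(angles) / len(angles)
        scores.append(sum((t - mean) ** 2 for t in angles) / len(angles))
    return scores


if __name__ == "__main__":
    # A 3x3 grid of points plus one distant point (illustrative data only).
    pts = [(float(x), float(y)) for x in range(3) for y in range(3)]
    pts.append((50.0, 50.0))
    scores = angle_variance_scores(pts)
    # The point with the smallest angle variance is the outlier candidate.
    print(scores.index(min(scores)))
```

In this toy data set the distant point sees all grid points within a very narrow range of directions, so its angle variance is far smaller than that of any grid point.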

摘要: d 维点集离群数据挖掘技术是目前数据挖掘领域的研究热点之一。当前基于距离或最近邻概念进行离群数据挖掘时,在高维数据情况下的挖掘效果不佳,鉴于此,将基于角度的离群因子应用到高维离群数据挖掘中,提出一种新的基于随机投影算法的离群数据挖掘方案,它只需要用接近线性时间的方法就能预测所有数据点的基于角度的离群因子。该方法可以用于并行环境进行并行加速。对近似质量进行了理论分析,以保证算法的可靠性。合成和真实数据集实验结果表明,对超高维数据集,该方法效率高、可伸缩性强。

关键词: 离群数据挖掘, 角度, 随机投影算法, 接近线性时间, 可靠性, 效率