Computer Engineering and Applications ›› 2013, Vol. 49 ›› Issue (11): 105-109.

• Database, Data Mining, Machine Learning •

Research on dimension reduction of data stream based on kernel principal component analysis

GAO Hongbin, HOU Jie, LI Ruiguang

  1. School of Computer Science and Technology, Wuyi University, Jiangmen, Guangdong 529020, China
  • Online: 2013-06-01  Published: 2013-06-14

Abstract: The principles and implementations of two dimension reduction algorithms for data streams, PCA and KPCA, are analyzed. On large data sets, linear PCA cannot reduce the dimension effectively and KPCA is inefficient, so a new dimension reduction strategy, GKPCA, is proposed. With GKPCA, the data set is first partitioned into groups and KPCA is applied to each group. The results are then filtered and recombined into a new data set, to which KPCA is applied again. This process proceeds recursively until a reduction threshold is reached, which simplifies the sample space and reduces the time and space complexity of KPCA. Experiments on different data sets show that GKPCA achieves good dimension reduction while consuming less time.

Key words: Kernel Principal Component Analysis (KPCA), data stream, dimension reduction
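
The group-then-recombine strategy described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The abstract does not specify the kernel, the filtering rule, or the stopping threshold, so everything below is an assumption: the Gaussian kernel and its `gamma` parameter, the function names `rbf_kernel`, `kpca` and `gkpca`, the parameters `n_groups` and `keep_per_group`, and in particular the filter, which here simply keeps the samples of each group with the largest energy in the leading components. Only one round of grouping is shown, not the recursive version.

```python
import numpy as np


def rbf_kernel(A, B, gamma):
    """Gaussian (RBF) kernel matrix between the rows of A and the rows of B."""
    sq = (A ** 2).sum(axis=1)[:, None] + (B ** 2).sum(axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * np.maximum(sq, 0.0))


def kpca(X, n_components, gamma):
    """Plain KPCA: project the rows of X onto the leading kernel components."""
    n = X.shape[0]
    K = rbf_kernel(X, X, gamma)
    one = np.full((n, n), 1.0 / n)
    Kc = K - one @ K - K @ one + one @ K @ one   # centre the data in feature space
    vals, vecs = np.linalg.eigh(Kc)              # eigenvalues in ascending order
    idx = np.argsort(vals)[::-1][:n_components]  # indices of the largest eigenvalues
    vals = np.maximum(vals[idx], 0.0)            # clip tiny negative round-off
    return vecs[:, idx] * np.sqrt(vals)          # projections of the training samples


def gkpca(X, n_groups, keep_per_group, n_components, gamma):
    """One round of grouped KPCA: KPCA per group, filter, recombine, KPCA again.

    Returns the recombined (reduced) sample set and its KPCA projections.
    """
    kept = []
    for group in np.array_split(X, n_groups):
        Z = kpca(group, n_components, gamma)
        # ASSUMPTION: the filter keeps the samples with the largest squared
        # projection norm; the abstract does not say which rule the paper uses.
        score = (Z ** 2).sum(axis=1)
        top = np.argsort(score)[::-1][:keep_per_group]
        kept.append(group[np.sort(top)])         # preserve the original sample order
    reduced = np.vstack(kept)
    return reduced, kpca(reduced, n_components, gamma)
```

Under these assumptions the saving comes from the cubic cost of a dense eigendecomposition. With n samples split into g groups, the per-group step costs on the order of g·(n/g)³ = n³/g² operations instead of n³, and the final KPCA runs only on the g·`keep_per_group` retained samples.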