Computer Engineering and Applications ›› 2021, Vol. 57 ›› Issue (21): 102-108. DOI: 10.3778/j.issn.1002-8331.2010-0096


Online Streaming Feature Selection Algorithm Using Neighborhood Information Interaction

LI Longzhu, LIN Yaojin, LYU Yan, LU Shun, WANG Chenxi   

  1. School of Computer Science, Minnan Normal University, Zhangzhou, Fujian 363000, China
  2. Key Laboratory of Data Science and Intelligence Application, Minnan Normal University, Zhangzhou, Fujian 363000, China
  • Online: 2021-11-01 Published: 2021-11-04

利用邻域信息交互的在线流特征选择算法

李珑珠,林耀进,吕彦,卢舜,王晨曦   

  1. 闽南师范大学 计算机学院,福建 漳州 363000
  2. 数据科学与智能应用福建省高等学校重点实验室,福建 漳州 363000

Abstract:

In open, dynamic environments, machine learning tasks face a feature space that is both high-dimensional and dynamic. Existing online streaming feature selection algorithms generally consider only the importance of individual features and the redundancy between features, and ignore the interaction between features. Feature interaction refers to a feature that is irrelevant or only weakly relevant to the labels on its own but becomes strongly correlated with the labels when combined with other features. Motivated by this, an online streaming feature selection algorithm based on neighborhood information interaction is proposed. The algorithm has two phases: online interaction feature selection, which directly computes the strength of interaction between each newly arrived feature and the entire selected feature subset, and online redundant feature deletion, which removes redundant features using a pairwise comparison mechanism. Experiments on ten data sets show that the proposed algorithm is effective.

Key words: feature selection, streaming feature, feature interaction, neighborhood mutual information
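
The notion of feature interaction defined in the abstract can be illustrated with a small toy example. The sketch below is not the proposed algorithm: it substitutes plain Shannon mutual information on discrete values for the neighborhood mutual information the paper uses, and the data (two binary features with an XOR label) and the helper name `mutual_information` are assumptions made for illustration only. It shows that each feature alone carries no information about the label, while the pair together determines it.

```python
from collections import Counter
from itertools import product
from math import log2


def mutual_information(xs, ys):
    """Shannon mutual information I(X; Y), in bits, of two discrete sequences.

    Illustrative stand-in only; the paper uses neighborhood mutual information.
    """
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px = Counter(xs)
    py = Counter(ys)
    return sum(
        (c / n) * log2((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in joint.items()
    )


# Hypothetical toy data: all combinations of two binary features, label = XOR.
rows = list(product([0, 1], repeat=2))
f1 = [a for a, _ in rows]
f2 = [b for _, b in rows]
label = [a ^ b for a, b in rows]

mi_f1 = mutual_information(f1, label)                    # feature 1 alone
mi_f2 = mutual_information(f2, label)                    # feature 2 alone
mi_joint = mutual_information(list(zip(f1, f2)), label)  # both features together

print(mi_f1, mi_f2, mi_joint)  # prints: 0.0 0.0 1.0
```

A selector that scores features one at a time would discard both features here, which is the situation the abstract describes as ignoring interaction.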

摘要:

开放动态环境下的机器学习任务面临着数据特征空间的高维性和动态性。目前已有在线流特征选择算法基本仅考虑特征的重要性和冗余性,忽略了特征的交互性。特征交互是指那些本身与标签单独统计时呈现无关或弱相关,但与其他特征结合时却能与标签呈强相关的特征。基于此,提出一种基于邻域信息交互的在线流特征选择算法,该算法分为在线交互特征选择和在线冗余特征剔除两个阶段,即直接计算新到特征与整个已选特征子集的交互强弱程度,以及利用成对比较机制剔除冗余特征。在10个数据集上的实验结果表明了所提算法的有效性。

关键词: 特征选择, 流特征, 特征交互, 邻域互信息