Computer Engineering and Applications, 2021, Vol. 57, Issue (22): 199-207. DOI: 10.3778/j.issn.1002-8331.2105-0510

• Pattern Recognition and Artificial Intelligence •

High-Dimensional Data Feature Selection Algorithm Based on Multifactor Particle Swarm Optimization

LIN Weixing (林炜星), WANG Yujia (王宇嘉), CHEN Wanfen (陈万芬), LIANG Haina (梁海娜)

  1. School of Electronic and Electrical Engineering, Shanghai University of Engineering Science, Shanghai 201620, China
  • Online: 2021-11-15 Published: 2021-11-16

Abstract:

Feature selection is an important data preprocessing technique in machine learning and data mining. It aims to maximize classification accuracy while minimizing the number of features in the selected subset. When particle swarm optimization is used to search for the optimal feature subset in a high-dimensional dataset, it tends to become trapped in local optima and incurs a high computational cost, which lowers classification accuracy. To address this problem, a feature selection algorithm for high-dimensional data based on multifactor particle swarm optimization is proposed. First, an evolutionary multitasking framework is introduced together with a strategy for generating a two-task model; knowledge transfer between the tasks strengthens communication within the population and increases population diversity, which mitigates the tendency to become trapped in local optima. Second, an initialization strategy based on sparse representation is designed: sparse initial solutions are constructed at the start of the algorithm, which reduces the computational cost of the population as it converges toward the optimal solution set. Experimental results on six public high-dimensional medical datasets show that the proposed algorithm performs the classification task effectively and achieves good accuracy.

Key words: high-dimensional data, feature selection, evolutionary multitasking, particle swarm optimization (PSO)
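
The abstract names two ideas that a short example may make more concrete: sparse initial solutions, and an objective that trades classification accuracy against subset size. The Python sketch below is illustrative only and is not the authors' implementation. The abstract does not specify the particle encoding, the selection threshold, the initial sparsity level, or the form of the fitness function, so the continuous encoding in [0, 1], the 0.5 threshold, `active_ratio`, `alpha`, and all function names are assumptions made for this example.

```python
import random


def sparse_init(n_particles, n_features, active_ratio=0.05, seed=0):
    """Build a swarm whose particles each select only a few features.

    Assumption: a position value >= 0.5 means "feature selected".
    """
    rng = random.Random(seed)
    n_active = max(1, int(active_ratio * n_features))
    swarm = []
    for _ in range(n_particles):
        # Start every feature below the threshold (unselected).
        position = [rng.uniform(0.0, 0.5) for _ in range(n_features)]
        # Lift a small random subset to or above the threshold (selected).
        for j in rng.sample(range(n_features), n_active):
            position[j] = rng.uniform(0.5, 1.0)
        swarm.append(position)
    return swarm


def decode(position, threshold=0.5):
    """Return the indices of the features a particle selects."""
    return [j for j, v in enumerate(position) if v >= threshold]


def fitness(error_rate, n_selected, n_features, alpha=0.9):
    """Value to minimize: weighted error rate plus relative subset size.

    Assumption: a weighted sum; error_rate would come from any classifier
    evaluated on the selected features.
    """
    return alpha * error_rate + (1.0 - alpha) * n_selected / n_features


if __name__ == "__main__":
    swarm = sparse_init(n_particles=10, n_features=200)
    print(len(decode(swarm[0])), "of 200 features selected by particle 0")
```

Starting from sparse particles means each early fitness evaluation trains a classifier on only a handful of features, which is one plausible reading of how a sparse initialization lowers computational cost on datasets with thousands of features.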