Computer Engineering and Applications ›› 2022, Vol. 58 ›› Issue (1): 106-112. DOI: 10.3778/j.issn.1002-8331.2012-0152

• Big Data and Cloud Computing •

Adaptive Interpolation and Feature Compression for Small Sample Data Classification Study

SUN Yongming (孙永明), YANG Jin (杨进)

  1. School of Science, University of Shanghai for Science and Technology, Shanghai 200093, China
  • Online: 2022-01-01 Published: 2022-01-06

Abstract: Class imbalance and dimensionality explosion in big data severely degrade the prediction efficiency and classification accuracy of algorithms. To address this, a classification method based on interpolation and feature compression, ASE-RFXT, is proposed. First, the interpolation center of ADASYN (adaptive synthetic sampling approach) is improved, which reduces the noise introduced and improves the distribution of minority-class samples. Second, the ReliefF feature-weighting method is improved and combined with the ensemble algorithm XGDT (extreme gradient dart tree) to weight features in parallel, which reduces the influence of outliers on the weights and makes the feature evaluation more accurate. Finally, low-weight redundant features are filtered out using the correlation between features, and the remaining features are compressed by SFS (sequential forward selection), with the classification accuracy of XGDT as the evaluation criterion. Experimental results show that ASE-RFXT reduces the feature dimensionality, saves training time, and improves the classification accuracy on imbalanced small-sample data.

Key words: extreme gradient boosting, feature selection, adaptive sampling, feature weighting
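
The oversampling and feature-compression stages named in the abstract can be illustrated with a minimal Python sketch. It rests on several assumptions that do not come from the paper: `adasyn_like` follows the standard ADASYN idea (minority points with more majority-class neighbours seed more synthetic samples) and does not reproduce the improved interpolation center; leave-one-out 1-NN accuracy stands in for the XGDT classification accuracy used as the SFS criterion; the ReliefF/XGDT weighting and the correlation filter are omitted. All function names are hypothetical.

```python
import math
import random


def _dist(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def adasyn_like(minority, majority, n_new, k=3, seed=0):
    """Standard ADASYN-style oversampling (NOT the paper's improved variant).

    Requires at least two minority points. Base points are drawn in
    proportion to the share of majority points among their k neighbours.
    """
    rng = random.Random(seed)
    labelled = [(p, 1) for p in minority] + [(p, 0) for p in majority]
    ratios = []
    for p in minority:
        neighbours = sorted((q for q in labelled if q[0] is not p),
                            key=lambda q: _dist(p, q[0]))[:k]
        ratios.append(sum(1 for _, lab in neighbours if lab == 0) / len(neighbours))
    total = sum(ratios)
    # Fall back to uniform weights when no minority point has majority neighbours.
    if total > 0:
        weights = [r / total for r in ratios]
    else:
        weights = [1 / len(minority)] * len(minority)
    synthetic = []
    for _ in range(n_new):
        base = rng.choices(minority, weights=weights, k=1)[0]
        mates = sorted((q for q in minority if q is not base),
                       key=lambda q: _dist(base, q))[:k]
        mate = rng.choice(mates)
        gap = rng.random()
        # Interpolate on the segment between the base point and a minority neighbour.
        synthetic.append([b + gap * (m - b) for b, m in zip(base, mate)])
    return synthetic


def loo_1nn_accuracy(X, y, features):
    """Leave-one-out 1-NN accuracy on the chosen feature indices.

    Stand-in evaluator (assumption); the paper scores subsets with XGDT.
    """
    hits = 0
    for i, xi in enumerate(X):
        best = min((j for j in range(len(X)) if j != i),
                   key=lambda j: sum((xi[f] - X[j][f]) ** 2 for f in features))
        hits += y[best] == y[i]
    return hits / len(X)


def sequential_forward_selection(X, y, score=loo_1nn_accuracy):
    """Greedily add the feature that most improves the score; stop when none helps."""
    selected, best_score = [], 0.0
    remaining = list(range(len(X[0])))
    while remaining:
        gains = [(score(X, y, selected + [f]), f) for f in remaining]
        top_score, top_f = max(gains, key=lambda g: g[0])
        if top_score <= best_score:
            break
        selected.append(top_f)
        remaining.remove(top_f)
        best_score = top_score
    return selected, best_score
```

In practice one would call `adasyn_like` on the training split only, append the synthetic points to the minority class, and then run `sequential_forward_selection` on the rebalanced data, passing whichever scoring function matches the classifier actually being used.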
