Adaptive Interpolation and Feature Compression for Small Sample Data Classification Study

doi:10.3778/j.issn.1002-8331.2012-0152

Abstract

Class imbalance and the explosion of dimensionality in big data seriously degrade the prediction efficiency and classification accuracy of algorithms. To address this, a classification method based on interpolation and feature compression, ASE-RFXT, is proposed. First, the interpolation center of ADASYN (adaptive synthetic sampling approach) is improved, which reduces the noise introduced during oversampling and improves the distribution of minority-class samples. Second, ReliefF is improved and combined with the ensemble algorithm XGDT (extreme gradient dart tree) to weight features in parallel, which reduces the influence of outliers on the weights and makes the feature evaluation more accurate. Finally, low-weight redundant features are filtered out using the correlation between features, and the remaining features are compressed by SFS (sequential forward selection), with the classification accuracy of XGDT as the evaluation index. Experimental results show that ASE-RFXT reduces the feature dimensionality, saves training time, and improves the classification accuracy on imbalanced small-sample data.

Key words: extreme gradient boosting, feature selection, adaptive sampling, feature weighting
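
The oversampling step described in the abstract builds on ADASYN. As a rough illustration only, the following Python sketch implements plain ADASYN-style oversampling: minority samples surrounded by more majority-class neighbours receive more synthetic points, and each synthetic point is interpolated between a minority sample and one of its minority-class neighbours. It does not reproduce the improved interpolation center of ASE-RFXT, whose details are not given here. The function name `adasyn_sketch`, its parameters (`k`, `beta`, `seed`), the toy data, and the even-weight fallback are assumptions made for this example.

```python
import math
import random


def adasyn_sketch(X, y, minority=1, k=3, beta=1.0, seed=0):
    """Return synthetic minority samples in the style of plain ADASYN.

    Hypothetical helper for illustration; not the ASE-RFXT variant.
    """
    rng = random.Random(seed)
    minority_idx = [i for i, label in enumerate(y) if label == minority]
    n_min = len(minority_idx)
    n_maj = len(y) - n_min
    # Number of synthetic samples needed to reach the balance level `beta`.
    total = int(round((n_maj - n_min) * beta))
    if total <= 0 or n_min < 2:
        return []

    def nearest(i, pool, count):
        # Indices of the `count` points in `pool` closest to X[i], excluding i.
        others = [j for j in pool if j != i]
        others.sort(key=lambda j: math.dist(X[i], X[j]))
        return others[:count]

    # r_i: share of majority-class points among the k nearest neighbours.
    ratios = []
    for i in minority_idx:
        neigh = nearest(i, range(len(y)), k)
        ratios.append(sum(y[j] != minority for j in neigh) / len(neigh))

    ratio_sum = sum(ratios)
    if ratio_sum == 0:
        # Assumption: if no minority point has majority neighbours,
        # spread the synthetic samples evenly.
        weights = [1.0 / n_min] * n_min
    else:
        weights = [r / ratio_sum for r in ratios]

    synthetic = []
    for i, w in zip(minority_idx, weights):
        g_i = int(round(w * total))  # samples to generate around X[i]
        min_neigh = nearest(i, minority_idx, k)
        for _ in range(g_i):
            j = rng.choice(min_neigh)
            lam = rng.random()  # interpolation coefficient in [0, 1)
            synthetic.append([a + lam * (b - a) for a, b in zip(X[i], X[j])])
    return synthetic


# Toy data: three minority points (label 1) near the origin, six majority points.
X = [[0.0, 0.0], [0.2, 0.1], [0.1, 0.3],
     [1.0, 1.0], [1.1, 0.9], [0.9, 1.2],
     [1.2, 1.1], [0.3, 0.2], [1.0, 0.8]]
y = [1, 1, 1, 0, 0, 0, 0, 0, 0]

synthetic = adasyn_sketch(X, y, minority=1, k=3, beta=1.0, seed=0)
print(len(synthetic), "synthetic minority samples generated")
```

Because each per-sample count is rounded, the number of generated samples can differ slightly from the target on other data sets. Interpolating only toward minority-class neighbours keeps the synthetic points inside the region spanned by existing minority samples, which is the behaviour the abstract's improved interpolation center aims to refine further.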

SUN Yongming, YANG Jin. Adaptive Interpolation and Feature Compression for Small Sample Data Classification Study[J]. Computer Engineering and Applications, 2022, 58(1): 106-112.

孙永明, 杨进. 自适应插值与特征压缩的小样本数据分类研究[J]. 计算机工程与应用, 2022, 58(1): 106-112.
