Computer Engineering and Applications, 2023, Vol. 59, Issue (23): 125-135. DOI: 10.3778/j.issn.1002-8331.2305-0304

• Pattern Recognition and Artificial Intelligence •

Multiple Imputation-Revision Ensemble Classification Method for Incomplete Data with Neighborhood Information

ZHU Xianyuan, YAN Yuanting, ZHANG Yanping

  1. School of Information and Artificial Intelligence, Anhui Business College of Vocational Technology, Wuhu, Anhui 241002, China
    2. School of Computer Science and Technology, Anhui University, Hefei 230601, China
  • Online: 2023-12-01  Published: 2023-12-01

Multiple Imputation-Revision Ensemble Classification with Neighborhood Information

ZHU Xianyuan, YAN Yuanting, ZHANG Yanping   

  1. School of Information and Artificial Intelligence, Anhui Business College of Vocational Technology, Wuhu, Anhui 241002, China
    2.School of Computer Science and Technology, Anhui University, Hefei 230601, China
  • Online: 2023-12-01  Published: 2023-12-01

Abstract: Before an incomplete dataset can be classified, its missing values must first be imputed. Several classic missing value imputation algorithms already exist, such as mean imputation and K-nearest-neighbor imputation. Each has its own strengths, but their estimates of missing values are easily disturbed by other data that are only weakly related to the missing values, which degrades imputation quality and, in turn, subsequent classification performance. To address this problem, a multiple imputation-revision ensemble classification method for incomplete data based on neighborhood information is proposed. The method embeds an imputation-revision module to optimize the imputation process: neighborhood purity and neighborhood radius are used to screen the nearest-neighbor samples used for revision, and the missing values are re-imputed from these neighbor samples, further improving imputation accuracy. The method also combines the strengths of several classic imputation algorithms and exploits the diversity of the multiply imputed data, improving classification accuracy through ensemble learning. Experimental results show that the proposed method outperforms the compared methods in both imputation quality and classification accuracy on benchmark datasets, and it also achieves better classification accuracy on real-world incomplete datasets.
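To make the imputation-revision step concrete, the sketch below shows one plausible reading of it, assuming Euclidean distance, a fixed neighborhood radius, and purity defined as the fraction of in-radius neighbors sharing the sample's class; the function name revise_imputed_value, the thresholds, and the same-class mean used for re-imputation are illustrative assumptions rather than the paper's exact formulation.

```python
# Illustrative sketch (assumptions, not the paper's exact algorithm): after an
# initial imputation, an imputed value is revised only when the sample's
# in-radius neighborhood is pure enough, using same-class neighbors.
import numpy as np

def revise_imputed_value(X, y, i, j, radius=1.0, purity_threshold=0.8):
    """Revise the imputed value X[i, j] using purity-filtered neighbors.

    X : (n_samples, n_features) array after an initial imputation
    y : (n_samples,) class labels
    i, j : indices of the sample and feature whose imputed value is revised
    """
    # Distances from sample i to all samples (Euclidean metric is an assumption)
    d = np.linalg.norm(X - X[i], axis=1)
    in_radius = (d <= radius) & (np.arange(len(X)) != i)
    if not in_radius.any():
        return X[i, j]                      # no neighbors in radius: keep value

    # Neighborhood purity: fraction of in-radius neighbors sharing sample i's class
    purity = np.mean(y[in_radius] == y[i])
    if purity < purity_threshold:
        return X[i, j]                      # low purity: do not revise

    # Revised value: mean of feature j over same-class neighbors within the radius
    same_class = in_radius & (y == y[i])
    return X[same_class, j].mean()
```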

Key words: incomplete data classification, imputation-revision, neighborhood information, ensemble learning

Abstract: Missing value imputation is one of the important preprocessing techniques for incomplete data classification. Numerous missing value imputation methods have been proposed over the past decades. However, these methods are easily affected by data that are only weakly related to the missing values, leading to imprecise imputation results and degraded subsequent classification performance. To address this issue, this paper proposes an incomplete data classification method based on multiple imputation-revision ensemble with local information. The method incorporates an imputation-revision module that selects the neighbors of the sample to be revised according to neighborhood purity and neighborhood radius and re-imputes the missing values from them, resulting in better imputation accuracy. The method also integrates the strengths of multiple classic imputation algorithms and exploits the diversity of the multiple imputed datasets to enhance classification accuracy via ensemble learning. Experimental results demonstrate that this method outperforms the compared methods in terms of imputation accuracy and classification performance on benchmark datasets, and it also exhibits superior classification accuracy on real-world incomplete datasets.
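The multiple-imputation ensemble described above could look roughly like the following sketch, which pairs each of several classic imputers with its own base classifier and combines predictions by majority vote; the specific imputers (mean, most-frequent, KNN), the decision-tree base learner, and plain majority voting are assumptions made for illustration, not necessarily the configuration used in the paper.

```python
# Illustrative sketch (assumed configuration): each classic imputer produces a
# complete training set, one base classifier is trained per imputed set, and
# test predictions are combined by majority vote.
import numpy as np
from sklearn.impute import SimpleImputer, KNNImputer
from sklearn.tree import DecisionTreeClassifier

def fit_multi_imputation_ensemble(X_train, y_train):
    imputers = [
        SimpleImputer(strategy="mean"),          # mean imputation
        SimpleImputer(strategy="most_frequent"), # mode imputation
        KNNImputer(n_neighbors=5),               # K-nearest-neighbor imputation
    ]
    members = []
    for imp in imputers:
        Xi = imp.fit_transform(X_train)          # one complete dataset per imputer
        clf = DecisionTreeClassifier(random_state=0).fit(Xi, y_train)
        members.append((imp, clf))
    return members

def predict_majority(members, X_test):
    # Each member imputes the test set with its own imputer, then votes.
    # Labels are assumed to be non-negative integers for np.bincount.
    votes = np.stack([clf.predict(imp.transform(X_test)) for imp, clf in members])
    return np.apply_along_axis(
        lambda col: np.bincount(col.astype(int)).argmax(), 0, votes
    )
```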

Key words: incomplete data classification, imputation-revision, local information, ensemble learning