Computer Engineering and Applications, 2023, Vol. 59, Issue (23): 125-135. DOI: 10.3778/j.issn.1002-8331.2305-0304

• Pattern Recognition and Artificial Intelligence •

Multiple Imputation-Revision Ensemble Classification Method for Incomplete Data with Neighborhood Information

ZHU Xianyuan, YAN Yuanting, ZHANG Yanping

  1. School of Information and Artificial Intelligence, Anhui Business College of Vocational Technology, Wuhu, Anhui 241002, China
    2. School of Computer Science and Technology, Anhui University, Hefei 230601, China
  • Online: 2023-12-01  Published: 2023-12-01

Multiple Imputation-Revision Ensemble Classification with Neighborhood Information

ZHU Xianyuan, YAN Yuanting, ZHANG Yanping   

  1. School of Information and Artificial Intelligence, Anhui Business College of Vocational Technology, Wuhu, Anhui 241002, China
    2.School of Computer Science and Technology, Anhui University, Hefei 230601, China
  • Online: 2023-12-01  Published: 2023-12-01

Abstract: Before an incomplete dataset can be classified, its missing values must first be imputed. Several classic missing value imputation algorithms already exist, such as mean imputation and K-nearest-neighbor imputation. Each has its own strengths, but their estimates of missing values are easily disturbed by other data that are only weakly related to the missing values, which degrades imputation quality and, in turn, subsequent classification performance. To address this problem, a multiple imputation-revision ensemble classification method for incomplete data based on neighborhood information is proposed. The method embeds an imputation-revision module to optimize the imputation process: neighborhood purity and neighborhood radius are used to screen the nearest-neighbor samples used for revision, and the missing values are re-imputed from these neighbor samples, further improving imputation accuracy. The method also combines the strengths of several classic imputation algorithms and exploits the diversity of the multiply imputed data, improving classification accuracy through ensemble learning. Experimental results show that the proposed method outperforms the compared methods in both imputation quality and classification accuracy on benchmark datasets, and it also achieves better classification accuracy on real-world incomplete datasets.
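To make the imputation-revision step concrete, the sketch below shows one plausible reading of it, assuming Euclidean distance, a fixed neighborhood radius, and purity defined as the fraction of in-radius neighbors sharing the sample's class; the function name revise_imputed_value, the thresholds, and the same-class mean used for re-imputation are illustrative assumptions rather than the paper's exact formulation.

```python
# Illustrative sketch (assumptions, not the paper's exact algorithm): after an
# initial imputation, an imputed value is revised only when the sample's
# in-radius neighborhood is pure enough, using same-class neighbors.
import numpy as np

def revise_imputed_value(X, y, i, j, radius=1.0, purity_threshold=0.8):
    """Revise the imputed value X[i, j] using purity-filtered neighbors.

    X : (n_samples, n_features) array after an initial imputation
    y : (n_samples,) class labels
    i, j : indices of the sample and feature whose imputed value is revised
    """
    # Distances from sample i to all samples (Euclidean metric is an assumption)
    d = np.linalg.norm(X - X[i], axis=1)
    in_radius = (d <= radius) & (np.arange(len(X)) != i)
    if not in_radius.any():
        return X[i, j]                      # no neighbors in radius: keep value

    # Neighborhood purity: fraction of in-radius neighbors sharing sample i's class
    purity = np.mean(y[in_radius] == y[i])
    if purity < purity_threshold:
        return X[i, j]                      # low purity: do not revise

    # Revised value: mean of feature j over same-class neighbors within the radius
    same_class = in_radius & (y == y[i])
    return X[same_class, j].mean()
```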

Key words: incomplete data classification, imputation-revision, neighborhood information, ensemble learning

Abstract: Missing value imputation is one of the important preprocessing techniques for incomplete data classification. Numerous missing value imputation methods have been proposed over the past decades. However, these methods are easily affected by data that are only weakly related to the missing values, leading to imprecise imputation results and degraded subsequent classification performance. To address this issue, this paper proposes an incomplete data classification method based on multiple imputation-revision ensemble with local information. The method incorporates an imputation-revision module that selects the neighbors of the sample to be revised according to neighborhood purity and neighborhood radius and re-imputes the missing values from them, resulting in better imputation accuracy. The method also integrates the strengths of multiple classic imputation algorithms and exploits the diversity of the multiple imputed datasets to enhance classification accuracy via ensemble learning. Experimental results demonstrate that this method outperforms the compared methods in terms of imputation accuracy and classification performance on benchmark datasets, and it also exhibits superior classification accuracy on real-world incomplete datasets.
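The multiple-imputation ensemble described above could look roughly like the following sketch, which pairs each of several classic imputers with its own base classifier and combines predictions by majority vote; the specific imputers (mean, most-frequent, KNN), the decision-tree base learner, and plain majority voting are assumptions made for illustration, not necessarily the configuration used in the paper.

```python
# Illustrative sketch (assumed configuration): each classic imputer produces a
# complete training set, one base classifier is trained per imputed set, and
# test predictions are combined by majority vote.
import numpy as np
from sklearn.impute import SimpleImputer, KNNImputer
from sklearn.tree import DecisionTreeClassifier

def fit_multi_imputation_ensemble(X_train, y_train):
    imputers = [
        SimpleImputer(strategy="mean"),          # mean imputation
        SimpleImputer(strategy="most_frequent"), # mode imputation
        KNNImputer(n_neighbors=5),               # K-nearest-neighbor imputation
    ]
    members = []
    for imp in imputers:
        Xi = imp.fit_transform(X_train)          # one complete dataset per imputer
        clf = DecisionTreeClassifier(random_state=0).fit(Xi, y_train)
        members.append((imp, clf))
    return members

def predict_majority(members, X_test):
    # Each member imputes the test set with its own imputer, then votes.
    # Labels are assumed to be non-negative integers for np.bincount.
    votes = np.stack([clf.predict(imp.transform(X_test)) for imp, clf in members])
    return np.apply_along_axis(
        lambda col: np.bincount(col.astype(int)).argmax(), 0, votes
    )
```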

Key words: incomplete data classification, imputation-revision, local information, ensemble learning