Computer Engineering and Applications ›› 2022, Vol. 58 ›› Issue (19): 166-173. DOI: 10.3778/j.issn.1002-8331.2112-0079

• Pattern Recognition and Artificial Intelligence •

融合拟单层覆盖粗集的集值数据平衡方法研究

吴正江,杨天,郑爱玲,梅秋雨,张亚宁   

  1. 河南理工大学 计算机科学与技术学院,河南 焦作 454003
  • 出版日期:2022-10-01 发布日期:2022-10-01

Study on Set-Valued Data Balancing Method by Semi-Monolayer Covering Rough Set

WU Zhengjiang, YANG Tian, ZHENG Ailing, MEI Qiuyu, ZHANG Yaning   

  1. School of Computer Science and Technology, Henan Polytechnic University, Jiaozuo, Henan 454003, China
  • Online: 2022-10-01 Published: 2022-10-01

摘要: 如今不平衡数据存在生活中各个领域,如何有效地对其分类已经成为研究的热点。传统的过采样与欠采样方法虽然能保证数据的平衡性,但无法克服因数据分布和噪声对数据的分类造成的影响。为了降低数据分布与噪声在集值信息系统中对不平衡数据分类的影响,提出了一种基于拟单层覆盖粗集的过采样与欠采样相结合的模型。通过拟单层覆盖粗集[DA0]与[DE0]下近似将数据主要划分为两个部分,将属于下近似集的部分用BorderlineSMOTE进行过采样,将不属于下近似集的部分用ClusterCentroids进行欠采样,最终将二者合并即为最终数据集。拟单层覆盖粗集是适用于集值信息系统的高近似质量、快速计算的模型,高近似质量可以使其保留尽可能多的可靠数据来保证模型的泛化能力。通过混合处理方式,不仅能够降低噪声数据对BorderlineSMOTE的影响,还能通过ClusterCentroids极大程度地保留被过滤数据的信息完整性。通过相关对比实验,采用ExtraTree、DecisionTree、FGCNN等方法,验证了该模型的有效性。

关键词: 拟单层覆盖粗集, 不平衡数据, 近似集, 混合处理, 过采样, 欠采样

Abstract: Imbalanced data arise in many areas of life, and classifying them effectively has become an active research topic. Traditional over-sampling and under-sampling methods can balance the data, but they cannot overcome the effects of data distribution and noise on classification. To reduce the influence of data distribution and noise on the classification of imbalanced data in set-valued information systems, a model combining over-sampling and under-sampling based on the semi-monolayer covering rough set is proposed. The data are divided into two main parts using the [DA0] and [DE0] lower approximations of the semi-monolayer covering rough set: the part belonging to the lower approximation set is over-sampled with BorderlineSMOTE, the part not belonging to the lower approximation set is under-sampled with ClusterCentroids, and the two parts are then merged to form the final data set. The semi-monolayer covering rough set is a fast-to-compute model with high approximation quality that is suitable for set-valued information systems; the high approximation quality allows it to retain as much reliable data as possible, which helps preserve the generalization capability of the model. The hybrid approach not only reduces the impact of noisy data on BorderlineSMOTE but also largely preserves the information in the filtered-out data through ClusterCentroids. Comparative experiments using ExtraTree, DecisionTree, and FGCNN verify the effectiveness of the model.
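The split-then-resample pipeline described in the abstract can be illustrated with a minimal Python sketch. It makes several assumptions that are not taken from the paper: the semi-monolayer covering rough set is replaced by a plain tolerance relation for set-valued data (two objects are tolerant when their value sets overlap on every attribute), an object counts as "in the lower approximation" when every object tolerant with it carries the same label, and the names `tolerant`, `lower_approximation_mask`, and `hybrid_balance` are hypothetical. The `oversample` and `undersample` arguments are placeholders; in practice they would wrap BorderlineSMOTE and ClusterCentroids after the set-valued attributes have been encoded numerically.

```python
from typing import Callable, FrozenSet, List, Sequence, Tuple

# One set of admissible values per attribute (a set-valued object).
Obj = Tuple[FrozenSet[str], ...]
Labeled = List[Tuple[Obj, int]]


def tolerant(a: Obj, b: Obj) -> bool:
    """Assumed stand-in relation: value sets overlap on every attribute."""
    return all(x & y for x, y in zip(a, b))


def lower_approximation_mask(X: Sequence[Obj], y: Sequence[int]) -> List[bool]:
    """mask[i] is True when every object tolerant with X[i] shares its label."""
    return [
        all(y[j] == y[i] for j, xj in enumerate(X) if tolerant(xi, xj))
        for i, xi in enumerate(X)
    ]


def hybrid_balance(
    X: Sequence[Obj],
    y: Sequence[int],
    oversample: Callable[[Labeled], Labeled],
    undersample: Callable[[Labeled], Labeled],
) -> Labeled:
    """Oversample the consistent part, undersample the rest, then merge."""
    mask = lower_approximation_mask(X, y)
    inside = [(x, label) for x, label, m in zip(X, y, mask) if m]
    outside = [(x, label) for x, label, m in zip(X, y, mask) if not m]
    return oversample(inside) + undersample(outside)


def fs(*values: str) -> FrozenSet[str]:
    """Small helper to build a frozen value set."""
    return frozenset(values)


# Toy set-valued data: two attributes, binary labels (illustrative only).
X: List[Obj] = [
    (fs("en"), fs("a")),
    (fs("en", "fr"), fs("a")),
    (fs("fr"), fs("a", "b")),
    (fs("de"), fs("b")),
    (fs("de"), fs("b", "c")),
]
y: List[int] = [0, 0, 1, 1, 1]

mask = lower_approximation_mask(X, y)

# Identity resamplers keep the sketch dependency-free; real resamplers
# would change the number of samples in each part.
balanced = hybrid_balance(X, y, oversample=lambda d: d, undersample=lambda d: d)
```

In this toy data the second and third objects are tolerant with each other but carry different labels, so both fall outside the lower approximation and would be routed to the under-sampling branch; the remaining three objects go to the over-sampling branch.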

Key words: semi-monolayer covering rough set, imbalanced data, approximation set, hybrid approach, over-sampling, under-sampling