Study on Set-Valued Data Balancing Method by Semi-Monolayer Covering Rough Set

doi:10.3778/j.issn.1002-8331.2112-0079

Abstract

Abstract: Imbalanced data arise in many application domains, and classifying them effectively has become an active research topic. Traditional over-sampling and under-sampling methods can balance the class distribution, but they cannot overcome the effects that data distribution and noise have on classification. To reduce the influence of data distribution and noise on the classification of imbalanced data in set-valued information systems, a model that combines over-sampling and under-sampling based on the semi-monolayer covering rough set is proposed. The [DA0] and [DE0] lower approximations of the semi-monolayer covering rough set divide the data into two main parts: the part belonging to the lower approximation set is over-sampled with BorderlineSMOTE, the part not belonging to the lower approximation set is under-sampled with ClusterCentroids, and the two results are merged to form the final data set. The semi-monolayer covering rough set is a fast-to-compute model with high approximation quality that is suitable for set-valued information systems. Its high approximation quality lets it retain as much reliable data as possible, which supports the generalization capability of the model. The hybrid approach not only reduces the impact of noisy data on BorderlineSMOTE but also, through ClusterCentroids, largely preserves the information carried by the filtered-out data. The effectiveness of the model is verified through comparative experiments using ExtraTree, DecisionTree, and FGCNN.

Key words: semi-monolayer covering rough set, imbalanced data, approximation set, hybrid approach, over-sampling, under-sampling
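The split-then-resample pipeline described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not define the [DA0] and [DE0] operators, so the classical tolerance-relation lower approximation for set-valued data (two objects are tolerant when their value sets overlap on every attribute) is used here as a stand-in. The names `tolerant`, `split_by_lower_approximation`, and `balance` are hypothetical, and the two samplers are passed in as plain callables rather than tied to a specific library.

```python
from typing import Callable, FrozenSet, List, Sequence, Tuple

Row = Tuple[FrozenSet[str], ...]   # one set of values per attribute
Sample = Tuple[Row, int]           # (set-valued features, class label)
Sampler = Callable[[List[Sample]], List[Sample]]


def tolerant(a: Row, b: Row) -> bool:
    """Assumed tolerance relation: value sets overlap on every attribute."""
    return all(x & y for x, y in zip(a, b))


def split_by_lower_approximation(
    data: Sequence[Sample],
) -> Tuple[List[Sample], List[Sample]]:
    """Split samples into (inside, outside) the lower approximation of their class."""
    inside: List[Sample] = []
    outside: List[Sample] = []
    for row, label in data:
        # Labels of every object that is tolerant with `row` (including itself).
        labels = [lab for other, lab in data if tolerant(row, other)]
        # Stand-in lower approximation: all tolerant objects share the label.
        if all(lab == label for lab in labels):
            inside.append((row, label))
        else:
            outside.append((row, label))
    return inside, outside


def balance(data: Sequence[Sample], oversample: Sampler, undersample: Sampler) -> List[Sample]:
    """Over-sample the reliable part, under-sample the rest, then merge."""
    inside, outside = split_by_lower_approximation(data)
    return oversample(inside) + undersample(outside)


def fs(*values: str) -> FrozenSet[str]:
    return frozenset(values)


# Toy set-valued data: two attributes, binary class label.
data: List[Sample] = [
    ((fs("en"), fs("a")), 0),
    ((fs("en", "fr"), fs("a")), 0),
    ((fs("de"), fs("b")), 1),
    ((fs("de", "fr"), fs("a", "b")), 0),
]

inside, outside = split_by_lower_approximation(data)
# Identity samplers keep the sketch dependency-free.
merged = balance(data, oversample=lambda part: part, undersample=lambda part: part)
```

In the toy data, the third and fourth objects are tolerant with each other but carry different labels, so both fall outside the lower approximation. In practice the `oversample` and `undersample` callables would wrap BorderlineSMOTE and ClusterCentroids applied to a numeric encoding of the set-valued attributes; that encoding is not specified in the abstract and is left out here.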

WU Zhengjiang, YANG Tian, ZHENG Ailing, MEI Qiuyu, ZHANG Yaning. Study on Set-Valued Data Balancing Method by Semi-Monolayer Covering Rough Set[J]. Computer Engineering and Applications, 2022, 58(19): 166-173.
