Classification Strategy of Imbalanced Data in Manufacturing Process Based on Improved SMOTE

doi:10.3778/j.issn.1002-8331.2110-0266

Abstract

Imbalanced data analysis is one of the key technologies of intelligent manufacturing, and the classification of imbalanced data has become a research hotspot in machine learning and data mining. Current oversampling strategies for imbalanced data tend to place synthetic samples at the margins of the data and require a separate noise-reduction step. To address these problems, this paper proposes an oversampling algorithm based on improved SMOTE (synthetic minority oversampling technique) and the local outlier factor (LOF). First, K-means clustering is performed on the entire data set and high-reliability samples are selected. These samples are then oversampled with the improved SMOTE algorithm. Finally, the LOF algorithm is used to delete synthetic samples with large errors. Experimental results on 4 UCI imbalanced data sets show that the method has a stronger classification ability for the minority class in imbalanced data and effectively overcomes the data marginalization problem. The algorithm is also applied to imbalanced data from phosphoric acid production, where it achieves accurate classification of that data.

Key words: imbalanced data, oversampling, local outlier factor (LOF), clustering, synthetic minority oversampling technique (SMOTE)
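
The abstract names three stages (K-means clustering, SMOTE oversampling of high-reliability samples, LOF filtering) but does not give the selection rule, the SMOTE modification, or the LOF cut-off. The following is therefore only a minimal sketch under stated assumptions, not the authors' implementation: "high-reliability" is taken to mean minority samples in clusters where the minority class makes up at least half of the members, plain SMOTE interpolation stands in for the improved variant, and synthetic samples are dropped when their LOF relative to the reliable minority samples exceeds a hypothetical threshold of 1.5. All function and parameter names are illustrative.

```python
import math
import random


def dist(a, b):
    """Euclidean distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def kmeans(points, k, iters=20):
    """Lloyd's algorithm with farthest-point initialisation; returns cluster labels."""
    centers = [list(points[0])]
    while len(centers) < k:
        centers.append(list(max(points, key=lambda p: min(dist(p, c) for c in centers))))
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: dist(p, centers[c])) for p in points]
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def select_reliable(X, y, k, minority=1):
    """Assumed rule: keep minority samples from clusters that are >= 50% minority."""
    labels = kmeans(X, k)
    keep = []
    for c in range(k):
        idx = [i for i, lab in enumerate(labels) if lab == c]
        n_min = sum(1 for i in idx if y[i] == minority)
        if idx and n_min / len(idx) >= 0.5:
            keep += [X[i] for i in idx if y[i] == minority]
    return keep


def smote(samples, n_new, k=3, rng=None):
    """Standard SMOTE: interpolate between a sample and one of its k nearest neighbours."""
    rng = rng or random.Random(0)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(samples)
        others = [s for s in samples if s is not base]
        neighbour = rng.choice(sorted(others, key=lambda s: dist(base, s))[:k])
        gap = rng.random()
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


def lof_scores(queries, ref, k=3):
    """Local outlier factor of each query point, measured against the points in ref."""
    def knn(p, skip=None):
        idx = [i for i in range(len(ref)) if i != skip]
        return sorted(idx, key=lambda i: dist(p, ref[i]))[:k]

    # k-distance of every reference point (distance to its k-th nearest neighbour)
    kdist = [dist(ref[i], ref[knn(ref[i], i)[-1]]) for i in range(len(ref))]

    def lrd(p, skip=None):
        # local reachability density: inverse of the mean reachability distance
        reach = [max(kdist[o], dist(p, ref[o])) for o in knn(p, skip)]
        return len(reach) / max(sum(reach), 1e-12)

    ref_lrd = [lrd(ref[i], i) for i in range(len(ref))]
    scores = []
    for q in queries:
        nbrs = knn(q)
        scores.append(sum(ref_lrd[o] for o in nbrs) / (len(nbrs) * lrd(q)))
    return scores


def oversample(X, y, n_new, k_clusters=2, lof_threshold=1.5, minority=1):
    """Cluster, oversample the reliable minority samples, then filter by LOF."""
    reliable = select_reliable(X, y, k_clusters, minority)
    if len(reliable) < 2:
        return []
    synthetic = smote(reliable, n_new)
    scores = lof_scores(synthetic, reliable)
    return [s for s, score in zip(synthetic, scores) if score <= lof_threshold]


# Toy data: 20 majority points near the origin, 6 minority points near (5, 5).
majority_pts = [[i * 0.5, j * 0.5] for i in range(4) for j in range(5)]
minority_pts = [[5.0, 5.0], [5.5, 5.0], [5.0, 5.5], [5.5, 5.5], [5.2, 5.3], [4.8, 5.1]]
X = majority_pts + minority_pts
y = [0] * len(majority_pts) + [1] * len(minority_pts)
new_samples = oversample(X, y, n_new=10)
print(len(new_samples), "synthetic samples kept")
```

Because each synthetic point is interpolated between two reliable minority samples, it stays inside the region those samples span. The LOF step then removes any point whose local density is much lower than that of its neighbours. In practice the cluster count, neighbour count, and threshold would need tuning for each data set.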

LI Xu, CHEN Jiadui, WU Yongming, ZONG Wenze. Classification Strategy of Imbalanced Data in Manufacturing Process Based on Improved SMOTE[J]. Computer Engineering and Applications, 2022, 58(16): 284-291.
