Computer Engineering and Applications ›› 2022, Vol. 58 ›› Issue (16): 284-291. DOI: 10.3778/j.issn.1002-8331.2110-0266

• Engineering and Applications •

Classification Strategy of Imbalanced Data in Manufacturing Process Based on Improved SMOTE

LI Xu, CHEN Jiadui, WU Yongming, ZONG Wenze

  1. Key Laboratory of Advanced Manufacturing Technology of Ministry of Education, Guizhou University, Guiyang 550025, China
  2. College of Mechanical Engineering, Guizhou University, Guiyang 550025, China
  3. State Key Laboratory of Public Big Data, Guizhou University, Guiyang 550025, China
  • Online: 2022-08-15 Published: 2022-08-15

Abstract: Imbalanced data analysis is one of the key technologies of intelligent manufacturing, and the classification of imbalanced data has become a research hotspot in machine learning and data mining. Current oversampling strategies for imbalanced data tend to generate synthetic samples that are marginalized, and they require an additional noise reduction step. To address these problems, this paper proposes an oversampling algorithm based on improved SMOTE (synthetic minority oversampling technique) and LOF (local outlier factor). First, K-means clustering is performed on the entire data set, and high-reliability samples are selected for oversampling with the improved SMOTE algorithm. Then, the LOF algorithm is used to delete synthetic samples with large errors. Experimental results on 4 UCI imbalanced data sets show that the method has a stronger ability to classify the minority class in imbalanced data and effectively overcomes the data marginalization problem. The algorithm is also applied to imbalanced data from phosphoric acid production, where it achieves accurate classification.

Key words: imbalanced data, over-sampling, local outlier factor, clustering, synthetic minority oversampling technique (SMOTE)
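
The three stages named in the abstract (K-means clustering of the whole data set, oversampling from selected samples, LOF filtering of synthetic samples) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' algorithm: the abstract does not define the "high-reliability" criterion or the SMOTE improvement, so the sketch uses a hypothetical rule (minority samples in clusters where the minority share reaches `min_minority_ratio`), plain SMOTE-style interpolation, and an LOF model fitted on the real minority samples. The function name `oversample_sketch` and all of its parameters are invented for this example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import LocalOutlierFactor, NearestNeighbors


def oversample_sketch(X, y, minority_label=1, n_clusters=3,
                      min_minority_ratio=0.5, k_neighbors=5,
                      n_synthetic=None, random_state=0):
    """Cluster, interpolate from 'reliable' minority samples, then LOF-filter.

    The reliability rule and every threshold here are illustrative assumptions.
    """
    rng = np.random.default_rng(random_state)
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    is_min = y == minority_label
    if n_synthetic is None:
        # Default: request enough samples to balance the two classes.
        n_synthetic = int((~is_min).sum() - is_min.sum())

    # Step 1: K-means on the entire data set (both classes together).
    clusters = KMeans(n_clusters=n_clusters, n_init=10,
                      random_state=random_state).fit_predict(X)

    # Assumed reliability rule: keep minority samples that fall in clusters
    # whose minority share is at least min_minority_ratio.
    reliable = np.zeros(len(X), dtype=bool)
    for c in range(n_clusters):
        in_c = clusters == c
        if in_c.any() and is_min[in_c].mean() >= min_minority_ratio:
            reliable |= in_c & is_min
    seeds = X[reliable]
    if len(seeds) < 2 or n_synthetic <= 0:
        return X, y  # nothing to interpolate from

    # Step 2: plain SMOTE-style interpolation (not the paper's improved
    # variant): move a random fraction of the way toward a random neighbor.
    k = min(k_neighbors, len(seeds) - 1)
    nn = NearestNeighbors(n_neighbors=k).fit(seeds)
    neigh = nn.kneighbors(return_distance=False)  # excludes each point itself
    base = rng.integers(0, len(seeds), size=n_synthetic)
    pick = neigh[base, rng.integers(0, k, size=n_synthetic)]
    gap = rng.random((n_synthetic, 1))
    synthetic = seeds[base] + gap * (seeds[pick] - seeds[base])

    # Step 3 (assumption): fit LOF on the real minority samples in novelty
    # mode and drop synthetic points that it flags as outliers (-1).
    real_min = X[is_min]
    lof = LocalOutlierFactor(n_neighbors=min(20, len(real_min) - 1),
                             novelty=True)
    lof.fit(real_min)
    synthetic = synthetic[lof.predict(synthetic) == 1]

    X_new = np.vstack([X, synthetic])
    y_new = np.concatenate(
        [y, np.full(len(synthetic), minority_label, dtype=y.dtype)])
    return X_new, y_new
```

Calling `oversample_sketch(X, y)` returns the original samples followed by the retained synthetic minority samples. Because LOF may reject some candidates, the result is not guaranteed to be exactly balanced.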