Computer Engineering and Applications ›› 2013, Vol. 49 ›› Issue (2): 184-187.

• Database, Data Mining, Machine Learning •

Research on classification for imbalanced dataset based on improved SMOTE

WANG Chaoxue¹, PAN Zhengmao¹, DONG Lili¹, MA Chunsen², ZHANG Xing¹

  1. School of Information and Control Engineering, Xi’an University of Architecture and Technology, Xi’an 710055, China
  2. Institute of Plant Protection, Chinese Academy of Agricultural Sciences, Beijing 100193, China
  • Online: 2013-01-15 Published: 2013-01-16

Abstract: To address the shortcomings of SMOTE (Synthetic Minority Over-sampling Technique) when synthesizing new minority-class samples, an improved SMOTE algorithm (SSMOTE) is proposed. The key idea of SSMOTE is to introduce the concept of support and roulette wheel selection into SMOTE and to make full use of the distribution information of heterogeneous (other-class) nearest neighbors, giving fine control over both the quality and the quantity of the synthesized minority-class samples. SSMOTE is combined with the KNN (K-Nearest Neighbor) algorithm to handle classification on imbalanced datasets. Extensive comparative experiments on UCI datasets against related algorithms from the literature show that SSMOTE performs well in the overall quality of the synthesized samples and effectively improves the classification performance of KNN on imbalanced datasets.

Key words: imbalanced datasets, classification, support, roulette wheel selection, Synthetic Minority Over-sampling Technique (SMOTE)
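
The abstract names the ingredients of SSMOTE but not its formulas, so the following is only a minimal sketch of how the pieces could fit together, not the authors' implementation. It assumes a hypothetical support score (the share of same-class points among a minority sample's k nearest neighbors, drawn from both classes), uses roulette wheel selection to pick seed samples in proportion to that score, and then applies the standard SMOTE interpolation between a seed and one of its minority-class neighbors. The names `support_weights`, `roulette_pick`, and `oversample` and the parameters `k` and `seed` are illustrative assumptions.

```python
import math
import random


def support_weights(minority, majority, k):
    """ASSUMED support score (not the paper's definition): the share of
    minority points among the k nearest neighbours of each minority sample,
    plus a small floor so every sample keeps a non-zero chance."""
    weights = []
    for i, x in enumerate(minority):
        # Label every other point: 1 = minority (same class), 0 = majority.
        others = [(m, 1) for j, m in enumerate(minority) if j != i]
        others += [(m, 0) for m in majority]
        nearest = sorted(others, key=lambda c: math.dist(x, c[0]))[:k]
        same_class = sum(label for _, label in nearest)
        weights.append(same_class / max(len(nearest), 1) + 1e-3)
    return weights


def roulette_pick(weights, rng):
    """Roulette wheel selection: index i is drawn with probability
    weights[i] / sum(weights)."""
    spin = rng.random() * sum(weights)
    cumulative = 0.0
    for i, w in enumerate(weights):
        cumulative += w
        if spin < cumulative:
            return i
    return len(weights) - 1  # guard against floating-point round-off


def oversample(minority, majority, n_new, k=3, seed=0):
    """Create n_new synthetic minority samples (needs >= 2 minority points)."""
    rng = random.Random(seed)
    weights = support_weights(minority, majority, k)
    synthetic = []
    for _ in range(n_new):
        i = roulette_pick(weights, rng)
        x = minority[i]
        # Standard SMOTE step: interpolate towards a random minority neighbour.
        same = [m for j, m in enumerate(minority) if j != i]
        neighbours = sorted(same, key=lambda m: math.dist(x, m))[:k]
        nn = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from x to nn
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(x, nn)))
    return synthetic


minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
majority = [(5.0, 5.0), (6.0, 5.0), (5.0, 6.0), (6.0, 6.0), (7.0, 7.0)]
new_samples = oversample(minority, majority, n_new=10)
```

Under this assumed weighting, minority samples surrounded mostly by majority points are chosen as seeds less often. Whether SSMOTE favors or penalizes such borderline samples is not stated in the abstract, so the direction of the weighting is itself an assumption. The augmented minority set could then be passed to any KNN classifier.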