Computer Engineering and Applications ›› 2014, Vol. 50 ›› Issue (6): 92-95.

• Database, Data Mining, Machine Learning •

New oversampling algorithm DB_SMOTE

LIU Yuxia1, LIU Sanmin2,3, LIU Tao2, WANG Zhongqun4

  1. College of Civil Engineering and Architecture, Anhui Polytechnic University, Wuhu, Anhui 241000, China
  2. College of Computer and Information, Anhui Polytechnic University, Wuhu, Anhui 241000, China
  3. College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, Nanjing 210016, China
  4. College of Management and Engineering, Anhui Polytechnic University, Wuhu, Anhui 241000, China
  • Online: 2014-03-15 Published: 2015-05-12

Abstract: To address the asymmetry of class distribution information in imbalanced data sets, a new oversampling algorithm, DB_SMOTE (Distance-based Synthetic Minority Over-sampling Technique), is proposed. It addresses the shortage of minority-class samples by synthesizing new ones. The algorithm extracts seed samples based on each sample's distance to the class center, combined with the degree of class aggregation. Following the idea of SMOTE (Synthetic Minority Over-sampling Technique), new minority-class samples are synthesized from the seed samples. The distribution function of the new samples is constructed from the distance between the seed samples and the minority-class center. Classification experiments using this sampling algorithm on several data sets show that DB_SMOTE is feasible.

Key words: imbalanced data learning, oversampling, data classification
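
The abstract names three steps (seed extraction by distance to the class center, SMOTE-style synthesis on the seeds, and a distance-based distribution of new samples) without giving formulas. The sketch below is only one possible reading, not the authors' implementation. The function name `db_smote_sketch`, the seed rule (distance no greater than the mean distance), the weighting `1 / (1 + distance)`, and the parameter `k` are all assumptions introduced for illustration.

```python
import numpy as np


def db_smote_sketch(X_min, n_new, k=3, seed=0):
    """Distance-guided, SMOTE-style oversampling (illustrative assumptions only)."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)

    # Distance of every minority sample to the minority-class center.
    center = X_min.mean(axis=0)
    dist = np.linalg.norm(X_min - center, axis=1)

    # ASSUMED seed rule: keep samples no farther from the center than the
    # mean distance, using the mean distance as a crude aggregation measure.
    mask = dist <= dist.mean()
    seeds, seed_dist = X_min[mask], dist[mask]

    # ASSUMED distribution function: seeds closer to the center are chosen
    # more often as the base of a synthetic sample.
    weights = 1.0 / (1.0 + seed_dist)
    weights /= weights.sum()

    k = min(k, len(seeds) - 1)
    new_samples = np.empty((n_new, X_min.shape[1]))
    for i in range(n_new):
        base = seeds[rng.choice(len(seeds), p=weights)]
        if k < 1:
            # Only one seed exists, so there is nothing to interpolate with.
            new_samples[i] = base
            continue
        # SMOTE step: interpolate between the base seed and one of its
        # k nearest seed neighbors (index 0 of the sort is the base itself).
        d = np.linalg.norm(seeds - base, axis=1)
        neighbors = np.argsort(d)[1:k + 1]
        other = seeds[rng.choice(neighbors)]
        new_samples[i] = base + rng.random() * (other - base)
    return new_samples
```

Because every synthetic point is a convex combination of two seeds, the output stays inside the bounding box of the original minority samples. The paper's actual seed criterion and distribution function may differ from these stand-ins.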