Imbalanced data sets classification method based on over-sampling technique

doi:10.3778/j.issn.1002-8331.2011.01.038

Computer Engineering and Applications, 2011, Vol. 47, Issue (1): 139-143. DOI: 10.3778/j.issn.1002-8331.2011.01.038

• Database, Signal and Information Processing •

WANG Chunyu, SU Hongye, QU Yu, CHU Jian

State Key Lab of Industrial Control Technology, Institute of Cyber-Systems & Control, Zhejiang University, Hangzhou 310027, China

Received: 2010-08-09 Revised: 2010-10-14 Online: 2011-01-01 Published: 2011-01-01
Contact: WANG Chunyu

Abstract

Classifying data with an imbalanced class distribution is an active research topic in machine learning. To address the difficulties that imbalance causes, especially poor predictive accuracy on the minority class, this paper presents an improved approach, AdaBoost-SVM-OBMS, which combines Boosting, an ensemble learning algorithm, with an improved over-sampling method based on misclassified samples. Using a support vector machine as the base classifier, the approach identifies the misclassified samples in each Boosting iteration and uses them to generate new samples separately for the majority and minority classes. The new samples are added to the original training set and the classification model is retrained, which improves the prediction of hard-to-classify samples. The method is evaluated in terms of AUC, F-value, and G-mean on eight imbalanced data sets. The results indicate that the improved approach achieves high predictive performance on imbalanced data sets.
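The over-sampling step can be illustrated with a minimal sketch. It assumes, following the Chinese-language abstract, that new points are drawn at random between a misclassified sample and its nearest neighbours and inherit that sample's class. The restriction to same-class neighbours, the function name, and the parameters `k`, `n_new`, and `seed` are hypothetical choices made for illustration, not details taken from the paper; the SVM training and Boosting weight updates are omitted.

```python
import math
import random


def oversample_misclassified(X, y, misclassified, k=3, n_new=2, seed=0):
    """Generate synthetic samples around misclassified points.

    X: list of feature vectors (lists of floats); y: list of class labels.
    misclassified: indices of samples the current classifier got wrong.
    k, n_new, seed: hypothetical parameters, not taken from the paper.
    Returns (new_X, new_y); each new point keeps the label of its source sample.
    """
    rng = random.Random(seed)
    new_X, new_y = [], []
    for i in misclassified:
        # Candidate neighbours: other samples of the same class (an assumption).
        same = [j for j in range(len(X)) if j != i and y[j] == y[i]]
        if not same:
            continue  # no same-class neighbour to interpolate with
        same.sort(key=lambda j: math.dist(X[i], X[j]))
        neighbours = same[:k]
        for _ in range(n_new):
            j = rng.choice(neighbours)
            t = rng.random()  # position along the segment from X[i] to X[j]
            new_X.append([a + t * (b - a) for a, b in zip(X[i], X[j])])
            new_y.append(y[i])
    return new_X, new_y


# Toy data: three points of class 0 and two of class 1.
X = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0], [6.0, 5.0]]
y = [0, 0, 0, 1, 1]
new_X, new_y = oversample_misclassified(X, y, misclassified=[0, 3])
```

In a Boosting loop, the returned points would be appended to the training set before the next base classifier is fitted. Because samples of either class can be misclassified, this produces new points for both the majority and the minority class.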

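Abstract (translated from the Chinese version): Classification of imbalanced data sets is an active research topic in machine learning. To address the difficulty of classifying imbalanced data sets, in particular the poor recognition of the minority class caused by the imbalanced distribution, an improved algorithm, AdaBoost-SVM-OBMS, is proposed. It combines the Boosting algorithm with an over-sampling technique that generates new samples from misclassified samples. With a support vector machine as the base classifier, the misclassified samples are marked in each Boosting iteration, and a number of new samples of the same class as each misclassified sample are then generated at random between that sample and its nearest neighbours. The new samples are added to the original training set for retraining, to improve recognition of hard-to-classify samples. AdaBoost-SVM-OBMS is compared with the AdaBoost-SVM and APLSC algorithms on eight benchmark data sets under three evaluation metrics, AUC, F-value, and G-mean; the experimental results demonstrate the effectiveness of AdaBoost-SVM-OBMS for classifying imbalanced data sets.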
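For readers unfamiliar with the three metrics, the sketch below computes them under commonly used definitions, which the abstract itself does not spell out: F-value as the weighted harmonic mean of precision and recall, G-mean as the geometric mean of recall and specificity, and AUC as the fraction of positive-negative pairs that the classifier scores in the right order. Treating the minority class as the positive class (label 1) and the function names are assumptions made for illustration.

```python
import math


def f_value_and_g_mean(tp, fn, fp, tn, beta=1.0):
    """F-value and G-mean from confusion-matrix counts (minority class = positive)."""
    recall = tp / (tp + fn) if tp + fn else 0.0       # accuracy on the positive class
    precision = tp / (tp + fp) if tp + fp else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0  # accuracy on the negative class
    denom = beta ** 2 * precision + recall
    f_value = (1 + beta ** 2) * precision * recall / denom if denom else 0.0
    g_mean = math.sqrt(recall * specificity)
    return f_value, g_mean


def auc(scores, labels):
    """Pairwise AUC: share of (positive, negative) pairs ranked correctly; ties count 0.5.

    Assumes at least one positive (label 1) and one negative (label 0) sample.
    """
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy example: 10 minority and 90 majority samples.
f1, gm = f_value_and_g_mean(tp=8, fn=2, fp=4, tn=86)
area = auc([0.9, 0.8, 0.3, 0.2], [1, 0, 1, 0])
```

Unlike plain accuracy, which a classifier can inflate by always predicting the majority class, all three metrics drop sharply when the minority class is poorly recognised.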

CLC Number: TP181

WANG Chunyu, SU Hongye, QU Yu, CHU Jian. Imbalanced data sets classification method based on over-sampling technique[J]. Computer Engineering and Applications, 2011, 47(1): 139-143.


