Computer Engineering and Applications, 2021, Vol. 57, Issue (13): 96-101. DOI: 10.3778/j.issn.1002-8331.2005-0317


Entropy-Based Oversampling Framework

ZHANG Nianpeng, WU Xu, ZHU Qiang   

  1. School of Mathematics and Statistics, Xidian University, Xi’an 710071, China
  • Online: 2021-07-01 Published: 2021-06-29





Although data mining technology has gradually matured and is applied to a wide range of practical problems, the field still faces many challenges, such as the classification of imbalanced datasets. When oversampling is used to address this problem, usually only the imbalance in class sizes is considered, not whether the data distribution is balanced. This paper uses information entropy to measure the local density of the dataset, and thereby assesses imbalance in terms of the data distribution. It also proposes the concept of a dangerous set and three strategies for using it: an entropy-based dangerous set oversampling algorithm, an entropy-based safe set oversampling algorithm, and an entropy-based adaptive oversampling algorithm. Experimental results show that these algorithms can effectively improve the performance of classic oversampling algorithms. The work offers useful experience for follow-up studies on applying information entropy theory to imbalanced data.
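The abstract does not define how the entropy measure or the dangerous set is computed, so the following is only a minimal sketch under stated assumptions, not the authors' algorithm. It assumes that local density information can be approximated by the Shannon entropy of the class labels among each sample's k nearest neighbours, and that a minority sample counts as "dangerous" when this entropy exceeds a threshold. The function names `local_label_entropy` and `split_minority`, and the parameters `k` and `threshold`, are hypothetical.

```python
import numpy as np


def local_label_entropy(X, y, k=5):
    """Shannon entropy (in bits) of the class labels among each
    sample's k nearest neighbours (the sample itself is excluded).

    Assumption: this stands in for the paper's entropy-based local
    density measure, which the abstract does not specify.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    # Pairwise squared Euclidean distances between all samples.
    diff = X[:, None, :] - X[None, :, :]
    dist = np.einsum("ijk,ijk->ij", diff, diff)
    np.fill_diagonal(dist, np.inf)  # a sample is not its own neighbour
    neighbours = np.argsort(dist, axis=1)[:, :k]

    entropy = np.zeros(len(X))
    for label in np.unique(y):
        # Fraction of each sample's neighbours that carry this label.
        p = (y[neighbours] == label).mean(axis=1)
        nonzero = p > 0
        entropy[nonzero] -= p[nonzero] * np.log2(p[nonzero])
    return entropy


def split_minority(X, y, minority_label, k=5, threshold=0.5):
    """Split minority samples into a 'dangerous' set (mixed-label
    neighbourhood, high entropy) and a 'safe' set (low entropy).

    Assumption: the threshold rule is illustrative only.
    """
    h = local_label_entropy(X, y, k)
    is_minority = np.asarray(y) == minority_label
    dangerous = np.where(is_minority & (h > threshold))[0]
    safe = np.where(is_minority & (h <= threshold))[0]
    return dangerous, safe


if __name__ == "__main__":
    # Toy 1-D data: majority class on the left, minority on the right,
    # with two minority samples sitting next to the class boundary.
    X = [[-3], [-2], [-1], [0], [1], [2], [3], [4], [5], [10], [11], [12], [13]]
    y = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
    dangerous, safe = split_minority(X, y, minority_label=1, k=2)
    print("dangerous:", dangerous, "safe:", safe)
```

Under these assumptions, an oversampler could draw synthetic samples only from the dangerous indices, only from the safe indices, or from both with entropy-dependent weights, which loosely mirrors the three strategies named in the abstract.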

Key words: data mining, imbalanced dataset, data classification, data distribution, information entropy


