Computer Engineering and Applications ›› 2021, Vol. 57 ›› Issue (13): 96-101. DOI: 10.3778/j.issn.1002-8331.2005-0317

• Big Data and Cloud Computing •

Entropy-Based Oversampling Framework

ZHANG Nianpeng, WU Xu, ZHU Qiang   

  1. School of Mathematics and Statistics, Xidian University, Xi’an 710071, China
  • Online:2021-07-01 Published:2021-06-29

Abstract:

Data mining and machine learning techniques are steadily maturing and are widely applied to practical problems, yet the field still faces many challenges, such as the classification of imbalanced datasets. When oversampling is used to handle such problems, usually only the imbalance in sample counts is considered, not whether the data distribution itself is balanced. This paper uses information entropy to measure the local density information of a dataset, so that its degree of imbalance is assessed from the distribution. It also proposes the concept of the entropy-based dangerous set and three strategies for using it, namely the entropy-based dangerous-set oversampling algorithm, the entropy-based safe-set oversampling algorithm, and the entropy-based adaptive oversampling algorithm. Competitive experimental results show that these algorithms can effectively improve the performance of classic oversampling algorithms, providing successful practical experience for further work that applies information entropy theory to imbalanced datasets.
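
The abstract does not give implementation details, but the entropy-based dangerous set idea can be illustrated with a minimal sketch. The sketch below assumes that each sample's local entropy is the Shannon entropy of the class labels among its k nearest neighbours, and that minority samples whose local entropy exceeds a threshold tau form the dangerous set handed to a classic oversampler such as SMOTE; the function names and the parameters k and tau are illustrative assumptions, not taken from the paper.

# Minimal sketch (not the authors' code): estimate each sample's local class
# entropy from its k nearest neighbours, then mark high-entropy minority
# samples as a "dangerous set" that a classic oversampler could focus on.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def local_entropy(X, y, k=5):
    # X: feature matrix; y: integer class labels (0 = majority, 1 = minority)
    # Returns the Shannon entropy of the labels among each sample's k nearest neighbours.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    _, idx = nn.kneighbors(X)              # column 0 is the sample itself
    ent = np.empty(len(X))
    for i, neigh in enumerate(idx[:, 1:]):
        p = np.bincount(y[neigh], minlength=2) / k
        p = p[p > 0]
        ent[i] = -np.sum(p * np.log2(p))   # 0 in pure regions, up to 1 near the class border
    return ent

def entropy_based_dangerous_set(X, y, minority_label=1, k=5, tau=0.5):
    # Indices of minority samples whose local entropy exceeds the threshold tau.
    ent = local_entropy(X, y, k=k)
    minority = np.where(y == minority_label)[0]
    return minority[ent[minority] > tau]

Under this reading, the dangerous-set strategy would restrict a classic oversampler (e.g. SMOTE) to the returned indices, the safe-set strategy would use their complement among the minority samples, and the adaptive strategy would weight synthetic samples by the entropy values; these mappings are inferred from the algorithm names only.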

Key words: data mining, imbalanced dataset, data classification, data distribution, information entropy