Entropy-Based Oversampling Framework

Computer Engineering and Applications, 2021, Vol. 57, Issue (13): 96-101. DOI: 10.3778/j.issn.1002-8331.2005-0317

ZHANG Nianpeng, WU Xu, ZHU Qiang

School of Mathematics and Statistics, Xidian University, Xi’an 710071, China

Online: 2021-07-01 Published: 2021-06-29

Abstract:

Data mining and machine learning techniques have matured and are widely applied to practical problems, but the field still faces many challenges, such as the classification of imbalanced datasets. Oversampling methods for this problem usually consider only the imbalance in the number of samples and ignore whether the data distribution is balanced. This paper uses information entropy to measure the local density of a dataset and thereby assesses its degree of imbalance in terms of distribution. It also proposes the concept of an entropy-based dangerous set and three strategies for using it: an entropy-based dangerous set oversampling algorithm, an entropy-based safe set oversampling algorithm, and an entropy-based adaptive oversampling algorithm. Competitive experimental results show that these algorithms effectively improve the performance of classic oversampling algorithms and provide successful practical experience for further research on imbalanced datasets using information entropy theory.

Key words: data mining, imbalanced dataset, data classification, data distribution, information entropy
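
The abstract does not give the paper's formulas, so the following is only a rough sketch of how information entropy could serve as a local measure around each sample. It assumes, purely for illustration, that the entropy is the Shannon entropy of the class labels among a sample's k nearest neighbours, and that minority samples above a chosen threshold form the "dangerous" set while the rest form the "safe" set. The function names, `k`, and `threshold` are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter

def local_label_entropy(points, labels, k=3):
    """Shannon entropy (bits) of class labels among each point's k nearest neighbours.

    Assumption: this neighbourhood-label entropy stands in for the paper's
    entropy-based local measure, which is not specified in the abstract.
    """
    entropies = []
    for i, p in enumerate(points):
        # Rank all other points by Euclidean distance and keep the k closest.
        neighbours = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: math.dist(p, points[j]),
        )[:k]
        n = len(neighbours)
        counts = Counter(labels[j] for j in neighbours)
        entropies.append(-sum((c / n) * math.log2(c / n) for c in counts.values()))
    return entropies

def split_minority(points, labels, minority, k=3, threshold=0.5):
    """Split minority-class indices into hypothetical 'dangerous' and 'safe' sets."""
    h = local_label_entropy(points, labels, k)
    dangerous = [i for i, y in enumerate(labels) if y == minority and h[i] > threshold]
    safe = [i for i, y in enumerate(labels) if y == minority and h[i] <= threshold]
    return dangerous, safe

# Toy data: three majority points (label 0), two minority points next to them,
# and three minority points far away in their own cluster.
points = [(0, 0), (1, 0), (2, 0), (3, 0), (4, 0), (10, 0), (11, 0), (12, 0)]
labels = [0, 0, 0, 1, 1, 1, 1, 1]
dangerous, safe = split_minority(points, labels, minority=1)
print(dangerous, safe)  # [3, 4] [5, 6, 7]
```

Under these assumptions, an oversampler could generate synthetic samples only from the dangerous set, only from the safe set, or from a mixture weighted by entropy, which loosely mirrors the three strategies named in the abstract.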

ZHANG Nianpeng, WU Xu, ZHU Qiang. Entropy-Based Oversampling Framework[J]. Computer Engineering and Applications, 2021, 57(13): 96-101.
