Entropy-Based Oversampling Framework

doi:10.3778/j.issn.1002-8331.2005-0317

Abstract

Although data mining and machine learning technologies have matured and are widely applied to practical problems, the field still faces many challenges, such as the classification of imbalanced datasets. When oversampling is used to address this problem, usually only the imbalance in class sizes is considered, not whether the data distribution is balanced. This paper uses information entropy to measure the local density of the dataset, so that imbalance is assessed in terms of the data distribution. It also proposes the concept of an entropy-based dangerous set and three strategies for using it: an entropy-based dangerous set oversampling algorithm, an entropy-based safe set oversampling algorithm, and an entropy-based adaptive oversampling algorithm. Experimental results show that these algorithms can effectively improve the performance of classic oversampling algorithms, providing practical experience for further research on applying information entropy theory to imbalanced datasets.

Key words: data mining, imbalanced dataset, data classification, data distribution, information entropy
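The abstract does not give the entropy formula or the exact definition of the dangerous set, so the following is only a minimal sketch under stated assumptions, not the authors' method. It assumes that "local" means the k nearest neighbours of a sample, that the entropy is the Shannon entropy of the class labels in that neighbourhood, and that minority samples whose entropy exceeds a threshold form the dangerous set while the rest form the safe set. The function names, `k`, and `threshold` are hypothetical.

```python
import numpy as np

def local_label_entropy(X, y, k=3):
    """Shannon entropy (in bits) of the class labels among each
    sample's k nearest neighbours (an assumed locality measure)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    # Pairwise Euclidean distances; adequate for small illustrative data.
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a sample is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]
    entropies = np.zeros(len(X))
    for c in np.unique(y):
        p = (y[neighbours] == c).mean(axis=1)  # share of class c nearby
        nz = p > 0                             # skip 0 * log(0) terms
        entropies[nz] -= p[nz] * np.log2(p[nz])
    return entropies

def split_minority(X, y, minority_label, k=3, threshold=0.5):
    """Split minority samples into an assumed 'dangerous' set (mixed
    neighbourhood, high entropy) and 'safe' set (low entropy)."""
    h = local_label_entropy(X, y, k)
    is_minority = np.asarray(y) == minority_label
    dangerous = np.where(is_minority & (h > threshold))[0]
    safe = np.where(is_minority & (h <= threshold))[0]
    return dangerous, safe

# Toy data: four majority samples on the left, six minority samples on
# the right; the sample at (2, 0.5) sits on the class boundary.
X = [[0, 0], [0, 1], [1, 0], [1, 1],
     [2, 0.5], [3, 0], [3, 1], [4, 0], [4, 1], [5, 0.5]]
y = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
dangerous, safe = split_minority(X, y, minority_label=1)
```

Under these assumptions, either index set could then be handed to a classic oversampler such as SMOTE so that new synthetic samples are generated only from the chosen region; an adaptive variant would presumably weight the two sets rather than pick one, but the abstract does not specify how.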


ZHANG Nianpeng, WU Xu, ZHU Qiang. Entropy-Based Oversampling Framework[J]. Computer Engineering and Applications, 2021, 57(13): 96-101.

张念蓬，吴旭，朱强. 基于熵的过采样框架[J]. 计算机工程与应用, 2021, 57(13): 96-101.

