Oversampling Method for Unbalanced Data by Natural Nearest Neighbor

Computer Engineering and Applications, 2021, Vol. 57, Issue 2: 91-96. DOI: 10.3778/j.issn.1002-8331.1910-0218

Oversampling Method for Unbalanced Data by Natural Nearest Neighbor

MENG Dongxia, LI Yujian

1. School of Financial Technology, Hebei Finance University, Baoding, Hebei 071051, China
2. School of Artificial Intelligence, Guilin University of Electronic Technology, Guilin, Guangxi 541004, China

Online: 2021-01-15 Published: 2021-01-14

Abstract:

Existing oversampling methods tend to introduce noise points and to generate overlapping synthetic samples. To address these problems, this paper proposes an oversampling method for imbalanced data based on natural nearest neighbors. The method first determines the natural nearest neighbors of each minority-class sample; the number of neighbors of each sample is computed adaptively by the algorithm and reflects the local density of the sample distribution. The minority-class samples are then clustered according to their natural-neighbor relations, and new samples are generated from core points in dense regions and non-core points in sparse regions of the same cluster. Comparative experiments on a two-dimensional synthetic dataset and on UCI datasets verify the feasibility and effectiveness of the method and show that it improves classification accuracy on imbalanced data.

Key words: imbalanced dataset, oversampling, natural nearest neighbor, clustering
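The pipeline summarized in the abstract (find natural neighbors, separate dense from sparse points, interpolate between them) can be sketched roughly as follows. This is an illustrative sketch only, not the authors' implementation: the abstract gives neither the stopping rule for the neighbor search, nor the definition of a core point, nor the clustering procedure. The sketch assumes a commonly described natural-neighbor search (grow the round r until every point has been chosen as someone's neighbor, or the number of unchosen points stops changing), a hypothetical core rule (a point is core if it has at least the mean number of natural neighbors), and it stands in for the clustering step by pairing each core point only with non-core points that are its natural neighbors. The names `natural_neighbors` and `oversample` are made up for this example.

```python
import numpy as np


def natural_neighbors(X, max_r=None):
    """Return (r, neighbors); neighbors[i] is the set of natural neighbors of X[i].

    Assumed formulation: in round r every point votes for its r-th nearest
    neighbor. The search stops when every point has received a vote or the
    number of unvoted points no longer changes. Points i and j are natural
    neighbors when each lies within the other's r nearest neighbors.
    """
    X = np.asarray(X, dtype=float)
    n = len(X)
    max_r = max_r or n - 1
    # Pairwise Euclidean distances; the diagonal is set to inf so that a
    # point is never counted as its own neighbor.
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)
    order = np.argsort(dist, axis=1)

    votes = np.zeros(n, dtype=int)
    prev_unvoted = -1
    r = 0
    while r < max_r:
        r += 1
        np.add.at(votes, order[:, r - 1], 1)  # vote for the r-th nearest neighbor
        unvoted = int(np.sum(votes == 0))
        if unvoted == 0 or unvoted == prev_unvoted:
            break
        prev_unvoted = unvoted

    knn = [set(order[i, :r].tolist()) for i in range(n)]
    neighbors = [{j for j in knn[i] if i in knn[j]} for i in range(n)]
    return r, neighbors


def oversample(X_min, n_new, seed=None):
    """Generate n_new synthetic minority samples by linear interpolation."""
    X_min = np.asarray(X_min, dtype=float)
    rng = np.random.default_rng(seed)
    _, neighbors = natural_neighbors(X_min)

    # Hypothetical core rule (an assumption, not taken from the paper).
    counts = np.array([len(s) for s in neighbors])
    core = counts >= counts.mean()

    # Pair each core point with its non-core natural neighbors.
    pairs = [(i, j) for i, nbrs in enumerate(neighbors) if core[i]
             for j in nbrs if not core[j]]
    if not pairs:  # fall back to all natural-neighbor pairs
        pairs = [(i, j) for i, nbrs in enumerate(neighbors) for j in nbrs]
    if not pairs:
        raise ValueError("no natural-neighbor pairs found")

    pairs = np.array(pairs)
    chosen = pairs[rng.integers(len(pairs), size=n_new)]
    u = rng.random((n_new, 1))  # interpolation weights in [0, 1)
    a, b = X_min[chosen[:, 0]], X_min[chosen[:, 1]]
    return a + u * (b - a)
```

Because every synthetic point lies on a segment between two mutually neighboring minority samples, none can fall outside the bounding box of the minority class. That matches the abstract's stated aim of limiting noise, though the paper's actual generation rule may differ.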

Cite this article: MENG Dongxia, LI Yujian. Oversampling Method for Unbalanced Data by Natural Nearest Neighbor[J]. Computer Engineering and Applications, 2021, 57(2): 91-96.

