Computer Engineering and Applications ›› 2024, Vol. 60 ›› Issue (16): 143-158. DOI: 10.3778/j.issn.1002-8331.2305-0191

• Pattern Recognition and Artificial Intelligence •

Imbalanced Classification Method Based on Cross-Class Sample Migration Framework

YU Haibo (于海波), LIU Jing (刘婧), LI Qiangwei (李强伟), GAO Xin (高欣), TAN Huang (谭煌), CHEN Tianyang (陈天阳)

  1. Research Institute of Metrology, China Electric Power Research Institute Co., Ltd., Beijing 100192, China
  2. School of Artificial Intelligence, Beijing University of Posts and Telecommunications, Beijing 100876, China
  • Online: 2024-08-15 Published: 2024-08-15

Abstract: In imbalanced classification, balancing the number and distribution of samples in the class-overlap region is key to alleviating subsequent decision bias. Existing methods usually generate new samples only from minority-class samples to balance class sizes, and so do not make full use of the rich information in majority-class samples. In particular, when the absolute number of minority-class samples is very small, the original minority-class samples alone cannot effectively balance the sample distribution in the overlap region. This paper proposes an imbalanced classification method based on a cross-class sample migration framework. First, a mapping network built from fully connected layers is embedded in the latent-code sampling process of a variational autoencoder (VAE). Once the VAE has learned the commonalities and class-specific characteristics of the different classes, the latent codes of majority-class samples are transformed by the mapping network under a latent-code prior constraint and a cross-domain consistency constraint, so that the latent codes before and after transformation share the same distribution space; the VAE decoder then migrates majority-class samples to minority-class samples. In addition, a generative adversarial mechanism is incorporated: adversarial discrimination is applied both to original versus new samples and to latent codes before versus after transformation, which further improves the reliability of the migrated samples. On this basis, weighted constraints are placed on the distances between the newly generated samples and the original samples of each class, and the samples closer to the overlap region are selected, so that the number and distribution of samples from different classes in that region become more balanced. Experimental results on 16 public datasets show that the proposed method significantly outperforms 10 typical imbalanced classification methods in F1 measure and G-mean. The improvement is even more pronounced on 11 public datasets with a high imbalance ratio and a very small absolute number of minority-class samples.
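The two evaluation metrics named in the abstract have standard definitions: F1 is the harmonic mean of precision and recall on the minority (positive) class, and G-mean is the geometric mean of the true-positive rate and the true-negative rate. The following is a minimal sketch of both, not the authors' evaluation code; the function name `f1_and_gmean` and the convention that label 1 marks the minority class are assumptions made for illustration.

```python
import math

def f1_and_gmean(y_true, y_pred, positive=1):
    """Return (F1, G-mean) for binary labels, treating `positive` as the minority class."""
    tp = fp = fn = tn = 0
    for t, p in zip(y_true, y_pred):
        if t == positive and p == positive:
            tp += 1
        elif t != positive and p == positive:
            fp += 1
        elif t == positive:
            fn += 1
        else:
            tn += 1
    # Guard every ratio against an empty denominator.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0       # true-positive rate
    specificity = tn / (tn + fp) if tn + fp else 0.0  # true-negative rate
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    gmean = math.sqrt(recall * specificity)
    return f1, gmean

# Hypothetical data: 2 minority samples among 10.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
always_majority = [0] * 10
f1, gmean = f1_and_gmean(y_true, always_majority)
```

A classifier that always predicts the majority class reaches 80% accuracy on this toy data yet scores zero on both metrics, which is why they are preferred over accuracy for imbalanced problems.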

Key words: imbalanced classification, cross-class sample migration framework, variational autoencoder, mapping network, generative adversarial mechanism, weighted Euclidean distance constraint
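The screening step described in the abstract (weighting the distances from each generated sample to the original samples of each class, then keeping those nearer the overlap region) could be sketched as follows. The abstract does not give the weighting formula, so everything here is an assumption for illustration: the score is a weighted sum of mean Euclidean distances to the two classes, and the names `screen_generated`, `w_min`, `w_maj`, and `keep` are hypothetical. A low score means the sample is not far from either class, which serves only as a rough proxy for lying near the overlap region.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def mean_distance(point, samples):
    """Mean Euclidean distance from `point` to every sample in `samples`."""
    return sum(euclidean(point, s) for s in samples) / len(samples)

def screen_generated(generated, minority, majority, w_min=0.5, w_maj=0.5, keep=2):
    """Keep the `keep` generated samples with the lowest weighted distance score.

    Assumed score (not taken from the paper):
        w_min * mean distance to minority + w_maj * mean distance to majority
    """
    scored = []
    for g in generated:
        score = w_min * mean_distance(g, minority) + w_maj * mean_distance(g, majority)
        scored.append((score, g))
    scored.sort(key=lambda pair: pair[0])  # lowest score first
    return [g for _, g in scored[:keep]]

# Hypothetical 2-D data: the two classes sit on either side of x = 2.5.
minority = [(0.0, 0.0), (1.0, 0.0)]
majority = [(4.0, 0.0), (5.0, 0.0)]
generated = [(2.5, 0.0), (-3.0, 0.0), (8.0, 0.0), (2.0, 0.0)]
kept = screen_generated(generated, minority, majority)
```

With these toy values the two samples lying between the classes are kept, and the two lying beyond either class are discarded.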