Computer Engineering and Applications (计算机工程与应用), 2024, Vol. 60, Issue (13): 92-101. DOI: 10.3778/j.issn.1002-8331.2310-0438

• Theory, Research and Development •

Boundary Oversampling Based Graph Node Imbalance Classification Algorithm (基于边界过采样的图节点不平衡分类算法)

WU Tianhao (武天昊), DONG Minggang (董明刚), TAN Ruoqi (谭若琦)

  1. School of Information Science and Engineering, Guilin University of Technology, Guilin, Guangxi 541006, China
  2. Guangxi Key Laboratory of Embedded Technology and Intelligent Systems, Guilin, Guangxi 541006, China
  • Online: 2024-07-01 Published: 2024-07-01

Abstract: Financial fraud detection and disease diagnosis are typical real-world imbalanced graph problems, and oversampling-based graph neural networks are one of the common ways to address them. However, such methods struggle to ensure the diversity of the boundary samples they generate, which tends to degrade classification performance. This paper proposes a boundary-oversampling-based imbalanced graph node classification algorithm (ImBS) to improve the diversity of generated samples. First, ImBS uses a two-layer graph neural classification network to select the high-confidence samples of each class as sampling anchors, which makes the anchors more representative. Next, to make the distribution of generated samples more reasonable, it uses the confusion matrix obtained in the previous step to compute the distribution ratio of minority-class misclassifications and, based on this ratio, adaptively computes the number of samples to generate between different classes. Building on this, an anchor-based hybrid oversampling method is proposed: it oversamples boundary nodes by mixing the features of anchors from different classes, which increases sample diversity and expands the decision boundary of the minority class. In addition, to avoid creating harmful connections, personalized PageRank is used to generate a neighborhood distribution for each oversampled sample. Experiments on three real-world datasets (Cora, CiteSeer, and Cora-Full) show that the method has a clear advantage over nine representative methods.

Key words: graph neural network, imbalanced node classification, boundary oversampling
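
The four steps named in the abstract (anchor selection, confusion-matrix-based allocation, anchor mixing, and personalized PageRank neighborhoods) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract gives no formulas, so every function name, the mixup-style interpolation, the `lam_low` bound, the teleport probability `alpha`, and the convention that confusion-matrix rows are true classes and columns are predicted classes are assumptions made here for illustration only.

```python
import numpy as np

def select_anchors(probs, labels, top_k):
    """Per class, keep the top_k labelled nodes with the highest predicted
    probability for their own class (assumed notion of 'high confidence')."""
    anchors = {}
    for c in np.unique(labels):
        idx = np.where(labels == c)[0]
        order = np.argsort(-probs[idx, c])
        anchors[int(c)] = idx[order[:top_k]]
    return anchors

def allocate_counts(conf_mat, minority, n_new):
    """Split n_new synthetic samples across the other classes in proportion
    to how often the minority class is misclassified as each of them.
    Assumes rows = true class, columns = predicted class, >= 2 classes."""
    errors = conf_mat[minority].astype(float)
    errors[minority] = 0.0
    if errors.sum() == 0:  # no misclassifications: fall back to uniform
        errors = np.ones(len(errors))
        errors[minority] = 0.0
    ratio = errors / errors.sum()
    counts = np.floor(ratio * n_new).astype(int)
    counts[np.argmax(ratio)] += n_new - counts.sum()  # assign the remainder
    return counts

def mix_anchors(x, minority_anchors, other_anchors, n, rng, lam_low=0.5):
    """Mixup-style interpolation (an assumption): lam * minority anchor
    + (1 - lam) * other-class anchor, with lam in [lam_low, 1)."""
    i = rng.choice(minority_anchors, size=n)
    j = rng.choice(other_anchors, size=n)
    lam = rng.uniform(lam_low, 1.0, size=(n, 1))
    return lam * x[i] + (1.0 - lam) * x[j], i, j, lam

def personalized_pagerank(adj, alpha=0.15):
    """Dense PPR matrix alpha * (I - (1 - alpha) * D^-1 A)^-1.
    Row s is the PPR distribution seeded at node s."""
    adj = np.asarray(adj, dtype=float)
    deg = adj.sum(axis=1, keepdims=True)
    deg[deg == 0] = 1.0  # avoid division by zero for isolated nodes
    n = adj.shape[0]
    return alpha * np.linalg.inv(np.eye(n) - (1.0 - alpha) * adj / deg)

def neighbor_distribution(ppr, i, j, lam):
    """Assumed rule: mix the PPR rows of the two source anchors with the
    same lam used for the features; neighbors can then be drawn from it."""
    return lam * ppr[i] + (1.0 - lam) * ppr[j]
```

In this sketch, `probs` would come from a trained two-layer GNN classifier, `allocate_counts` decides how many synthetic nodes to generate toward each confusing class, and the top-scoring entries of each row returned by `neighbor_distribution` could be taken as the synthetic node's neighbors.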