Computer Engineering and Applications ›› 2025, Vol. 61 ›› Issue (10): 79-95. DOI: 10.3778/j.issn.1002-8331.2406-0107

• Theory, Research and Development •

Imbalanced Stream Classification Algorithm Based on Weighted and Dynamic Selection

HAN Meng, LI Chunpeng, LI Ang, MENG Fanxing, HE Feifei, ZHANG Ruihua   

  1. School of Computer Science & Engineering, North Minzu University, Yinchuan 750021, China
  • Online:2025-05-15 Published:2025-05-15

Abstract: Data stream mining is a key task in data mining that aims to process continuously generated and evolving data streams. Unlike traditional batch data mining, it emphasizes processing and analyzing data in real time, which makes it more timely and practical. However, real-world data streams exhibit multi-class imbalance, time-varying class imbalance ratios, and concept drift, all of which can significantly degrade classifier performance. To address these issues, an imbalanced stream classification algorithm based on weighted and dynamic selection, named sample difficulty weighting and dynamic ensemble selection (SDW-DES), is proposed. By jointly considering sample difficulty and data dynamics, the algorithm provides a reliable solution for real-time applications. First, a weighting strategy based on sample classification difficulty is introduced; it combines sample margin values with Focal Loss to focus more effectively on easily misclassified samples and minority-class samples, thereby improving classifier accuracy. Second, a flexible dynamic ensemble selection method is proposed; it maintains a sample sliding window and a hard-sample sliding window, evaluates each classifier's performance on both windows, weights the results, and selects the best classifiers in the ensemble for prediction, so as to adapt to dynamic changes in the data distribution. SDW-DES is compared with 9 advanced algorithms across a variety of data stream environments and evaluation metrics. The experimental results show that SDW-DES achieves the best average rank on 4 evaluation metrics and adapts better to imbalance and concept drift in data streams.

Key words: data stream classification, multi-class imbalance, concept drift, sample weighting, dynamic ensemble selection
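
The abstract names the two components of SDW-DES but does not give their formulas. The Python sketch below only illustrates the general idea and should not be read as the authors' implementation. Everything in it is an assumption made for illustration: the function and class names, the additive combination of the focal term and the margin term, the rule that a sample counts as hard when the currently selected classifier misclassifies it, the mixing parameter `alpha`, and the simplification of selecting a single best classifier.

```python
from collections import deque


def margin(probs, y):
    """Probability of the true class minus the strongest rival class, in [-1, 1]."""
    rival = max(p for i, p in enumerate(probs) if i != y)
    return probs[y] - rival


def difficulty_weight(probs, y, gamma=2.0):
    """Hypothetical sample weight in [1, 3]; larger means harder to classify.

    Assumed combination (not taken from the paper): a focal-loss style
    modulating factor (1 - p_true)^gamma plus a rescaled margin term.
    """
    focal_term = (1.0 - probs[y]) ** gamma
    margin_term = (1.0 - margin(probs, y)) / 2.0  # maps [-1, 1] to [1, 0]
    return 1.0 + focal_term + margin_term


class DynamicSelector:
    """Illustrative selector that scores classifiers on two sliding windows."""

    def __init__(self, classifiers, window=100, hard_window=50, alpha=0.5):
        self.classifiers = classifiers          # callables: x -> predicted label
        self.recent = deque(maxlen=window)      # most recent labelled samples
        self.hard = deque(maxlen=hard_window)   # recently misclassified samples
        self.alpha = alpha                      # assumed weight of the recent window

    @staticmethod
    def _accuracy(clf, samples):
        if not samples:
            return 0.0
        return sum(clf(x) == y for x, y in samples) / len(samples)

    def score(self, clf):
        """Weighted mix of accuracy on the recent window and the hard window."""
        return (self.alpha * self._accuracy(clf, self.recent)
                + (1.0 - self.alpha) * self._accuracy(clf, self.hard))

    def select(self):
        """Return the classifier with the highest combined score."""
        return max(self.classifiers, key=self.score)

    def predict(self, x):
        return self.select()(x)

    def update(self, x, y):
        """Test-then-train step: record the labelled sample in the windows."""
        if self.predict(x) != y:
            self.hard.append((x, y))
        self.recent.append((x, y))
```

In this sketch a sample that is confidently misclassified receives a weight close to 3, while a confidently correct one stays close to 1. The selector switches to another ensemble member as soon as that member performs better on the recent and hard windows, which is one simple way to react to drift.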