Data Stream Classification Method Combining Micro-Clustering and Active Learning

doi:10.3778/j.issn.1002-8331.2210-0230

Abstract

Data stream classification is an important research topic in data mining, but concept drift and the high cost of labeling make it challenging. Most existing work relies on online classification techniques based on active learning, which alleviate concept drift and label scarcity to some extent; however, these methods classify inefficiently and ignore memory overhead. To address these problems, a data stream classification method combining micro-clustering and active learning (CALC) is proposed. First, a new hybrid query strategy for active learning is proposed and combined with error-based representation learning to measure the importance of each micro-cluster during maintenance, and a set of micro-clusters is maintained dynamically to adapt to concept drift in the data stream. Second, a micro-cluster-based lazy learning approach is used to classify the data stream and to update the cached micro-clusters online. Finally, experiments on three real-world datasets and three synthetic datasets show that CALC outperforms existing data stream classification algorithms in classification accuracy and memory overhead. Compared with the baseline model ORSL (online reliable semi-supervised learning on evolving data streams), CALC improves the average accuracy on the six datasets by 5.07, 2.41, 1.04, 1.03, 3.47, and 0.64 percentage points, respectively.

Key words: active learning, data stream classification, micro-clustering, concept drift
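
The abstract describes CALC only at a high level, so the following is a minimal Python sketch of the general idea rather than the authors' algorithm. It assumes a nearest-micro-cluster (lazy) classifier, a hybrid query rule that mixes a distance-margin uncertainty test with occasional random queries, and an importance score that is raised on correct predictions and lowered on errors. All names and parameters (`MicroClusterClassifier`, `radius`, `margin`, `random_rate`, `decay`) are hypothetical and do not come from the paper.

```python
import math
import random
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class MicroCluster:
    center: List[float]
    label: int
    count: int = 1
    importance: float = 1.0  # hypothetical score, not the paper's measure

    def absorb(self, x):
        # Incremental mean update of the cluster center.
        self.count += 1
        self.center = [c + (v - c) / self.count for c, v in zip(self.center, x)]


class MicroClusterClassifier:
    """Toy lazy learner over a bounded set of labeled micro-clusters."""

    def __init__(self, max_clusters=20, radius=1.0, margin=0.3,
                 random_rate=0.05, decay=0.99, seed=0):
        self.clusters: List[MicroCluster] = []
        self.max_clusters = max_clusters
        self.radius = radius
        self.margin = margin
        self.random_rate = random_rate
        self.decay = decay
        self.rng = random.Random(seed)

    def _nearest(self, x, exclude_label=None) -> Optional[MicroCluster]:
        pool = [c for c in self.clusters if c.label != exclude_label]
        return min(pool, key=lambda c: math.dist(c.center, x), default=None)

    def predict(self, x) -> Optional[int]:
        # Lazy learning: no model is trained; the nearest micro-cluster decides.
        nearest = self._nearest(x)
        return None if nearest is None else nearest.label

    def should_query(self, x) -> bool:
        # Assumed hybrid rule: query if uncertain, else query at random rate.
        nearest = self._nearest(x)
        if nearest is None:
            return True
        rival = self._nearest(x, exclude_label=nearest.label)
        if rival is None:
            return True  # only one class seen so far
        d1 = math.dist(nearest.center, x)
        d2 = math.dist(rival.center, x)
        uncertain = (d2 - d1) / (d2 + 1e-12) < self.margin
        return uncertain or self.rng.random() < self.random_rate

    def update(self, x, y) -> None:
        # Fade old clusters so stale concepts lose importance over time.
        for c in self.clusters:
            c.importance *= self.decay
        nearest = self._nearest(x)
        if (nearest is not None and nearest.label == y
                and math.dist(nearest.center, x) <= self.radius):
            nearest.absorb(x)
            nearest.importance += 1.0
            return
        if nearest is not None and nearest.label != y:
            nearest.importance -= 1.0  # penalize the cluster that misled us
        self.clusters.append(MicroCluster(list(x), y))
        if len(self.clusters) > self.max_clusters:
            # Evict the least important cluster, never the one just added.
            weakest = min(range(len(self.clusters) - 1),
                          key=lambda i: self.clusters[i].importance)
            self.clusters.pop(weakest)


def run_stream(stream, model):
    # Test-then-train loop: predict first, then request a label if needed.
    correct = queried = total = 0
    for x, y in stream:
        total += 1
        correct += model.predict(x) == y
        if model.should_query(x):
            queried += 1
            model.update(x, y)
    return correct / total, queried


def make_toy_stream(n=200, seed=1):
    # Two well-separated Gaussian classes with alternating labels.
    rng = random.Random(seed)
    for i in range(n):
        y = i % 2
        mu = 5.0 * y
        yield [rng.gauss(mu, 0.2), rng.gauss(mu, 0.2)], y
```

Calling `run_stream(make_toy_stream(), MicroClusterClassifier())` returns the prequential accuracy and the number of labels requested. Capping the number of clusters with `max_clusters` is one simple way to bound memory, which is the concern the abstract raises about earlier methods.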

YIN Chunyong, CHEN Shuangshuang. Data Stream Classification Method Combining Micro-Clustering and Active Learning[J]. Computer Engineering and Applications, 2023, 59(20): 254-265.
