Computer Engineering and Applications ›› 2013, Vol. 49 ›› Issue (8): 6-8.

• Doctoral Forum •

Data stream clustering algorithm based on semi-supervised affinity propagation

WANG Wenshuai (1,2), CHEN Gang (1)

  1. Computing Center, Institute of High Energy Physics, Chinese Academy of Sciences, Beijing 100049, China
  2. University of Chinese Academy of Sciences, Beijing 100049, China
  • Online: 2013-04-15 Published: 2013-04-15

Abstract: To improve the clustering quality of evolving data streams, this paper proposes SAPStream, a data stream clustering algorithm based on semi-supervised affinity propagation. Drawing on ideas from semi-supervised clustering, the algorithm constructs a similarity matrix for the initial portion of the data stream, runs affinity propagation (AP) clustering on it, and builds an online clustering model. As the data stream evolves, the model is adjusted using a decay window technique, and the exemplars produced so far are clustered again together with newly arrived data points to obtain the clustering result for the stream. SAPStream can analyze and process large-scale evolving data streams. Its performance is tested on both real and synthetic datasets, and the experimental results show that the algorithm is effective and achieves high clustering quality.
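
The abstract names two ingredients that a small sketch may make more concrete: a similarity matrix adjusted by semi-supervised information, and a decay window that ages the clustering model. The Python below is illustrative only and is not taken from the paper. It assumes negative squared Euclidean distance as the base similarity (the usual choice for AP), pairwise must-link / cannot-link constraints as the form of supervision, and an exponential decay of the form 2^(-lam * dt). The function names and the parameters `far`, `lam`, and `min_weight` are hypothetical. AP itself is not implemented here.

```python
def constrained_similarity(points, must_link=(), cannot_link=(), far=-1e9):
    """Build a similarity matrix for AP, then override constrained pairs.

    Assumption: similarity is the negative squared Euclidean distance.
    The diagonal is left at zero; AP would normally replace it with a
    'preference' value chosen by the caller.
    """
    n = len(points)
    S = [[-float(sum((a - b) ** 2 for a, b in zip(points[i], points[j])))
          for j in range(n)] for i in range(n)]
    for i, j in must_link:       # pair known to share a class: maximal similarity
        S[i][j] = S[j][i] = 0.0
    for i, j in cannot_link:     # pair known to differ: very low similarity
        S[i][j] = S[j][i] = far
    return S


def decay_exemplars(exemplars, now, lam=0.1, min_weight=0.5):
    """Age exemplar weights and drop exemplars that have faded away.

    exemplars maps an exemplar id to (weight, last_update_time).
    Assumption: weights decay as weight * 2 ** (-lam * elapsed_time).
    """
    kept = {}
    for key, (weight, last_update) in exemplars.items():
        decayed = weight * 2.0 ** (-lam * (now - last_update))
        if decayed >= min_weight:
            kept[key] = (decayed, now)
    return kept


# Hypothetical usage: three 2-D points with one constraint of each kind,
# and a model with two exemplars aged by ten time units.
S = constrained_similarity(
    [(0.0, 0.0), (3.0, 4.0), (0.0, 1.0)],
    must_link=[(0, 1)],
    cannot_link=[(0, 2)],
)
model = decay_exemplars({"a": (4.0, 0), "b": (0.8, 0)}, now=10)
```

In a pipeline of the kind the abstract describes, the surviving exemplars would then be clustered again together with newly arrived points. That step is omitted because the abstract does not specify it in detail.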

Key words: data stream, semi-supervised, affinity propagation clustering, decay windows