Computer Engineering and Applications, 2017, Vol. 53, Issue (19): 71-75. DOI: 10.3778/j.issn.1002-8331.1611-0016


Research and Implementation of KNN Classification Algorithm for Streaming Data Based on Storm

ZHOU Zhiyang, FENG Baiming, YANG Penglin, WEN Xianghui   

  1. College of Computer Science and Engineering, Northwest Normal University, Lanzhou 730070, China
  • Online: 2017-10-01 Published: 2017-10-13

Research and Implementation of a Storm-Based KNN Classification Algorithm for Streaming Data

周志阳,冯百明,杨朋霖,温向慧   

  1. College of Computer Science and Engineering, Northwest Normal University, Lanzhou 730070

Abstract: The K-Nearest Neighbor (KNN) algorithm is a simple, effective, and easy-to-implement classification algorithm that can be applied to classification problems with large class domains. In recent years, research on KNN has focused on static big data sets; however, in more and more scenarios KNN must process streaming data online. Streaming data is high-volume, continuous, fast, and difficult to store and recover, while the stream processing system Storm offers real-time and reliable processing. Taking both into account, a modified KNN is proposed that is implemented on Storm to classify streaming data online. The algorithm first partitions the whole sample set into multiple piece sets, then computes the K nearest neighbors of each to-be-classified vector on every piece set, and finally reduces these per-piece results to the overall K nearest neighbors, from which the vector is classified. Experimental results show that the proposed algorithm meets the requirements of high throughput, scalability, real-time performance, and accuracy for classifying streaming data in a big data setting.

Key words: Storm, K-Nearest Neighbor (KNN), streaming data, big data, data partition

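Abstract: KNN is a simple, effective classification algorithm that is easy to implement and is suitable for classification over large class domains. Recent research on KNN has leaned toward static big data sets, yet a growing number of applications need KNN to process streaming data online and in real time. Streaming data arrives in large volumes, continuously and quickly, and is hard to store and recover, whereas the stream processing system Storm handles streaming data in a real-time and reliable manner. On this basis, a Storm-based KNN classification algorithm for streaming data is proposed. It first partitions the entire sample set into multiple piece sets, then computes the K nearest neighbors of the to-be-classified vector on each piece set, and finally reduces the K nearest neighbors from all piece sets to the overall K nearest neighbors in order to classify the vector. Experimental results show that the Storm-based KNN classification algorithm for streaming data satisfies the high-throughput, scalability, real-time, and accuracy requirements of streaming data classification in a big data context.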

Key words: Storm, KNN algorithm, streaming data, big data, data partition
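
The partition, per-piece search, and reduce steps described in the abstract can be illustrated with a minimal single-process sketch. It is not the authors' Storm implementation and uses no Storm API. The function names (`partition`, `local_knn`, `reduce_knn`, `classify`), the round-robin partitioning, the Euclidean distance, and the majority vote are all assumptions made for illustration; in a Storm topology each piece set would presumably be held by a separate bolt task.

```python
import heapq
from collections import Counter
from math import dist  # Euclidean distance, Python 3.8+


def partition(samples, num_pieces):
    """Split the labelled sample set into piece sets (round-robin, an assumption)."""
    return [samples[i::num_pieces] for i in range(num_pieces)]


def local_knn(piece, query, k):
    """Return the k nearest (distance, label) pairs to the query within one piece set."""
    return heapq.nsmallest(k, ((dist(vec, query), label) for vec, label in piece))


def reduce_knn(local_results, k):
    """Merge the per-piece neighbour lists into the overall k nearest neighbours."""
    merged = (pair for result in local_results for pair in result)
    return heapq.nsmallest(k, merged)


def classify(samples, query, k, num_pieces):
    """Classify the query by majority vote over the overall k nearest neighbours."""
    pieces = partition(samples, num_pieces)
    # Each piece is searched independently; this loop stands in for parallel workers.
    local_results = [local_knn(piece, query, k) for piece in pieces]
    neighbours = reduce_knn(local_results, k)
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


# Hypothetical toy data: two well-separated classes in the plane.
samples = [
    ((0.0, 0.0), "a"), ((0.1, 0.2), "a"), ((0.2, 0.1), "a"),
    ((5.0, 5.0), "b"), ((5.1, 4.9), "b"), ((4.8, 5.2), "b"),
]
print(classify(samples, (0.3, 0.3), k=3, num_pieces=2))
```

The reduce step is exact because every one of the overall K nearest neighbors must also be among the K nearest within its own piece set, so merging the per-piece lists loses nothing compared with a search over the whole sample set.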