K-means algorithm of random sample based on MapReduce

Computer Engineering and Applications, 2016, Vol. 52, Issue (8): 74-79.

WANG Yonggui, WU Chao, DAI Wei

Liaoning Technical University, Huludao, Liaoning 125105, China

Online: 2016-04-15 Published: 2016-04-19

Abstract

Abstract: When the K-means algorithm processes massive data, it is prone to memory overflow. Implementing K-means on the MapReduce framework solves this problem, but the clustering results remain unstable and the accuracy is low. This paper proposes an improved algorithm: when K-means is implemented on the MapReduce framework, it draws multiple random samples and selects better initial cluster centers by computing density, distance, and squared error, and it adopts a new method for computing center points during iteration. Experimental results show that the improved algorithm achieves good stability, accuracy, and speedup.

Key words: K-means, random sampling, massive data, MapReduce
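
The general idea in the abstract can be illustrated with a minimal single-machine sketch. It is not the authors' implementation: the abstract does not specify the density and distance criteria or the new center-point formula, so this sketch assumes the simplest stand-ins, namely choosing among several random samples by squared error alone and updating centers with the plain mean. All function names and parameters (`pick_initial_centers`, `n_samples`, `sample_size`, and so on) are hypothetical, and the map and reduce phases are ordinary Python functions rather than Hadoop jobs.

```python
import random
from collections import defaultdict


def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def nearest(point, centers):
    """Index of the center closest to the point."""
    return min(range(len(centers)), key=lambda i: sq_dist(point, centers[i]))


def sse(points, centers):
    """Sum of squared errors of the points to their nearest centers."""
    return sum(sq_dist(p, centers[nearest(p, centers)]) for p in points)


def pick_initial_centers(points, k, n_samples=5, sample_size=20, seed=0):
    """Draw several random samples and keep the candidate centers whose
    squared error on their own sample is lowest.
    Assumption: the paper also uses density and distance; omitted here."""
    rng = random.Random(seed)
    best, best_err = None, float("inf")
    for _ in range(n_samples):
        sample = rng.sample(points, min(sample_size, len(points)))
        candidate = rng.sample(sample, k)  # requires k <= len(sample)
        err = sse(sample, candidate)
        if err < best_err:
            best, best_err = candidate, err
    return best


def map_phase(points, centers):
    """Map: emit (cluster id, point) for every input point."""
    for p in points:
        yield nearest(p, centers), p


def reduce_phase(pairs, centers):
    """Reduce: average the points of each cluster into a new center.
    A cluster that received no points keeps its previous center."""
    groups = defaultdict(list)
    for cid, p in pairs:
        groups[cid].append(p)
    new_centers = list(centers)
    for cid, members in groups.items():
        new_centers[cid] = tuple(sum(c) / len(members) for c in zip(*members))
    return new_centers


def kmeans(points, k, iterations=10, seed=0):
    """Iterate map and reduce until the centers stop changing."""
    centers = pick_initial_centers(points, k, seed=seed)
    for _ in range(iterations):
        updated = reduce_phase(map_phase(points, centers), centers)
        if updated == centers:
            break
        centers = updated
    return centers
```

In a real MapReduce deployment the map phase would run in parallel over data splits and the current centers would be distributed to every mapper; that distribution step is outside the scope of this sketch.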

WANG Yonggui, WU Chao, DAI Wei. K-means algorithm of random sample based on MapReduce[J]. Computer Engineering and Applications, 2016, 52(8): 74-79.
