Computer Engineering and Applications, 2010, Vol. 46, Issue (15): 127-131. DOI: 10.3778/j.issn.1002-8331.2010.15.038

• Database, Signal and Information Processing •

Parallel k-means optimized by vertical dataset division

YIN Jian-jun1, WANG Le2

  1. School of Humanity and Information Management, Chengdu Medical College, Chengdu 610083, China
  2. College of Computer, National University of Defense Technology, Changsha 410073, China
  • Received: 2008-11-18 Revised: 2009-02-23 Online: 2010-05-21 Published: 2010-05-21
  • Contact: YIN Jian-jun

Abstract: To meet the efficiency requirements that large-scale text clustering places on clustering algorithms, this paper proposes FTVD (also written FTDV), a content-related vertical data partitioning strategy. Based on this strategy, it proposes a parallel clustering algorithm, DVP k-means, which raises the degree of parallelism of conventional parallel k-means and thereby improves execution efficiency. In experiments on two public datasets, DVP k-means is compared with conventional parallel k-means and with PDDP k-means, which is based on principal direction partitioning. DVP k-means shows better parallelism, adapts better to data scale, and can produce higher-quality clusters.

Key words: data partition, parallel clustering algorithm, frequent term set, k-means
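
For context on the baseline named in the abstract, a common formulation of conventional parallel k-means splits the documents row-wise across workers: each worker assigns its own points to the nearest centroid and returns per-cluster sums and counts, and a reduce step merges them into new centroids. The sketch below assumes that formulation. It is not DVP k-means or the vertical partitioning strategy, whose details the abstract does not give. The names `nearest`, `local_step`, and `parallel_kmeans` are hypothetical, and threads stand in for worker nodes.

```python
from concurrent.futures import ThreadPoolExecutor


def nearest(point, centroids):
    """Index of the centroid closest to point (squared Euclidean distance)."""
    dists = [sum((p - c) ** 2 for p, c in zip(point, cen)) for cen in centroids]
    return dists.index(min(dists))


def local_step(chunk, centroids):
    """Worker step: per-cluster coordinate sums and counts for one data chunk."""
    dim = len(centroids[0])
    sums = [[0.0] * dim for _ in centroids]
    counts = [0] * len(centroids)
    for point in chunk:
        j = nearest(point, centroids)
        counts[j] += 1
        for d, value in enumerate(point):
            sums[j][d] += value
    return sums, counts


def parallel_kmeans(points, centroids, n_workers=2, n_iters=10):
    """Row-partitioned k-means sketch; initial centroids come from the caller."""
    # Assumption: a simple round-robin split of the rows across workers.
    chunks = [points[i::n_workers] for i in range(n_workers)]
    centroids = [list(c) for c in centroids]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        for _ in range(n_iters):
            # Map: every worker processes only its own chunk.
            partials = list(pool.map(lambda ch: local_step(ch, centroids), chunks))
            # Reduce: merge partial sums and counts into new centroids.
            new_centroids = []
            for j, old in enumerate(centroids):
                count = sum(counts[j] for _, counts in partials)
                if count == 0:
                    new_centroids.append(old)  # keep an empty cluster's centroid
                    continue
                total = [sum(sums[j][d] for sums, _ in partials)
                         for d in range(len(old))]
                new_centroids.append([t / count for t in total])
            if new_centroids == centroids:
                break  # assignments are stable, so stop early
            centroids = new_centroids
    return centroids
```

In this layout every worker still handles all dimensions of its points and all workers synchronize once per iteration, which is the kind of cost a different partitioning of the data would aim to reduce.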

CLC Number: