Computer Engineering and Applications, 2011, Vol. 47, Issue (5): 118-122.

• Database, Signal and Information Processing •

Design and implementation on Web clustering system based on Nutch

YANG Xiaolan (阳小兰), QIAN Cheng (钱程), ZHAO Haiting (赵海廷)

  1. College of Information Engineering, Wuhan University of Science and Technology Zhongnan Branch, Wuhan 430223, China
  • Received: 1900-01-01 Revised: 1900-01-01 Online: 2011-02-11 Published: 2011-02-11

Abstract: A search results clustering system is designed that clusters the results returned by Nutch in both Chinese and English language environments. The system is based on the k-means algorithm and the suffix tree clustering algorithm, and consists of a Nutch search engine module, a text segmentation module, a TF-IDF weight calculation module, and a text clustering module. The k-means algorithm and the suffix tree clustering algorithm are compared experimentally.

Key words: Nutch, clustering, k-means, suffix tree
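
To make the TF-IDF weighting and k-means stages named in the abstract concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes the documents are already tokenized (for Chinese text a word segmenter would run first), uses the common tf × log(N/df) weighting, and seeds k-means with the first k vectors so the result is deterministic. The function names and the sample snippets are hypothetical. Suffix tree clustering is not shown.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Turn non-empty tokenized documents into TF-IDF vectors over a shared vocabulary."""
    vocab = sorted({term for doc in docs for term in doc})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append(
            [(counts[t] / len(doc)) * math.log(n_docs / df[t]) for t in vocab]
        )
    return vocab, vectors


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, k, iterations=20):
    """Plain k-means; assumption: the first k vectors are the initial centroids."""
    centroids = [list(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: attach each vector to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(v, centroids[c]))
            for v in vectors
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Hypothetical, already-tokenized search-result snippets.
docs = [
    ["nutch", "crawler", "index"],
    ["suffix", "tree", "phrase"],
    ["nutch", "index", "search"],
    ["suffix", "tree", "cluster"],
]
vocab, vectors = tfidf_vectors(docs)
labels = kmeans(vectors, k=2)
print(labels)
```

With these sample snippets the two Nutch-related documents end up in one cluster and the two suffix-tree documents in the other. In practice the choice of k and of the initial centroids strongly affects k-means output.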