Computer Engineering and Applications ›› 2009, Vol. 45 ›› Issue (13): 160-162. DOI: 10.3778/j.issn.1002-8331.2009.13.046

• Database, Signal and Information Processing •

Research on documents fuzzy clustering approach using similarity measure

GUO Jian-yong, CAI Yong, ZHEN Yan-xia

  1. School of Information Technology, Jiangnan University, Wuxi, Jiangsu 214122, China
  • Received: 2008-03-07 Revised: 2008-06-10 Online: 2009-05-01 Published: 2009-05-01
  • Contact: GUO Jian-yong

Abstract: Searching for similar documents plays a crucial role in document management. This paper develops a fast, high-quality method, based on fuzzy clustering, for finding similar documents in large document collections. Conventional approaches mostly determine similarity by word-by-word comparison. The proposed method instead passes documents through a two-layer structure: predefined fuzzy clusters are used to extract feature vectors for related documents, and similarity is then estimated from these vectors with a newly proposed distance-based similarity measure. Experimental results indicate that the approach is feasible and efficient, and that it performs better than conventional similarity measurement systems.

Key words: document clustering, document similarity, fuzzy clustering, cluster
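
The two-layer idea described in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the paper's formulas, so everything concrete here is an assumption made for illustration: the first layer uses the standard fuzzy c-means membership formula against predefined cluster centroids, and the second layer converts the Euclidean distance between membership vectors into a similarity via 1 / (1 + d). The function names, the fuzzifier m = 2 and the toy vectors are all hypothetical and are not taken from the paper.

```python
import math


def fuzzy_memberships(doc_vec, centroids, m=2.0):
    """Layer 1 (assumed): fuzzy c-means style membership of one document
    in each predefined cluster. The memberships sum to 1."""
    dists = [math.dist(doc_vec, c) for c in centroids]
    # A document lying exactly on a centroid belongs fully to that cluster.
    if any(d == 0.0 for d in dists):
        hits = [1.0 if d == 0.0 else 0.0 for d in dists]
        total = sum(hits)
        return [h / total for h in hits]
    exponent = 2.0 / (m - 1.0)
    inverse = [d ** -exponent for d in dists]
    total = sum(inverse)
    return [v / total for v in inverse]


def similarity(features_a, features_b):
    """Layer 2 (assumed): distance-based similarity in (0, 1],
    computed on membership vectors rather than on raw words."""
    return 1.0 / (1.0 + math.dist(features_a, features_b))


# Hypothetical predefined clusters over a three-term vocabulary.
centroids = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]

# Toy term-weight vectors: doc_a and doc_b resemble the first cluster,
# doc_c resembles the second.
doc_a = [0.9, 0.1, 0.0]
doc_b = [0.8, 0.2, 0.1]
doc_c = [0.1, 0.9, 0.0]

feat_a = fuzzy_memberships(doc_a, centroids)
feat_b = fuzzy_memberships(doc_b, centroids)
feat_c = fuzzy_memberships(doc_c, centroids)

sim_ab = similarity(feat_a, feat_b)
sim_ac = similarity(feat_a, feat_c)
print(f"sim(a, b) = {sim_ab:.3f}, sim(a, c) = {sim_ac:.3f}")
```

Because documents are compared through short membership vectors (one entry per predefined cluster) instead of word by word, each comparison costs time proportional to the number of clusters rather than to the vocabulary size, which is one plausible reading of why such a scheme would scale to large collections.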