Computer Engineering and Applications ›› 2008, Vol. 44 ›› Issue (3): 198-201.

• Database and Information Processing •

Border distance based multi-vector document clustering method

CAI Dong-feng (蔡东风), WANG Zhi-chao (王智超), JI Duo (季铎), ZHANG Gui-ping (张桂平)

  1. Natural Language Processing Research Laboratory, Shenyang Institute of Aeronautical Engineering, Shenyang 110034, China
  • Received: 1900-01-01 Revised: 1900-01-01 Online: 2008-01-21 Published: 2008-01-21
  • Contact: CAI Dong-feng

Abstract: Document clustering is an important task in natural language processing and is widely applied in areas such as information retrieval and web mining. Its key issues are document representation and the clustering algorithm. Building on hierarchical clustering, this paper proposes a novel border distance based hierarchical clustering approach that takes the average of the distances between documents at the borders of two clusters as the distance between that pair of clusters, thereby exploiting the clusters' border information and improving the accuracy of inter-cluster distance calculation. To account for the contributions of terms with different parts of speech, documents are represented with a multi-vector model. Experimental results on different corpora show that the proposed approach outperforms other widely used hierarchical clustering methods.

Key words: distance computation, document representation, multi-vector, document clustering
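The abstract does not spell out how border documents are chosen or how the per-vector distances are combined, so the following is only a minimal sketch under stated assumptions, not the authors' implementation. It assumes that (a) each document is a set of named vectors (the hypothetical names `noun` and `verb` stand in for part-of-speech groups), (b) document distance is a weighted sum of per-vector cosine distances with hypothetical weights, and (c) the "border" of two clusters is approximated by their `k` closest cross-cluster document pairs, whose distances are averaged. Under assumption (c), `k = 1` reduces to single linkage and `k` equal to the number of cross-cluster pairs reduces to average linkage.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity; 1.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)


def doc_distance(d1, d2, weights):
    """Multi-vector distance (assumed form): weighted sum of the cosine
    distances between same-named vectors of two documents."""
    return sum(w * cosine_distance(d1[name], d2[name])
               for name, w in weights.items())


def border_distance(cluster_a, cluster_b, dist, k=2):
    """Assumed border distance: mean of the k smallest distances between
    a member of cluster_a and a member of cluster_b."""
    pair_dists = sorted(dist(x, y) for x in cluster_a for y in cluster_b)
    nearest = pair_dists[:k]
    return sum(nearest) / len(nearest)


def agglomerate(docs, dist, n_clusters, k=2):
    """Bottom-up clustering: repeatedly merge the two clusters with the
    smallest border distance until n_clusters remain.
    Returns clusters as lists of document indices."""
    clusters = [[i] for i in range(len(docs))]

    def index_dist(i, j):
        return dist(docs[i], docs[j])

    while len(clusters) > n_clusters:
        # Pick the pair (i, j), i < j, with the smallest border distance.
        _, i, j = min(
            (border_distance(clusters[i], clusters[j], index_dist, k), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # j > i, so index i is unaffected
    return clusters


# Toy usage: four documents, each with a "noun" and a "verb" vector.
docs = [
    {"noun": [1.0, 0.0], "verb": [1.0, 0.0]},
    {"noun": [0.9, 0.1], "verb": [1.0, 0.0]},
    {"noun": [0.0, 1.0], "verb": [0.0, 1.0]},
    {"noun": [0.1, 0.9], "verb": [0.0, 1.0]},
]
weights = {"noun": 0.6, "verb": 0.4}  # hypothetical weights
clusters = agglomerate(docs, lambda a, b: doc_distance(a, b, weights), 2)
print(clusters)
```

This brute-force version recomputes every cross-cluster distance at each merge, which is fine for illustration but would need a cached distance matrix for corpora of realistic size.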