Border distance based multi-vector document clustering method

Computer Engineering and Applications, 2008, Vol. 44, Issue 3: 198-201.

• Database and Information Processing •


CAI Dong-feng, WANG Zhi-chao, JI Duo, ZHANG Gui-ping

Natural Language Processing Research Laboratory, Shenyang Institute of Aeronautical Engineering, Shenyang 110034, China

Received: 1900-01-01 Revised: 1900-01-01 Online: 2008-01-21 Published: 2008-01-21
Contact: CAI Dong-feng


Abstract

Document clustering is an important task in natural language processing and is widely applied in areas such as information retrieval and Web mining. Document representation and the clustering algorithm are its key issues. To improve the precision of inter-cluster distance calculation, this paper puts forward a novel border distance based hierarchical clustering approach, which takes the average distance between the documents at the borders of two clusters as the distance between that pair of clusters and thereby exploits the clusters' border information. To account for the contributions of terms with different parts of speech, documents are represented by a multi-vector model. Experimental results on different corpora show that the proposed approach outperforms other widely used hierarchical clustering methods.

Key words: distance computation, document representation, multi-vector, document clustering
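
The idea summarized in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not say how border documents are selected or how the vectors are combined, so this sketch assumes that the border consists of the k closest cross-cluster document pairs, that each document holds one vector per term category (the keys "noun" and "verb" are hypothetical), and that per-category cosine distances are combined by a weighted average. The names cosine_distance, multi_vector_distance, border_distance and agglomerate are likewise illustrative.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 1.0  # treat an empty vector as maximally distant
    return 1.0 - dot / (norm_u * norm_v)


def multi_vector_distance(doc_a, doc_b, weights):
    """Weighted average of per-category cosine distances.

    Each document is a dict mapping a term category to a vector.
    The weighting scheme is an assumption, not taken from the paper.
    """
    total = sum(weights.values())
    return sum(
        w * cosine_distance(doc_a[cat], doc_b[cat]) for cat, w in weights.items()
    ) / total


def border_distance(cluster_a, cluster_b, weights, k=2):
    """Average of the k smallest cross-cluster document distances.

    Assumption: the k closest cross-cluster pairs stand in for the
    "documents at the border" mentioned in the abstract.
    """
    dists = sorted(
        multi_vector_distance(a, b, weights) for a in cluster_a for b in cluster_b
    )
    k = min(k, len(dists))
    return sum(dists[:k]) / k


def agglomerate(docs, weights, n_clusters, k=2):
    """Bottom-up clustering that repeatedly merges the closest pair of clusters."""
    clusters = [[d] for d in docs]
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = border_distance(clusters[i], clusters[j], weights, k)
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Toy documents with two hypothetical term categories.
d1 = {"noun": [1.0, 0.0], "verb": [1.0, 0.0]}
d2 = {"noun": [0.9, 0.1], "verb": [1.0, 0.0]}
d3 = {"noun": [0.0, 1.0], "verb": [0.0, 1.0]}
d4 = {"noun": [0.1, 0.9], "verb": [0.0, 1.0]}
weights = {"noun": 0.7, "verb": 0.3}
result = agglomerate([d1, d2, d3, d4], weights, n_clusters=2)
```

With k=1 the border distance reduces to single linkage, and with k equal to the number of cross-cluster pairs it reduces to average linkage, so k controls how much of each cluster's edge is taken into account.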

CAI Dong-feng, WANG Zhi-chao, JI Duo, ZHANG Gui-ping. Border distance based multi-vector document clustering method[J]. Computer Engineering and Applications, 2008, 44(3): 198-201.


