Computer Engineering and Applications ›› 2012, Vol. 48 ›› Issue (18): 136-141.

• Database, Signal and Information Processing •

Text clustering algorithm based on concept and semantic similarity

焦芬芬   

  1. China Airborne Missile Academy, Luoyang, Henan 471009
  • Publication date: 2012-06-21 Release date: 2012-06-20

Clustering method based on concept and semantic similarity

JIAO Fenfen   

  1. AVIC China Airborne Missile Academy, Luoyang, Henan 471009, China
  • Online: 2012-06-21 Published: 2012-06-20

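Abstract: A clustering algorithm based on concept and semantic similarity, TCBCSS (Text Clustering Based on Concept and Semantic Similarity), is proposed. TCBCSS uses WordNet to extract and merge document concepts into a semantic network, analyzes that network using small-world theory and the network's geometric properties, and builds a concept list to represent each document. This not only effectively addresses the "expression difference" problem but also simplifies the computation of document similarity. TCBCSS uses the semantic similarity of two concept lists as the measure of closeness between documents and performs graph-based cluster analysis, which avoids the restrictions that some clustering algorithms place on cluster shape. Experiments show that TCBCSS improves clustering quality.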

Keywords: text clustering, concept, text representation, small-world theory, semantic similarity
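
The abstract above measures the closeness of two documents by the semantic similarity of their concept lists, but it does not give the formula. The following is only a minimal sketch of one way such a measure could look. It assumes a tiny hand-written hypernym tree (`PARENT`, hypothetical data standing in for WordNet), a Wu-Palmer-style concept similarity, and a symmetric best-match average over the two lists; none of these choices is confirmed to be the author's.

```python
# Toy hypernym tree standing in for WordNet (hypothetical data, not from the paper).
# Each concept maps to its parent; the root maps to None.
PARENT = {
    "entity": None,
    "animal": "entity",
    "vehicle": "entity",
    "dog": "animal",
    "cat": "animal",
    "car": "vehicle",
    "truck": "vehicle",
}


def path_to_root(concept):
    """Return the chain concept -> parent -> ... -> root."""
    path = []
    while concept is not None:
        path.append(concept)
        concept = PARENT[concept]
    return path


def depth(concept):
    """Depth of a concept in the tree; the root has depth 1."""
    return len(path_to_root(concept))


def concept_similarity(a, b):
    """Wu-Palmer-style similarity: 2 * depth(lcs) / (depth(a) + depth(b)).

    This is an assumed stand-in; the abstract does not name the measure used.
    """
    ancestors_b = set(path_to_root(b))
    # The first ancestor of `a` that is also an ancestor of `b` is the
    # lowest common subsumer (lcs).
    lcs = next(c for c in path_to_root(a) if c in ancestors_b)
    return 2.0 * depth(lcs) / (depth(a) + depth(b))


def list_similarity(list_a, list_b):
    """Symmetric average of best-match similarities between two concept lists."""
    if not list_a or not list_b:
        return 0.0
    a_to_b = sum(max(concept_similarity(a, b) for b in list_b)
                 for a in list_a) / len(list_a)
    b_to_a = sum(max(concept_similarity(b, a) for a in list_a)
                 for b in list_b) / len(list_b)
    return (a_to_b + b_to_a) / 2.0
```

With this sketch, two documents described by `["dog", "cat"]` and `["car", "truck"]` score lower than two documents that both talk about animals, even when they share no literal word, which is the kind of "expression difference" the concept-based representation is meant to absorb.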

Abstract: This paper introduces a document clustering method based on concept and semantic similarity, Text Clustering Based on Concept and Semantic Similarity (TCBCSS). Key concepts, rather than keywords, are extracted to form a semantic network. The semantic network is analyzed using the Six Degrees of Separation (small-world) theory and its geometric characteristics to build concept lists that represent each document. This not only resolves the problem of differences in expression, but also makes similarity computation more convenient. TCBCSS uses the semantic similarity of two concept lists as the measure of similarity between two documents and clusters the documents on the basis of a graph, which avoids the limitations that some clustering algorithms impose on cluster shape. Experimental results show that TCBCSS improves clustering quality.

Key words: text clustering, concept, text representation, Six Degrees of Separation, semantic similarity
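
The abstract states that clustering is performed on a graph so that no cluster shape is assumed, without specifying the procedure. As a hedged illustration only, the sketch below swaps in the simplest graph-based scheme: link two documents when their similarity reaches a threshold and take connected components as clusters. The function name `cluster_by_graph`, the `threshold` parameter, and the matrix `SIM` are hypothetical and are not taken from the paper.

```python
from collections import deque


def cluster_by_graph(similarity, threshold):
    """Cluster documents as connected components of a thresholded similarity graph.

    `similarity` is a symmetric n x n matrix (list of lists) of pairwise
    document similarities. This is an assumed, simplified stand-in for the
    graph-based step that the abstract mentions.
    """
    n = len(similarity)
    # Connect documents i and j when their similarity reaches the threshold.
    neighbors = [
        [j for j in range(n) if j != i and similarity[i][j] >= threshold]
        for i in range(n)
    ]
    seen = [False] * n
    clusters = []
    for start in range(n):
        if seen[start]:
            continue
        # Breadth-first search collects one connected component.
        seen[start] = True
        queue = deque([start])
        component = []
        while queue:
            node = queue.popleft()
            component.append(node)
            for nxt in neighbors[node]:
                if not seen[nxt]:
                    seen[nxt] = True
                    queue.append(nxt)
        clusters.append(sorted(component))
    return clusters


# Hypothetical pairwise similarities for four documents.
SIM = [
    [1.0, 0.8, 0.2, 0.1],
    [0.8, 1.0, 0.7, 0.1],
    [0.2, 0.7, 1.0, 0.1],
    [0.1, 0.1, 0.1, 1.0],
]

print(cluster_by_graph(SIM, 0.6))   # [[0, 1, 2], [3]]
print(cluster_by_graph(SIM, 0.75))  # [[0, 1], [2], [3]]
```

At threshold 0.6, documents 0 and 2 land in the same cluster although their direct similarity is only 0.2, because both are linked through document 1. Chain-shaped groups like this are what centroid-based methods tend to split, which is one reading of the abstract's remark about cluster shape.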