Computer Engineering and Applications ›› 2014, Vol. 50 ›› Issue (5): 33-38.

• Doctoral Forum •

Text similarity calculation based on language network and semantic information

ZHAN Zhijian (詹志建), YANG Xiaoping (杨小平)

  1. Department of Computer Science, School of Information, Renmin University of China, Beijing 100872, China
  • Online: 2014-03-01 Published: 2015-05-12

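Abstract (translated from the Chinese): After analyzing the shortcomings of existing text similarity measures based on statistics and on semantic analysis, this paper proposes a new text similarity calculation method based on language networks and the semantic information of terms. A language network is built for each text, a comprehensive feature value is computed for every network node, and the top-ranked proportion of feature words is selected to represent the text, which effectively reduces the dimensionality of the text representation. The similarity between two texts is then computed from the similarities between their top-ranked feature words and from the share of the comprehensive feature value that each of these words accounts for. Clustering experiments on a data set show that, measured by F-measure, the proposed similarity outperforms both the traditional TF-IDF method and another similarity measure based on term semantic information.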

Key words (translated from the Chinese): language network, text clustering, text similarity, word similarity
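
The pipeline outlined in the abstract (build a language network, score its nodes, keep the top proportion of feature words, then compare texts) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the network is a sliding-window co-occurrence graph, node degree stands in for the "comprehensive feature value", `exact_match` stands in for a real term-level semantic similarity, and the function names `build_network`, `top_features`, and `text_similarity` are hypothetical.

```python
import math
from collections import defaultdict


def build_network(tokens, window=2):
    """Assumed network: link each token to the next `window` tokens."""
    neighbors = defaultdict(set)
    for i, word in enumerate(tokens):
        for other in tokens[i + 1:i + 1 + window]:
            if other != word:
                neighbors[word].add(other)
                neighbors[other].add(word)
    return neighbors


def top_features(tokens, ratio=0.5, window=2):
    """Keep the top `ratio` of nodes and return {word: share of kept score}.

    Node degree is a stand-in for the paper's comprehensive feature value.
    """
    neighbors = build_network(tokens, window)
    ranked = sorted(neighbors, key=lambda w: (-len(neighbors[w]), w))
    keep = ranked[:max(1, math.ceil(len(ranked) * ratio))]
    total = sum(len(neighbors[w]) for w in keep)
    return {w: len(neighbors[w]) / total for w in keep}


def exact_match(a, b):
    """Placeholder word similarity; a semantic measure would go here."""
    return 1.0 if a == b else 0.0


def _directed(features_a, features_b, word_sim):
    # Each feature word of A contributes its share times its best match in B.
    return sum(
        share * max((word_sim(w, v) for v in features_b), default=0.0)
        for w, share in features_a.items()
    )


def text_similarity(tokens_a, tokens_b, ratio=0.5, word_sim=exact_match):
    """Symmetric, share-weighted similarity between two token lists."""
    fa = top_features(tokens_a, ratio)
    fb = top_features(tokens_b, ratio)
    return 0.5 * (_directed(fa, fb, word_sim) + _directed(fb, fa, word_sim))


doc_a = "text similarity language network text clustering".split()
doc_b = "image retrieval color histogram image matching".split()
print(sorted(top_features(doc_a)))
print(text_similarity(doc_a, doc_a), text_similarity(doc_a, doc_b))
```

Because only the top-ranked words are compared, the cost of the pairwise word similarity step depends on the number of selected feature words rather than on the full vocabulary, which is the dimensionality reduction the abstract refers to. Passing a different `word_sim` function is how a semantic resource would be plugged in.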

Abstract: To address the shortcomings of traditional text similarity methods based on word-frequency statistics and on the semantic information of words, this paper proposes a new text similarity calculation based on language networks and word semantic information. The method extracts feature items according to the feature values of the word nodes in a document's language network. It considers both the importance of the feature items and the semantic relations among them, and constructs a semantic network of document feature items to calculate the similarity of documents. Finally, several K-means clustering methods are used to evaluate the performance of the new document similarity. Experimental results show that the method achieves a higher F-measure than the compared methods, which indicates that the proposed method is effective.

Key words: language network, text clustering, text similarity, term semantic similarity
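
The abstract reports clustering quality as an F-measure. The sketch below shows one common definition of the clustering F-measure: each true class is matched with the cluster that gives it the best F-score, and the results are averaged with class-size weights. Whether the paper uses exactly this variant is an assumption, and the function name `clustering_f_measure` is hypothetical.

```python
from collections import Counter


def clustering_f_measure(true_labels, cluster_ids):
    """Class-size-weighted best-match F-measure of a clustering."""
    n = len(true_labels)
    class_sizes = Counter(true_labels)
    cluster_sizes = Counter(cluster_ids)
    overlap = Counter(zip(true_labels, cluster_ids))

    total = 0.0
    for cls, n_i in class_sizes.items():
        best = 0.0
        for clu, n_j in cluster_sizes.items():
            n_ij = overlap[(cls, clu)]
            if n_ij:
                precision = n_ij / n_j  # share of the cluster in this class
                recall = n_ij / n_i     # share of the class in this cluster
                f_score = 2 * precision * recall / (precision + recall)
                best = max(best, f_score)
        total += (n_i / n) * best
    return total


labels = ["sports", "sports", "finance", "finance"]
print(clustering_f_measure(labels, [0, 0, 1, 1]))  # perfect clustering
print(clustering_f_measure(labels, [0, 0, 0, 0]))  # everything in one cluster
```

A value of 1.0 means every class is recovered exactly by some cluster; lumping all documents into one cluster keeps recall at 1 but lowers precision, so the score drops.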