Computer Engineering and Applications, 2018, Vol. 54, Issue 12: 116-120. DOI: 10.3778/j.issn.1002-8331.1612-0277

• Pattern Recognition and Artificial Intelligence •

Chinese short text similarity computation based on hybrid strategy

SONG Dongyun, ZHENG Jin, ZHANG Zuping   

  1. School of Information Science and Engineering, Central South University, Changsha 410083, China
  • Online: 2018-06-15  Published: 2018-07-03

Abstract: To improve the accuracy of Chinese short text similarity computation, this paper proposes a new method based on a hybrid strategy. First, hierarchical clustering over the semantic distances between words is used to build a short text clustering binary tree; this improves the traditional Vector Space Model (VSM) and yields a keyword-weighted text similarity. Second, the traditional syntax-semantic model is improved by extracting the main (backbone) components of each sentence, which gives the semantic similarity of the text backbone. Finally, the two similarities are combined by weighting to obtain the final text similarity. Experimental results show that the proposed method is more accurate for short text similarity computation and agrees more closely with human subjective judgment.

Key words: short text similarity, keyword weight, hierarchical clustering, binary tree, main components
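
The abstract names two building blocks that a short sketch can make concrete: a binary tree obtained by hierarchical clustering of words, and a weighted combination of two similarity scores. The Python below is only an illustration under stated assumptions, not the authors' implementation. The abstract does not give the linkage criterion, the weighting formula, or how the tree feeds into the VSM weights, so single linkage, the nested-tuple tree, the weight `alpha`, and the function names `build_cluster_tree` and `hybrid_similarity` are all hypothetical.

```python
from itertools import combinations


def build_cluster_tree(words, distance):
    """Agglomerative clustering that returns a binary tree as nested tuples.

    Assumption: single linkage (distance between clusters is the distance
    between their closest members). `distance` is any caller-supplied
    symmetric function over two words.
    """
    if not words:
        raise ValueError("words must not be empty")
    # Each cluster is (tree, members); a leaf tree is just the word itself.
    clusters = [(w, [w]) for w in words]
    while len(clusters) > 1:
        # Pick the two clusters whose closest members are nearest.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: min(
                distance(a, b)
                for a in clusters[p[0]][1]
                for b in clusters[p[1]][1]
            ),
        )
        merged = (
            (clusters[i][0], clusters[j][0]),
            clusters[i][1] + clusters[j][1],
        )
        # Replace the two merged clusters with their parent node.
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters[0][0]


def hybrid_similarity(sim_keyword, sim_backbone, alpha=0.5):
    """Weighted sum of the two similarity scores.

    Assumption: a convex combination with weight `alpha` in [0, 1];
    the abstract only says the two similarities are weighted.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * sim_keyword + (1.0 - alpha) * sim_backbone
```

With a toy distance in which two words are close and a third is far from both, the tree groups the close pair first and attaches the remaining word at the root. In practice `distance` would come from a lexical resource or word embeddings, and `alpha` would be tuned against human similarity ratings.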