Computer Engineering and Applications (计算机工程与应用) ›› 2016, Vol. 52 ›› Issue (19): 90-93.

• Big Data and Cloud Computing •

基于词项语义组合的文本相似度计算方法研究

周丽杰1,于伟海2,郭  成3   

    1.烟台职业学院 电教中心,山东 烟台 264670
    2.烟台市普通话培训测试中心,山东 烟台 264670
    3.大连理工大学 软件学院,辽宁 大连 116620
  • 出版日期:2016-10-01 发布日期:2016-11-18

Research on text similarity calculation strategy based on semantic combination of keywords

ZHOU Lijie1, YU Weihai2, GUO Cheng3   

    1.Electronic Teaching Center, Yantai Vocational College, Yantai, Shandong 264670, China
    2.Yantai Normal Language Teaching Center, Yantai, Shandong 264670, China
    3.School of Software Technology, Dalian University of Technology, Dalian, Liaoning 116620, China
  • Online:2016-10-01 Published:2016-11-18

摘要: 文本之间在相似度比较时主要考虑关键词的匹配特性,缺乏对关键词间组合关系的深入分析。针对关键词间组合特性,按序组合的关键词数目越大,对文本之间相似度贡献越大,并提出基于关键词组合数目的非线性语义关联性函数,在LCS基础上提取文本中所有关键词组合块。将这种结合关键词组合关系的相似度比较方法运用于短文本的相似度比较中,数据采用微软语义释义语料库,实验结果表明,短文本相似度计算的准确率和F1值都有了提高,其中F1值的提高较为明显。

关键词: 关键词组合, 非线性语义关联, 语义关联函数, 文本相似度

Abstract: Similarity comparison between texts relies mainly on keyword matching and lacks an in-depth analysis of the combination relationships among keywords. This paper addresses that combination property: the more keywords that appear together in the same order, the greater their contribution to the similarity between two texts. A non-linear semantic relevance function based on the number of keywords in a combination is proposed, and all keyword combination blocks in a text are extracted on the basis of LCS. The method is applied to short-text similarity comparison and evaluated on the Microsoft Research Paraphrase Corpus (MSRP). The experimental results show that both accuracy and F1 improve for short-text similarity calculation, with the improvement in F1 being the more pronounced.

Key words: combination of keywords, non-linear semantic relevance, semantic relevance function, text similarity
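
The idea summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the form of the non-linear semantic relevance function, so the power weighting `k ** alpha`, the parameter `alpha`, the normalization, and the function names `lcs_pairs`, `block_lengths`, and `block_similarity` are all assumptions made for illustration. The sketch also assumes that LCS means the longest common subsequence, treats every token as a keyword, and extracts blocks from a single LCS alignment rather than from all possible ones.

```python
from typing import List, Tuple


def lcs_pairs(a: List[str], b: List[str]) -> List[Tuple[int, int]]:
    """Return index pairs (i, j) of one longest common subsequence of a and b."""
    n, m = len(a), len(b)
    # dp[i][j] holds the LCS length of the suffixes a[i:] and b[j:].
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if a[i] == b[j]:
                dp[i][j] = dp[i + 1][j + 1] + 1
            else:
                dp[i][j] = max(dp[i + 1][j], dp[i][j + 1])
    # Walk forward through the table to recover one alignment.
    pairs = []
    i = j = 0
    while i < n and j < m:
        if a[i] == b[j]:
            pairs.append((i, j))
            i += 1
            j += 1
        elif dp[i + 1][j] >= dp[i][j + 1]:
            i += 1
        else:
            j += 1
    return pairs


def block_lengths(pairs: List[Tuple[int, int]]) -> List[int]:
    """Group aligned pairs into blocks that are contiguous in both texts."""
    lengths = []
    run = 0
    prev = None
    for i, j in pairs:
        if prev is not None and i == prev[0] + 1 and j == prev[1] + 1:
            run += 1  # the current block continues
        else:
            if run:
                lengths.append(run)
            run = 1  # a new block starts here
        prev = (i, j)
    if run:
        lengths.append(run)
    return lengths


def block_similarity(a: List[str], b: List[str], alpha: float = 2.0) -> float:
    """Score two token lists so that longer ordered blocks count more.

    Assumption: a block of k keywords is weighted k ** alpha with alpha > 1,
    a hypothetical stand-in for the paper's non-linear relevance function.
    """
    if not a or not b:
        return 0.0
    score = sum(k ** alpha for k in block_lengths(lcs_pairs(a, b)))
    # Normalize so that identical texts score 1.0.
    norm = (len(a) ** alpha * len(b) ** alpha) ** 0.5
    return score / norm
```

With `alpha = 2.0`, comparing `"a b c d"` with `"a b x c d"` yields two blocks of length 2 and a score of 0.4, whereas comparing it with `"a x b y c z d"` yields four blocks of length 1 and a score of about 0.143. Both pairs share the same four words in the same order, so the gap comes entirely from how the shared words are grouped into blocks.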