Research on text similarity calculation strategy based on semantic combination of keywords

Computer Engineering and Applications, 2016, Vol. 52, Issue (19): 90-93.

ZHOU Lijie1, YU Weihai2, GUO Cheng3

1. Electronic Teaching Center, Yantai Vocational College, Yantai, Shandong 264670, China
2. Yantai Normal Language Teaching Center, Yantai, Shandong 264670, China
3. School of Software Technology, Dalian University of Technology, Dalian, Liaoning 116620, China

Online: 2016-10-01 Published: 2016-11-18

Abstract

Abstract: Similarity comparison between texts relies mainly on keyword matching and lacks an in-depth analysis of the combination relationships among keywords. Focusing on these combination characteristics, this paper holds that the more keywords appear combined in the same order, the more they contribute to the similarity between texts. It proposes a non-linear semantic relevance function based on the number of combined keywords and, building on LCS, extracts all keyword combination blocks from the texts. The resulting similarity method, which incorporates keyword combination relationships, is applied to short-text similarity comparison on the Microsoft Research Paraphrase corpus (MSRP). Experimental results show that both the accuracy and the F1 value of short-text similarity calculation improve, with the improvement in F1 being the more pronounced.

Key words: combination of keywords, non-linear semantic relevance, semantic relevance function, text similarity
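The abstract does not give the relevance function or the block-extraction procedure, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes that LCS means longest common subsequence, that every whitespace token counts as a keyword (no stop-word filtering), and that the non-linear function is a simple power `k ** alpha` of the block length `k`. The names `lcs_pairs`, `combination_blocks`, `block_similarity`, and the parameter `alpha` are hypothetical and chosen for illustration.

```python
from typing import List, Tuple


def lcs_pairs(a: List[str], b: List[str]) -> List[Tuple[int, int]]:
    """Return index pairs (i, j) of one longest common subsequence of a and b."""
    n, m = len(a), len(b)
    # dp[i][j] holds the LCS length of the suffixes a[i:] and b[j:].
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if a[i] == b[j]:
                dp[i][j] = dp[i + 1][j + 1] + 1
            else:
                dp[i][j] = max(dp[i + 1][j], dp[i][j + 1])
    # Walk forward through the table to recover one optimal alignment.
    pairs = []
    i = j = 0
    while i < n and j < m:
        if a[i] == b[j]:
            pairs.append((i, j))
            i += 1
            j += 1
        elif dp[i + 1][j] >= dp[i][j + 1]:
            i += 1
        else:
            j += 1
    return pairs


def combination_blocks(a: List[str], b: List[str]) -> List[List[str]]:
    """Group LCS matches that are adjacent in both texts into blocks."""
    blocks: List[List[Tuple[int, int]]] = []
    for i, j in lcs_pairs(a, b):
        if blocks and blocks[-1][-1] == (i - 1, j - 1):
            blocks[-1].append((i, j))  # extends the current ordered run
        else:
            blocks.append([(i, j)])    # starts a new block
    return [[a[i] for i, _ in block] for block in blocks]


def block_similarity(a: List[str], b: List[str], alpha: float = 2.0) -> float:
    """Assumed scoring: longer blocks count super-linearly (k ** alpha, alpha >= 1)."""
    if not a or not b:
        return 0.0
    score = sum(len(block) ** alpha for block in combination_blocks(a, b))
    # Normalise by the score of a single block spanning the longer text.
    return score / max(len(a), len(b)) ** alpha


s1 = "the cat sat on the mat".split()
s2 = "the cat lay on the mat".split()
print(combination_blocks(s1, s2))
print(block_similarity(s1, s2))
```

With `alpha = 1` the score reduces to the plain LCS length ratio, so the exponent is the only place where the ordering of combined keywords adds weight beyond ordinary matching. For `alpha >= 1` the score stays in the range 0 to 1, and identical texts score exactly 1.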

ZHOU Lijie, YU Weihai, GUO Cheng. Research on text similarity calculation strategy based on semantic combination of keywords[J]. Computer Engineering and Applications, 2016, 52(19): 90-93.
