Chinese paragraph similarity calculated based on weighted bipartite graph match

doi:10.3778/j.issn.1002-8331.1603-0302

Computer Engineering and Applications, 2017, Vol. 53, Issue (18): 95-101.

ZHANG Shaoyang, CAO Jiabo, WANG Zifan, QU Weidong

School of Information Engineering, Chang’an University, Xi’an 710064, China

Online: 2017-09-15 Published: 2017-09-29

Abstract

Word-frequency-based statistical methods, typified by the traditional Vector Space Model (VSM), achieve low accuracy when computing the similarity of Chinese paragraphs. To address this, this paper proposes a method for computing Chinese paragraph similarity based on weighted bipartite graph matching. The computation is divided into two levels, paragraphs and sentences. A sentence is treated as a simple paragraph, so its similarity is also computed by bipartite graph matching. First, a sentence main-word extraction algorithm extracts the main (backbone) words of each sentence. These words serve as the vertices of a bipartite graph, the similarity between main words serves as the weight of the edges between vertices, and sentence similarity is computed from the matching. Second, sentences serve as the vertices of a weighted bipartite graph, the similarity between sentences serves as the edge weight, and paragraph similarity is computed from that matching. Experimental results show that, compared with VSM, the method is considerably more accurate, because it identifies synonyms accurately and automatically matches similar words that occur at different positions in the two paragraphs.

Key words: paragraph similarity, sentence main-word extraction, bipartite graph matching, vector space model, Chinese word segmentation
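The two-level scheme described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The synonym table, the 0.8 synonym score, the normalisation by the larger side, and all function names are assumptions made for illustration. Main-word extraction is assumed to have already been done, so each sentence is given as a list of main words. The matching is found by brute force over assignments, which is only practical for very small inputs; a Kuhn-Munkres style solver would normally be used instead. A plain term-frequency cosine is included as a simple stand-in for the VSM baseline.

```python
from collections import Counter
from itertools import permutations
from math import sqrt

# Hypothetical synonym pairs; a real system would use a lexical resource.
SYNONYMS = {frozenset({"电脑", "计算机"}), frozenset({"购买", "买"})}


def word_sim(a, b):
    """Toy symmetric word similarity: 1.0 if equal, 0.8 if listed synonyms."""
    if a == b:
        return 1.0
    return 0.8 if frozenset({a, b}) in SYNONYMS else 0.0


def matching_score(left, right, sim):
    """Best total weight of a one-to-one matching, found by brute force.

    `sim` is assumed symmetric. The total is divided by the size of the
    larger side so that unmatched items lower the score (an assumption).
    """
    if not left or not right:
        return 0.0
    small, large = (left, right) if len(left) <= len(right) else (right, left)
    best = 0.0
    for perm in permutations(range(len(large)), len(small)):
        total = sum(sim(small[i], large[j]) for i, j in enumerate(perm))
        best = max(best, total)
    return best / len(large)


def sentence_sim(s1, s2):
    """Sentence level: vertices are main words, weights are word similarities."""
    return matching_score(s1, s2, word_sim)


def paragraph_sim(p1, p2):
    """Paragraph level: vertices are sentences, weights are sentence similarities."""
    return matching_score(p1, p2, sentence_sim)


def vsm_cosine(words1, words2):
    """Plain term-frequency cosine, used here as a simple VSM-style baseline."""
    c1, c2 = Counter(words1), Counter(words2)
    dot = sum(c1[w] * c2[w] for w in c1)
    norm = sqrt(sum(v * v for v in c1.values())) * sqrt(sum(v * v for v in c2.values()))
    return dot / norm if norm else 0.0


# Two paragraphs with the same meaning, synonyms swapped and sentences reordered.
p1 = [["我", "购买", "电脑"], ["天气", "晴朗"]]
p2 = [["天气", "晴朗"], ["我", "买", "计算机"]]

print(round(paragraph_sim(p1, p2), 2))                          # 0.93
print(round(vsm_cosine(sum(p1, []), sum(p2, [])), 2))           # 0.6
```

In this toy case the matching-based score stays high because the synonym pairs and the reordered sentences are still paired with each other, while the term-frequency cosine only credits the three words that appear verbatim in both paragraphs.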

ZHANG Shaoyang, CAO Jiabo, WANG Zifan, QU Weidong. Chinese paragraph similarity calculated based on weighted bipartite graph match[J]. Computer Engineering and Applications, 2017, 53(18): 95-101.

