Chinese paragraph similarity calculated based on weighted bipartite graph match

doi:10.3778/j.issn.1002-8331.1603-0302

Abstract

Abstract: Statistical methods based on word frequency, typified by the traditional Vector Space Model (VSM), achieve low accuracy when computing the similarity of Chinese paragraphs. To address this, the paper proposes a method for computing Chinese paragraph similarity based on weighted bipartite graph matching. The computation is divided into two levels, paragraphs and sentences; a sentence is treated as a simple paragraph, so its similarity is also computed by bipartite graph matching. First, a sentence backbone extraction algorithm extracts the backbone words of each sentence. These words serve as the vertices of a weighted bipartite graph, and the similarities between them serve as the edge weights, from which sentence similarity is computed. Second, the sentences serve as the vertices of a weighted bipartite graph and the similarities between sentences as the edge weights, from which paragraph similarity is computed. Experimental results show that the proposed method is considerably more accurate than VSM, because it accurately identifies synonyms and automatically matches similar words that appear at different positions in the two paragraphs.

Key words: paragraph similarity, sentence backbone extraction, bipartite graph matching, vector space model, Chinese word segmentation
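
The two-level matching described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the word-similarity measure, the normalization, or the backbone extraction algorithm, so the synonym table `SYNONYMS`, the 0.9 synonym score, the division by the larger vertex count, and the pre-extracted keyword lists are all assumptions made for illustration. The matching itself uses the standard Hungarian (Kuhn-Munkres) algorithm on negated weights.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")

def max_weight_matching(w: List[List[float]]) -> List[Tuple[int, int]]:
    """Maximum-weight assignment on a complete bipartite graph (Hungarian algorithm)."""
    n, m = len(w), len(w[0])
    if n > m:  # the algorithm below needs rows <= columns, so transpose and swap back
        wt = [list(col) for col in zip(*w)]
        return [(i, j) for j, i in max_weight_matching(wt)]
    inf = float("inf")
    u, v = [0.0] * (n + 1), [0.0] * (m + 1)    # dual potentials (1-indexed)
    p, way = [0] * (m + 1), [0] * (m + 1)      # p[j]: row matched to column j
    for i in range(1, n + 1):
        p[0], j0 = i, 0
        minv, used = [inf] * (m + 1), [False] * (m + 1)
        while True:
            used[j0] = True
            i0, delta, j1 = p[j0], inf, 0
            for j in range(1, m + 1):
                if not used[j]:
                    cur = -w[i0 - 1][j - 1] - u[i0] - v[j]  # cost = negated weight
                    if cur < minv[j]:
                        minv[j], way[j] = cur, j0
                    if minv[j] < delta:
                        delta, j1 = minv[j], j
            for j in range(m + 1):
                if used[j]:
                    u[p[j]] += delta
                    v[j] -= delta
                else:
                    minv[j] -= delta
            j0 = j1
            if p[j0] == 0:
                break
        while j0:  # flip the augmenting path back to the virtual column 0
            j1 = way[j0]
            p[j0] = p[j1]
            j0 = j1
    return [(p[j] - 1, j - 1) for j in range(1, m + 1) if p[j]]

def bipartite_similarity(a: Sequence[T], b: Sequence[T],
                         sim: Callable[[T, T], float]) -> float:
    """Sum of matched edge weights, normalized by the larger side (an assumption)."""
    if not a or not b:
        return 0.0
    w = [[sim(x, y) for y in b] for x in a]
    return sum(w[i][j] for i, j in max_weight_matching(w)) / max(len(a), len(b))

# Hypothetical synonym table standing in for a real lexical resource.
SYNONYMS = {frozenset(("电脑", "计算机")), frozenset(("用户", "顾客"))}

def word_similarity(x: str, y: str) -> float:
    if x == y:
        return 1.0
    return 0.9 if frozenset((x, y)) in SYNONYMS else 0.0  # 0.9 is an assumed score

def sentence_similarity(s1: Sequence[str], s2: Sequence[str]) -> float:
    # Level 1: vertices are backbone words, edge weights are word similarities.
    return bipartite_similarity(s1, s2, word_similarity)

def paragraph_similarity(p1: Sequence[Sequence[str]],
                         p2: Sequence[Sequence[str]]) -> float:
    # Level 2: vertices are sentences, edge weights are sentence similarities.
    return bipartite_similarity(p1, p2, sentence_similarity)

# Each sentence is given as its already-extracted backbone words.
para_a = [["电脑", "价格", "上涨"], ["用户", "不满"]]
para_b = [["顾客", "不满"], ["计算机", "价格", "上涨"]]
score = paragraph_similarity(para_a, para_b)
```

In this toy input the two paragraphs contain the same two sentences in swapped order and with synonyms substituted, so the matching pairs each sentence with its counterpart regardless of position. A word-frequency vector would treat 电脑 and 计算机 as unrelated terms, which is the weakness of VSM that the abstract points to.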

ZHANG Shaoyang, CAO Jiabo, WANG Zifan, QU Weidong. Chinese paragraph similarity calculated based on weighted bipartite graph match[J]. Computer Engineering and Applications, 2017, 53(18): 95-101.
