Computer Engineering and Applications ›› 2017, Vol. 53 ›› Issue (18): 95-101. DOI: 10.3778/j.issn.1002-8331.1603-0302


Chinese paragraph similarity calculated based on weighted bipartite graph match

ZHANG Shaoyang, CAO Jiabo, WANG Zifan, QU Weidong   

  1. School of Information Engineering, Chang’an University, Xi’an 710064, China
  • Online: 2017-09-15 Published: 2017-09-29

基于加权二部图匹配的中文段落相似度计算

张绍阳,曹家波,王子凡,曲卫东   

  1. 长安大学 信息工程学院,西安 710064

Abstract: Statistical methods based on word frequency, typified by the traditional Vector Space Model (VSM), have low accuracy when computing the similarity of Chinese paragraphs. To address this, this paper proposes a method for computing Chinese paragraph similarity based on weighted bipartite graph matching. The computation is divided into two levels, paragraphs and sentences; a sentence is treated as a simple paragraph, so its similarity is also computed by bipartite graph matching. First, a backbone-word extraction algorithm extracts the main vocabulary of each sentence. These backbone words serve as the vertices of a weighted bipartite graph, and the similarity between backbone words serves as the edge weights, from which sentence similarity is computed. Second, sentences serve as the vertices of a weighted bipartite graph and the similarity between sentences serves as the edge weights, from which paragraph similarity is computed. Experimental results show that the proposed method is considerably more accurate than VSM, because it can identify synonyms accurately and automatically match similar words that appear at different positions in the two paragraphs.

Key words: paragraph similarity, sentence backbone extraction, bipartite graph matching, vector space model, Chinese word segmentation

摘要: 为了改进传统以向量空间模型(VSM)为代表的基于词频统计的方法在中文段落相似度计算时存在的精度不高问题,在基于加权二部图匹配的思想上提出了一种计算中文段落之间相似度的方法。该方法将相似度计算分为段落和句子两个层次,将句子作为简单段落看待,也使用二部图匹配进行相似度计算。首先利用句子主干词汇提取算法来提取句子的主干词汇,将主干词汇作为二部图的顶点,把主干词汇之间的相似度作为二部图顶点之间的权值系数,进行句子相似度的计算。其次,将句子作为加权二部图的顶点,把句子之间的相似度作为二部图顶点之间的权值系数,进行段落之间的相似度计算。实验结果表明,该方法与VSM相比,由于它能准确识别同义词,自动匹配两个在段落中不同位置的相似词语,因而在准确度上有了很大的提高。

关键词: 段落相似度, 句子主干提取, 二部图匹配, 向量空间模型, 中文分词
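
The two-level matching described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation, and several parts are assumptions made only for illustration: sentences are assumed to arrive already segmented and reduced to their backbone words; `SYNONYM_GROUPS` and the 0.8 synonym score are a hypothetical stand-in for whatever lexical resource the paper uses; the score is normalized by the size of the larger vertex set, which the abstract does not specify; and the maximum-weight matching is found by an exhaustive memoized search that is only practical for small inputs (a real system would more likely use the Kuhn-Munkres algorithm).

```python
from functools import lru_cache

# Hypothetical toy synonym groups (assumption: the abstract does not name
# the lexical resource used to score word similarity).
SYNONYM_GROUPS = [{"电脑", "计算机"}, {"购买", "买"}]


def word_similarity(a, b):
    """Toy word similarity: 1.0 for identical words, 0.8 for listed synonyms."""
    if a == b:
        return 1.0
    if any(a in group and b in group for group in SYNONYM_GROUPS):
        return 0.8  # assumed score for a synonym pair
    return 0.0


def max_weight_matching(weights):
    """Total weight of a maximum-weight bipartite matching.

    weights[i][j] is the non-negative edge weight between left vertex i and
    right vertex j. Uses a memoized search over subsets of right vertices,
    so it is only suitable for small graphs.
    """
    if not weights or not weights[0]:
        return 0.0
    if len(weights) > len(weights[0]):
        # Transpose so that there are at least as many columns as rows.
        weights = [list(col) for col in zip(*weights)]
    n, m = len(weights), len(weights[0])

    @lru_cache(maxsize=None)
    def best(i, used):
        # Best total weight for rows i..n-1, given a bitmask of used columns.
        if i == n:
            return 0.0
        return max(
            weights[i][j] + best(i + 1, used | (1 << j))
            for j in range(m)
            if not used & (1 << j)
        )

    return best(0, 0)


def matching_similarity(items_a, items_b, pair_similarity):
    """Similarity of two vertex sets via weighted bipartite matching."""
    if not items_a or not items_b:
        return 0.0
    weights = [[pair_similarity(a, b) for b in items_b] for a in items_a]
    # Assumed normalization: divide by the size of the larger set.
    return max_weight_matching(weights) / max(len(items_a), len(items_b))


def sentence_similarity(words_a, words_b):
    """Level 1: vertices are backbone words, weights are word similarities."""
    return matching_similarity(words_a, words_b, word_similarity)


def paragraph_similarity(sentences_a, sentences_b):
    """Level 2: vertices are sentences, weights are sentence similarities."""
    return matching_similarity(sentences_a, sentences_b, sentence_similarity)


# Two toy paragraphs with the same content, reordered and using synonyms.
para_a = [["我", "购买", "电脑"], ["天气", "很好"]]
para_b = [["天气", "很好"], ["我", "买", "计算机"]]
score = paragraph_similarity(para_a, para_b)
```

With this toy data the sentences are paired correctly despite their different order, and the synonym pairs still contribute to the score, so `score` comes out close to 1. A comparison based purely on word frequency would treat 电脑 and 计算机 as unrelated terms.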