Computer Engineering and Applications, 2021, Vol. 57, Issue (15): 171-177. DOI: 10.3778/j.issn.1002-8331.2004-0253

Algorithm of Text Similarity Analysis Based on Capsule-BiGRU

ZHAO Qi, DU Yanhui, LU Tianliang, SHEN Shaoyu   

  1. School of Police Information Engineering and Cyber Security, People’s Public Security University of China, Beijing 100038, China
  • Online: 2021-08-01 Published: 2021-07-26

基于Capsule-BiGRU的文本相似度分析算法

赵琪,杜彦辉,芦天亮,沈少禹   

  1. 中国人民公安大学 警务信息工程与网络安全学院,北京 100038

Abstract:

To address the problem that traditional neural network models cannot extract text features well, a text similarity analysis method based on capsule-BiGRU is proposed. The local feature matrix extracted by the capsule network and the global feature matrix extracted by the bidirectional gated recurrent unit network (BiGRU) are each analyzed for similarity to obtain similarity matrices of the texts. These matrices are then fused into a multi-level similarity vector for the two texts, from which their similarity is judged. The traditional capsule network is improved: words unrelated to the semantics of the text are treated as noise capsules and assigned smaller weights to reduce their impact on subsequent tasks. For the text similarity task, a co-attention mechanism is added before feature extraction: for the two texts to be analyzed, word vectors are weighted by computing the similarity between each word in one text and all words in the other, so that text similarity can be determined more accurately. Experiments are conducted on the Quora Question Pairs dataset. The results show that the proposed method achieves an accuracy of 86.16% and an F1 score of 88.77%, outperforming the compared methods.

Key words: text similarity, capsule, BiGRU, attention mechanism
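
The co-attention step described in the abstract (weighting the words of one text by their similarity to all words of the other text) can be illustrated with a minimal sketch. The abstract does not specify the similarity function or the normalization, so the cosine similarity, the softmax, the toy 2-dimensional embeddings, and all function names below are assumptions made for illustration only, not the authors' implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def co_attention_weights(text_a, text_b):
    """One weight per word of text_a, from its summed similarity to every word of text_b.

    Assumption: cosine similarity and softmax normalization are illustrative
    choices; the abstract does not name either.
    """
    scores = [sum(cosine(word, other) for other in text_b) for word in text_a]
    return softmax(scores)


def reweight(text, weights):
    """Scale each word vector by its attention weight."""
    return [[w * x for x in vec] for vec, w in zip(text, weights)]


# Hypothetical 2-d word embeddings for two short texts.
text_a = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
text_b = [[1.0, 0.0], [0.8, 0.6]]

weights_a = co_attention_weights(text_a, text_b)  # weights for words of text_a
weights_b = co_attention_weights(text_b, text_a)  # weights for words of text_b
weighted_a = reweight(text_a, weights_a)
weighted_b = reweight(text_b, weights_b)
```

In this toy input, the first word of `text_a` points in nearly the same direction as both words of `text_b` and so receives the largest weight, while the third word points the opposite way and receives the smallest. In the method described above, the reweighted word vectors would then be passed to the capsule network and the BiGRU for feature extraction.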

摘要:

针对传统神经网络模型不能很好地提取文本特征的问题,提出基于capsule-BiGRU的文本相似度分析方法,该方法将胶囊网络(capsule)提取的文本的局部特征矩阵和双向门控循环单元网络(BiGRU)提取的文本的全局特征矩阵分别进行相似度分析,得到文本的相似度矩阵,将相似度矩阵融合,得到两个文本的多层次相似度向量,从而进行文本相似度的判定。将传统的胶囊网络进行改进,把与文本语义无关的单词视为噪声胶囊,赋予较小权值,从而减轻对后续任务的影响。针对文本相似度的任务,在文本特征矩阵提取前加入互注意力机制,对于待分析的两个文本,通过计算一个文本中单词与另一文本中所有单词的相似度来对词向量赋予权值,从而能更准确地判断文本的相似度。在Quora Questions Pairs数据集进行实验,实验结果表明所提出的方法准确率为86.16%,F1值为88.77%,结果优于其他方法。

关键词: 文本相似度, 胶囊网络, 双向门控循环单元网络, 注意力机制