Computer Engineering and Applications ›› 2016, Vol. 52 ›› Issue (8): 70-73.

• Big Data and Cloud Computing •

结合关键词微变和LD算法的文本相似性研究

程玉胜1,2,梁  辉2,王一宾1,2,任  勇2   

  1. 安庆师范学院 计算机与信息学院,安徽 安庆 246011
  2. 安庆师范学院 统计学研究所,安徽 安庆 246011
  • 出版日期:2016-04-15 发布日期:2016-04-19

Research of text similarity combining micro variation of keywords and LD algorithm

CHENG Yusheng1,2, LIANG Hui2, WANG Yibin1,2, REN Yong2   

  1. School of Computer and Information, Anqing Normal University, Anqing, Anhui 246011, China
  2. Institute of Statistics, Anqing Normal University, Anqing, Anhui 246011, China
  • Online: 2016-04-15 Published: 2016-04-19

摘要: 为了解决基于传统向量空间模型的文本相似性算法没有考虑向量高维及关键词的微变,而导致文本相似性计算结果不够精确的问题,提出了关键词微变情况下基于聚类和LD算法的文本相似性算法TSABCLDA(Text Similarity Algorithm Based on Clustering and LD Algorithm)。对文本进行移除数字、标点符号和停用词等预处理;采用聚类的方法约简文本中的低频词,利用LD算法计算特征词间的相似度,建立文本相似度矩阵;用特征词相似度及其权重构建的空间向量计算文本间的相似度,这样不仅考虑了关键词微变的情况,而且有效地解决了文本向量的高维问题,将其应用于文本挖掘中,能够提高相似文本的挖掘效率。实验结果表明,由于考虑了关键词微变情况,在一定的阈值范围内,该算法文本相似性的准确率得到了明显的提高。

关键词: 聚类, LD算法, 文本相似度矩阵, 向量空间模型, 文本相似性

Abstract: Text similarity algorithms based on the traditional vector space model consider neither the high dimensionality of text vectors nor micro variations of keywords, so their similarity results are not precise enough. To address this problem, this paper proposes TSABCLDA (Text Similarity Algorithm Based on Clustering and LD Algorithm), a text similarity algorithm based on clustering and the LD algorithm that accounts for micro variations of keywords. First, the texts are preprocessed by removing numbers, punctuation and stop words. Next, low-frequency words are reduced by clustering, the similarity between feature words is calculated with the LD algorithm, and a text similarity matrix is built. Finally, the similarity between texts is calculated from space vectors constructed from the feature-word similarities and their weights. The algorithm thus takes micro variations of keywords into account and also effectively handles the high dimensionality of text vectors; applied to text mining, it can improve the efficiency of mining similar texts. Experimental results show that, because micro variations of keywords are considered, the accuracy of the algorithm's text similarity results improves markedly within a certain range of threshold values.

Key words: clustering, LD algorithm, text similarity matrix, vector space model, text similarity
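
The pipeline outlined in the abstract can be illustrated with a minimal Python sketch. It is not the authors' TSABCLDA implementation, and it rests on several assumptions that the abstract does not confirm: that "LD" refers to Levenshtein (edit) distance, that word similarity is the edit distance normalized by the longer word's length, that term weights are raw term frequencies, and that texts are compared with a soft-cosine style formula over a thresholded word similarity matrix. The clustering step that reduces low-frequency words is omitted, and `STOP_WORDS`, `word_similarity`, `text_similarity` and the `threshold` parameter are hypothetical names chosen for illustration.

```python
import re
from collections import Counter
from math import sqrt

# Hypothetical, deliberately tiny stop-word list for illustration only.
STOP_WORDS = {"the", "a", "an", "of", "and", "in", "to", "is"}


def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def word_similarity(a: str, b: str) -> float:
    """Assumed normalization: 1 - distance / length of the longer word."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def preprocess(text: str) -> list:
    """Lowercase, drop digits and punctuation, remove stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def _soft_dot(fa: Counter, fb: Counter, threshold: float) -> float:
    """Weighted sum over word pairs whose similarity reaches the threshold."""
    total = 0.0
    for wa, ca in fa.items():
        for wb, cb in fb.items():
            s = word_similarity(wa, wb)
            if s >= threshold:
                total += ca * cb * s
    return total


def text_similarity(text_a: str, text_b: str, threshold: float = 0.6) -> float:
    """Soft-cosine style similarity using term frequencies as weights."""
    fa, fb = Counter(preprocess(text_a)), Counter(preprocess(text_b))
    if not fa or not fb:
        return 0.0
    dot = _soft_dot(fa, fb, threshold)
    norm = sqrt(_soft_dot(fa, fa, threshold)) * sqrt(_soft_dot(fb, fb, threshold))
    return dot / norm


if __name__ == "__main__":
    a = "color analysis of clustering"
    b = "colour analyses of clusters"
    print(text_similarity(a, b, threshold=0.6))  # near-matches count
    print(text_similarity(a, b, threshold=1.0))  # exact matches only
```

With `threshold=1.0` only identical words contribute, which mimics a plain vector space model that treats "color" and "colour" as unrelated; lowering the threshold lets slightly varied keywords contribute in proportion to their similarity. Because the thresholded similarity matrix is not guaranteed to be positive semidefinite, the score in this sketch is not strictly bounded by 1 for every input.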