Research of text similarity combining micro variation of keywords and LD algorithm

Computer Engineering and Applications, 2016, Vol. 52, Issue (8): 70-73.

CHENG Yusheng1,2, LIANG Hui2, WANG Yibin1,2, REN Yong2

1. School of Computer and Information, Anqing Normal University, Anqing, Anhui 246011, China
2. Institute of Statistics, Anqing Normal University, Anqing, Anhui 246011, China

Online: 2016-04-15 Published: 2016-04-19

Abstract

Abstract: Text similarity algorithms based on the traditional vector space model ignore both the high dimensionality of text vectors and micro variations of keywords, which makes their similarity results imprecise. To address this problem, this paper proposes TSABCLDA (Text Similarity Algorithm Based on Clustering and LD Algorithm), a text similarity algorithm based on clustering and the LD algorithm that accounts for micro variations of keywords. The texts are first preprocessed by removing digits, punctuation and stop words. Clustering is then used to reduce the low-frequency words in the texts, the LD algorithm is used to compute the similarity between feature words, and a text similarity matrix is built. Finally, the similarity between texts is computed from space vectors constructed from the feature-word similarities and their weights. The approach thus accounts for micro variations of keywords and effectively addresses the high dimensionality of text vectors; applied to text mining, it can improve the efficiency of mining similar texts. Experimental results show that, because micro variations of keywords are taken into account, the accuracy of the algorithm's text similarity improves markedly within a certain threshold range.

Key words: clustering, LD algorithm, text similarity matrix, vector space model, text similarity
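The pipeline outlined in the abstract can be illustrated with a minimal Python sketch. It rests on several assumptions that are not stated on this page: "LD" is taken to mean Levenshtein (edit) distance, word similarity is normalized as 1 - distance / max length, raw term frequency stands in for the feature-word weights, and the final score is a soft-cosine-style measure rather than the paper's own formula. The clustering step that reduces low-frequency words is omitted, and the names `STOP_WORDS`, `preprocess`, `word_similarity`, `text_similarity` and the `threshold` parameter are hypothetical.

```python
import math
import re
from collections import Counter

# Hypothetical stop-word list, for illustration only.
STOP_WORDS = {"the", "a", "an", "of", "and", "is", "in", "to"}


def preprocess(text: str) -> list[str]:
    """Lowercase, keep alphabetic runs (drops digits and punctuation), drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def levenshtein(a: str, b: str) -> int:
    """Edit distance via dynamic programming, keeping one previous row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + cost))  # substitution or match
        prev = cur
    return prev[-1]


def word_similarity(a: str, b: str) -> float:
    """Assumed normalization: 1 - distance / length of the longer word."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))


def text_similarity(words_a: list[str], words_b: list[str],
                    threshold: float = 0.7) -> float:
    """Soft-cosine-style score: word pairs at or above `threshold` count partially."""
    wa, wb = Counter(words_a), Counter(words_b)

    def soft_dot(x: Counter, y: Counter) -> float:
        total = 0.0
        for u, wu in x.items():
            for v, wv in y.items():
                s = word_similarity(u, v)
                if s >= threshold:
                    total += wu * wv * s
        return total

    denom = math.sqrt(soft_dot(wa, wa) * soft_dot(wb, wb))
    return soft_dot(wa, wb) / denom if denom else 0.0


if __name__ == "__main__":
    a = preprocess("The color of the model")
    b = preprocess("The colour of the model")
    print(text_similarity(a, b))
```

With an exact-match vector space model, "color" and "colour" would count as unrelated terms. In this sketch they contribute partially because their normalized edit-distance similarity exceeds the threshold, which is the kind of keyword micro variation the abstract refers to.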

CHENG Yusheng, LIANG Hui, WANG Yibin, REN Yong. Research of text similarity combining micro variation of keywords and LD algorithm[J]. Computer Engineering and Applications, 2016, 52(8): 70-73.
