Text similarity calculation based on language network and semantic information

Computer Engineering and Applications, 2014, Vol. 50, Issue 5: 33-38.

Text similarity calculation based on language network and semantic information

ZHAN Zhijian, YANG Xiaoping

Department of Computer, School of Information, Renmin University of China, Beijing 100872, China

Online: 2014-03-01  Published: 2015-05-12

Abstract

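Abstract (translated from the Chinese): After analyzing the shortcomings of existing statistics-based and semantic-analysis-based text similarity measures, this paper proposes a new text similarity calculation method based on language networks and the semantic information of terms. A language network is built for each text, a comprehensive feature value is computed for every network node, and the TOP proportion of feature words is selected to represent the text, which effectively reduces the dimensionality of the text representation. The similarity between texts is then computed from the similarity between these TOP-proportion feature words together with the percentage share of each word's comprehensive feature value. Clustering experiments on a data set using the proposed similarity measure show that, by the F-measure criterion, it outperforms the traditional TF-IDF method and another similarity measure based on term semantic information.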

Abstract: To address the shortcomings of traditional text similarity methods that rely on word-frequency statistics or on the semantic information of words in a text, this paper proposes a new text similarity calculation method based on language networks and word semantic information. The method extracts feature items according to the feature values of the word nodes in a document's language network. It considers both the importance of the feature items and the semantic relations among them, and constructs a semantic network of document feature items to calculate the similarity of documents. Finally, several K-means clustering methods are used to evaluate the performance of the new document similarity measure. Experimental results show that the method's F-measure is superior to that of the other methods, which shows that the proposed method is effective.

Key words: language network, text clustering, text similarity, term semantic similarity
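
The pipeline summarized in the abstract (build a language network, score its nodes, keep the top share of feature words, then compare documents through word similarities weighted by feature-value share) can be illustrated with a minimal Python sketch. The abstract does not define the comprehensive feature value or the word-similarity measure, so everything specific here is an assumption made for illustration: the co-occurrence window, the `alpha` mix of node degree and term frequency, the `ratio` default, the best-match aggregation, and the placeholder `exact_match` function are all hypothetical and are not the authors' formulas.

```python
from collections import Counter, defaultdict


def build_network(tokens, window=2):
    """Undirected co-occurrence network: link words at most `window` positions apart."""
    neighbors = defaultdict(set)
    for i, word in enumerate(tokens):
        for other in tokens[i + 1 : i + 1 + window]:
            if other != word:  # skip self-loops
                neighbors[word].add(other)
                neighbors[other].add(word)
    return neighbors


def feature_values(tokens, window=2, alpha=0.5):
    """Assumed score: alpha * normalized degree + (1 - alpha) * normalized frequency."""
    if not tokens:
        return {}
    neighbors = build_network(tokens, window)
    freq = Counter(tokens)
    max_degree = max((len(n) for n in neighbors.values()), default=0) or 1
    max_freq = max(freq.values())
    return {
        w: alpha * len(neighbors[w]) / max_degree + (1 - alpha) * freq[w] / max_freq
        for w in freq
    }


def top_features(values, ratio=0.3):
    """Keep the top `ratio` share of words; renormalize their scores to sum to 1."""
    k = max(1, round(len(values) * ratio))
    top = sorted(values.items(), key=lambda kv: (-kv[1], kv[0]))[:k]
    total = sum(v for _, v in top)
    return {w: v / total for w, v in top}


def exact_match(a, b):
    """Placeholder word similarity; a semantic measure would be plugged in here."""
    return 1.0 if a == b else 0.0


def directed(src, dst, word_sim):
    """Weight each source word's best match in `dst` by its feature-value share."""
    return sum(
        weight * max((word_sim(w, u) for u in dst), default=0.0)
        for w, weight in src.items()
    )


def text_similarity(tokens_a, tokens_b, ratio=0.3, word_sim=exact_match):
    """Symmetric similarity between two tokenized documents, in [0, 1]."""
    fa = top_features(feature_values(tokens_a), ratio)
    fb = top_features(feature_values(tokens_b), ratio)
    return 0.5 * (directed(fa, fb, word_sim) + directed(fb, fa, word_sim))


doc_a = "text similarity based on language network and text semantic information".split()
doc_b = "semantic information and language network for text clustering".split()
print(text_similarity(doc_a, doc_b))
```

Passing a different `word_sim` callable (for example, one backed by a semantic lexicon) is the point where term semantic information would enter; with `exact_match` the sketch reduces to weighted overlap of the selected feature words.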

ZHAN Zhijian, YANG Xiaoping. Text similarity calculation based on language network and semantic information[J]. Computer Engineering and Applications, 2014, 50(5): 33-38.
