Text similarity calculation based on language network and semantic information

Abstract: To address the shortcomings of traditional text similarity methods that rely on word-frequency statistics or on the semantic information of words in a text, this paper proposes a new text similarity calculation based on a language network and word semantic information. The method extracts feature items according to the feature values of the word nodes in a document's language network. It considers both the importance of the feature items and the semantic relations among them, and constructs a semantic network of document feature items to calculate the similarity between documents. Finally, several K-means clustering methods are used to evaluate the performance of the new similarity measure. Experimental results show that the method achieves a higher F-measure than the compared methods, which indicates that the proposed method is effective.

Key words: language network, text clustering, text similarity, term semantic similarity

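Chinese abstract (translated): By analyzing the shortcomings of existing text similarity measures based on statistics and on semantic analysis, a new text similarity calculation method based on a language network and word semantic information is proposed. A language network is built for each text, a composite feature value is computed for each network node, and the top proportion of feature words is selected to represent the text, which effectively reduces the dimensionality of the text representation. The similarity between texts is then computed from the similarity between these top feature words and from the share of the total that each word's composite feature value accounts for. Clustering experiments on a data set show that the proposed similarity calculation outperforms, in terms of F-measure, both the traditional TF-IDF method and another similarity measure based on word semantic information.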

Key words (translated): language network, text clustering, text similarity, word similarity
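
The pipeline outlined in the abstract (build a language network, score its word nodes, keep the top proportion of feature words, then combine word-level similarity with each word's share of the feature values) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The abstract does not define the composite feature value or the word similarity measure, so the sketch assumes a co-occurrence network with weighted degree as the node score, and a hypothetical `lexicon` dictionary standing in for a real semantic resource. The function names, the `window` and `ratio` parameters, and the symmetric averaging in `doc_similarity` are likewise illustrative assumptions.

```python
import math
from collections import defaultdict


def build_network(tokens, window=2):
    """Undirected co-occurrence network (an assumed form of 'language network').

    Edge weight = number of times two distinct words appear within
    `window` positions of each other.
    """
    graph = defaultdict(lambda: defaultdict(int))
    for i, w in enumerate(tokens):
        for j in range(i + 1, min(i + window + 1, len(tokens))):
            v = tokens[j]
            if v != w:
                graph[w][v] += 1
                graph[v][w] += 1
    return graph


def top_features(tokens, ratio=0.5, window=2):
    """Keep the top `ratio` of word nodes and return their normalized weights.

    Assumption: the node score is the weighted degree; the paper's
    composite feature value may combine several network measures.
    """
    graph = build_network(tokens, window)
    scores = {w: sum(nbrs.values()) for w, nbrs in graph.items()}
    if not scores:
        return {}
    k = max(1, math.ceil(len(scores) * ratio))
    top = sorted(scores, key=lambda w: (-scores[w], w))[:k]
    total = sum(scores[w] for w in top)
    # Each weight is the word's share of the retained feature values.
    return {w: scores[w] / total for w in top}


def word_sim(a, b, lexicon):
    """Hypothetical word similarity: 1.0 for identical words, else a lookup."""
    if a == b:
        return 1.0
    return lexicon.get(frozenset((a, b)), 0.0)


def doc_similarity(feat_a, feat_b, lexicon):
    """Weighted best-match similarity, averaged over both directions."""
    if not feat_a or not feat_b:
        return 0.0

    def directed(src, dst):
        return sum(
            weight * max(word_sim(w, v, lexicon) for v in dst)
            for w, weight in src.items()
        )

    return 0.5 * (directed(feat_a, feat_b) + directed(feat_b, feat_a))
```

For example, `doc_similarity(top_features(doc_a), top_features(doc_b), lexicon)` takes two token lists and returns a score between 0 and 1, provided the lexicon values also lie in that range. A matrix of such scores could then be passed to a clustering step like the K-means evaluation the abstract mentions.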

ZHAN Zhijian, YANG Xiaoping. Text similarity calculation based on language network and semantic information[J]. Computer Engineering and Applications, 2014, 50(5): 33-38.

詹志建，杨小平. 基于语言网络和语义信息的文本相似度计算[J]. 计算机工程与应用, 2014, 50(5): 33-38.

