Chinese short text similarity computation based on hybrid strategy

doi:10.3778/j.issn.1002-8331.1612-0277

Abstract

To improve the accuracy of Chinese short text similarity computation, this paper proposes a new similarity computation method for Chinese short texts based on a hybrid strategy. First, according to the semantic distance between words, hierarchical clustering is used to construct a clustering binary tree for the short texts; on this basis the traditional Vector Space Model (VSM) is improved and a keyword-weighted text similarity is computed. Then, the traditional method based on a syntax-semantic model is improved by extracting the main components of each sentence, which yields the semantic similarity of the sentence trunks. Finally, the two similarities are weighted to calculate the final text similarity. The experimental results show that the proposed method is more accurate for short text similarity computation and agrees more closely with human subjective judgment.

Keywords: short text similarity, keyword weight, hierarchical clustering, binary tree, main components
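
The final weighting step described in the abstract can be illustrated with a minimal sketch. The abstract does not give the paper's formulas, so everything here is an assumption for illustration: the function names `cosine_similarity` and `hybrid_similarity`, the mixing weight `alpha`, the use of plain cosine similarity over keyword weights, and the sample numbers are all hypothetical and are not taken from the paper.

```python
import math


def cosine_similarity(weights_a, weights_b):
    """Cosine similarity between two sparse keyword-weight vectors.

    Each argument maps a keyword to its weight. How the weights are derived
    (the paper uses a clustering binary tree) is outside this sketch.
    """
    shared = set(weights_a) & set(weights_b)
    dot = sum(weights_a[k] * weights_b[k] for k in shared)
    norm_a = math.sqrt(sum(w * w for w in weights_a.values()))
    norm_b = math.sqrt(sum(w * w for w in weights_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector is treated as sharing nothing
    return dot / (norm_a * norm_b)


def hybrid_similarity(keyword_sim, semantic_sim, alpha=0.5):
    """Linear combination of the two similarities.

    alpha is a hypothetical mixing weight in [0, 1]; the paper's actual
    weighting scheme is not stated in the abstract.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * keyword_sim + (1.0 - alpha) * semantic_sim


# Made-up keyword weights for two short texts.
text_a = {"similarity": 0.6, "text": 0.3, "cluster": 0.1}
text_b = {"similarity": 0.5, "text": 0.4, "tree": 0.1}

keyword_sim = cosine_similarity(text_a, text_b)
# semantic_sim would come from comparing sentence trunks; 0.7 is a placeholder.
final_sim = hybrid_similarity(keyword_sim, semantic_sim=0.7, alpha=0.5)
```

A linear combination keeps the result in [0, 1] whenever both inputs are, which makes `alpha` easy to tune against human similarity ratings.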


SONG Dongyun, ZHENG Jin, ZHANG Zuping. Chinese short text similarity computation based on hybrid strategy[J]. Computer Engineering and Applications, 2018, 54(12): 116-120.


