Search result clustering algorithm based on frequent itemsets meaning sequence

Computer Engineering and Applications ›› 2015, Vol. 51 ›› Issue (1): 13-20.

WANG Xiaobo (王晓博), LI Xiao (李晓), MA Bo (马博)

Multilingual Information Technology Research Laboratory, The Xinjiang Technical Institute of Physics & Chemistry, CAS, Urumqi 830011, China

Online: 2015-01-01 Published: 2015-01-06

Abstract

Most existing search result clustering algorithms cluster the snippets that a search engine generates for a user query. Because snippets are short and uneven in quality, good clustering results are hard to guarantee. This paper presents a search result clustering algorithm based on frequent word-meaning sequences, which uses WordNet to combine syntactic and semantic features when building clusters and cluster labels for the search results. Unlike traditional clustering algorithms based on the vector space model, which treat documents as bags of words, it takes the sequential pattern of words in a document into account. A word (meaning) sequence is frequent if it occurs in more than a certain percentage of the documents in the text database. The algorithm first preprocesses the text and generates compact documents to reduce the dimensionality of the text data, then builds a generalized suffix tree and mines the maximal frequent itemsets, and finally derives the frequent word-meaning sequences. Ordered frequent itemsets obtained from the documents reflect document topics better, so search results with the same topic are clustered together and those most relevant to the user's query are ranked first. Experiments show that the algorithm produces high-quality, query-relevant clusters with semantics-based cluster labels, achieves higher clustering accuracy and runtime efficiency, and scales well.

Key words: clustering algorithm, frequent itemset, information retrieval, WordNet
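
The pipeline outlined in the abstract (map words to meanings, drop infrequent words to get compact documents, mine frequent ordered sequences, keep the maximal ones as clusters and labels) can be illustrated with a minimal sketch. It is not the authors' implementation. Several simplifications are assumptions made here: the `SENSES` dictionary is a hand-written stand-in for a WordNet lookup, brute-force enumeration of contiguous subsequences replaces the generalized suffix tree, the support test uses "at least" rather than "more than" the given fraction, and all function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical stand-in for a WordNet lookup: maps a word to a sense label
# so that synonyms collapse to a single token.
SENSES = {"automobile": "car"}


def to_senses(words, senses):
    """Replace each word by its sense label when one is known."""
    return [senses.get(w, w) for w in words]


def frequent_words(docs, min_support):
    """Words occurring in at least min_support (a fraction) of the documents."""
    counts = defaultdict(int)
    for doc in docs:
        for w in set(doc):
            counts[w] += 1
    return {w for w, c in counts.items() if c >= min_support * len(docs)}


def frequent_sequences(docs, min_support):
    """Map each frequent contiguous sequence to the ids of documents containing it.

    Brute-force enumeration stands in for the generalized suffix tree and is
    only practical for short snippets.
    """
    keep = frequent_words(docs, min_support)
    support = defaultdict(set)
    for doc_id, doc in enumerate(docs):
        # Compact document: infrequent words removed, word order preserved.
        compact = [w for w in doc if w in keep]
        for i in range(len(compact)):
            for j in range(i + 1, len(compact) + 1):
                support[tuple(compact[i:j])].add(doc_id)
    return {s: ids for s, ids in support.items()
            if len(ids) >= min_support * len(docs)}


def contains(long_seq, short_seq):
    """True if short_seq appears as a contiguous run inside long_seq."""
    n = len(short_seq)
    return any(long_seq[k:k + n] == short_seq
               for k in range(len(long_seq) - n + 1))


def maximal(seqs):
    """Keep sequences that are not part of a longer frequent sequence."""
    return {s: ids for s, ids in seqs.items()
            if not any(len(t) > len(s) and contains(t, s) for t in seqs)}


def cluster_snippets(snippets, min_support=0.5, senses=SENSES):
    """Return {label sequence: set of snippet indices}."""
    docs = [to_senses(s.lower().split(), senses) for s in snippets]
    return maximal(frequent_sequences(docs, min_support))


snippets = [
    "jaguar car price",
    "jaguar automobile dealer",
    "jaguar cat habitat",
    "jaguar cat diet",
]
for label, doc_ids in cluster_snippets(snippets).items():
    print(" ".join(label), sorted(doc_ids))
```

In this toy run the four snippets for an ambiguous query split into two groups labeled by their maximal frequent sequences; a snippet may belong to more than one group if it contains several such sequences.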

WANG Xiaobo, LI Xiao, MA Bo. Search result clustering algorithm based on frequent itemsets meaning sequence[J]. Computer Engineering and Applications, 2015, 51(1): 13-20.
