Computer Engineering and Applications ›› 2014, Vol. 50 ›› Issue (21): 143-146.

• Database, Data Mining, Machine Learning •

Research on clustering Chinese search results with an improved suffix tree

袁津生,荣元媛   

  1. College of Information, Beijing Forestry University, Beijing 100083
  • Publication date: 2014-11-01 Release date: 2014-10-28

Chinese search results cluster research based on improved STC

YUAN Jinsheng, RONG Yuanyuan   

  1. College of Information, Beijing Forestry University, Beijing 100083, China
  • Online:2014-11-01 Published:2014-10-28

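Abstract (translated from the Chinese): Search result clustering helps users quickly locate the information they are looking for. The work emphasizes generating high-quality labels while clustering Chinese text. The titles and snippets of web pages returned by a search engine are collected, the text is segmented with a word segmentation tool, and stop words are removed. A single suffix tree is built for all documents, with words as the unit inserted into its nodes, and the words at each node are scored under several constraints: word frequency, word length, part of speech, and position. Base clusters are merged, and the highest-scoring node words are taken as labels. Experimental results show that the method produces clusters of relatively high purity and that the extracted labels are accurate and discriminative, which makes them convenient for users.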

Keywords (translated from the Chinese): search result clustering, suffix tree, cluster label, Chinese retrieval, clustering

Abstract: Search result clustering helps users quickly find the information they need. This paper focuses on clustering Chinese text while generating high-quality labels. Web page titles and snippets returned by a search engine are collected, segmented with a word segmentation tool, and stripped of stop words. A single suffix tree is then constructed, with words as the unit inserted into its nodes. Each node is scored under several constraints: word frequency, word length, part of speech, and position. Base clusters are merged, and the highest-scoring node words are used as labels. Experimental results show that the resulting clusters have relatively high purity and that the extracted labels are accurate and discriminative, making them convenient for users.

Key words: search results clustering, suffix tree, cluster label, Chinese search, clustering
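
The pipeline summarized in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes the documents are already segmented and stop-word filtered, enumerates word-level phrases (the paths a word-based suffix tree would contain) with a plain dictionary rather than an explicit tree, and merges base clusters with the 0.5 mutual-overlap rule of classic suffix tree clustering. The `score` function is a hypothetical placeholder that uses only document frequency and phrase length; the part-of-speech and position terms are left out because the abstract does not give their form. All function names and parameters are illustrative.

```python
from collections import defaultdict


def build_base_clusters(docs, max_len=4):
    """Map each word-level phrase to the set of documents that contain it.

    Every phrase corresponds to a path from the root of a word-based
    suffix tree; max_len (an assumed cap) bounds the path depth.
    """
    clusters = defaultdict(set)
    for doc_id, words in enumerate(docs):
        for start in range(len(words)):  # every suffix of the document
            stop = min(start + max_len, len(words))
            for end in range(start + 1, stop + 1):
                clusters[tuple(words[start:end])].add(doc_id)
    # A base cluster needs a phrase shared by at least two documents.
    return {p: d for p, d in clusters.items() if len(d) >= 2}


def score(phrase, doc_ids):
    """Hypothetical score: document frequency times capped phrase length."""
    return len(doc_ids) * min(len(phrase), 4)


def merge_base_clusters(base, threshold=0.5):
    """Merge base clusters whose document sets overlap mutually above threshold."""
    phrases = list(base)
    parent = list(range(len(phrases)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(phrases)):
        for j in range(i + 1, len(phrases)):
            a, b = base[phrases[i]], base[phrases[j]]
            common = len(a & b)
            if common / len(a) > threshold and common / len(b) > threshold:
                parent[find(i)] = find(j)

    groups = defaultdict(list)
    for i, phrase in enumerate(phrases):
        groups[find(i)].append(phrase)

    merged = []
    for members in groups.values():
        doc_ids = set().union(*(base[p] for p in members))
        # The highest-scoring phrase in the group becomes the label.
        label = max(members, key=lambda p: score(p, base[p]))
        merged.append((" ".join(label), sorted(doc_ids)))
    return merged


if __name__ == "__main__":
    # Toy input standing in for segmented titles and snippets.
    docs = [
        ["suffix", "tree", "clustering"],
        ["suffix", "tree", "labels"],
        ["search", "engine", "results"],
        ["search", "engine", "ranking"],
    ]
    for label, doc_ids in merge_base_clusters(build_base_clusters(docs)):
        print(label, doc_ids)
```

For Chinese input, each document would first pass through a word segmenter and a stop-word filter, and labels would be joined without spaces. The pairwise merge is quadratic in the number of base clusters, which is acceptable for the few hundred snippets a single query returns.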