Computer Engineering and Applications ›› 2011, Vol. 47 ›› Issue (26): 146-150.

• Database, Signal and Information Processing •

Topology-based document similarity search algorithm

YANG Yan1,2, ZHU Ge1, FAN Wenbin1

  1. School of Computer Science and Technology, Heilongjiang University, Harbin 150080, China
  2. The Key Laboratory of Computational Biology, Heilongjiang University, Harbin 150080, China
  • Received:1900-01-01 Revised:1900-01-01 Online:2011-09-11 Published:2011-09-11

Abstract: Finding similar documents quickly and efficiently in a large document collection is an important and time-consuming problem. Existing algorithms first find a candidate document set and then rank the candidates by a relevance measure to identify the most relevant ones. This paper proposes Hub-N, a topology-based document similarity search algorithm. It transforms the document similarity search problem into a graph search problem and applies pruning techniques to reduce the range of scanned documents, significantly improving retrieval efficiency. Experiments show that the algorithm is effective and feasible.

Key words: document topology, similarity search, similarity

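
The two-stage approach that the abstract attributes to existing algorithms (find a candidate set, then rank it by relevance) can be illustrated with a minimal Python sketch. This is not the Hub-N algorithm, whose details are not given here; the whitespace tokenizer, the term-frequency vectors, the cosine measure, and the names build_index, cosine, and similar_documents are all assumptions made for illustration only.

```python
import math
from collections import Counter, defaultdict


def build_index(docs):
    """Build term-frequency vectors and an inverted index (term -> doc ids).

    `docs` maps a document id to its text. Whitespace tokenization is an
    assumption of this sketch, not something taken from the paper.
    """
    vectors = {doc_id: Counter(text.lower().split()) for doc_id, text in docs.items()}
    inverted = defaultdict(set)
    for doc_id, vec in vectors.items():
        for term in vec:
            inverted[term].add(doc_id)
    return vectors, inverted


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def similar_documents(query_id, vectors, inverted, k=3):
    """Return up to k (doc_id, score) pairs most similar to `query_id`."""
    query = vectors[query_id]
    # Stage 1: candidate set = documents sharing at least one term with the query.
    candidates = set()
    for term in query:
        candidates |= inverted[term]
    candidates.discard(query_id)
    # Stage 2: rank candidates by similarity (ties broken by document id).
    scored = sorted(((cosine(query, vectors[c]), c) for c in candidates),
                    key=lambda pair: (-pair[0], pair[1]))
    return [(doc_id, score) for score, doc_id in scored[:k]]


docs = {
    "a": "graph search pruning",
    "b": "graph search index",
    "c": "cooking recipes pasta",
    "d": "graph theory",
}
vectors, inverted = build_index(docs)
result = similar_documents("a", vectors, inverted)
```

In this sketch every candidate is scored in full; the abstract's claim is that recasting the problem as graph search with pruning reduces how many documents need to be scanned at all.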