Computer Engineering and Applications ›› 2010, Vol. 46 ›› Issue (28): 135-137.DOI: 10.3778/j.issn.1002-8331.2010.28.038

• Database, Signal and Information Processing •

Real-time retrieval in Chinese webpage by using keywords inverted table

WANG Yuan-ding,LIANG Jiu-zhen   

  1. School of Information Engineering,Jiangnan University,Wuxi,Jiangsu 214122,China
  • Received:2009-02-27 Revised:2009-04-13 Online:2010-10-01 Published:2010-10-01
  • Contact: WANG Yuan-ding

利用关键词倒排表实时检索中文网页

王远定,梁久祯   

  1. 江南大学 信息工程学院,江苏 无锡 214122
  • 通讯作者: 王远定

Abstract: This paper studies a fast retrieval technique for Chinese webpages based on an inverted keyword table. Given a large webpage corpus, keyword feature vectors for each webpage are generated offline using a keyword dictionary and an optimized forward maximum matching segmentation algorithm. The feature vectors are then reduced in dimension to produce a webpage feature table in compressed format. Finally, an inverted keyword file is built from the webpage feature table according to the frequency with which each keyword occurs across all webpages. In the experiment, keyword retrieval of Chinese webpages is implemented against three data sources (the original webpage database, the feature table, and the inverted file), and the real-time performance of the three retrieval methods is compared. The results show that retrieval using the keyword inverted file is far faster than the other two methods and offers good real-time performance.

Key words: retrieval, webpage feature table, inverted file, real-time
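
The pipeline summarized in the abstract (segmentation, feature table, inverted file, lookup) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not describe their optimization of the segmentation step or their compressed table format, so plain forward maximum matching and a sparse per-page dictionary stand in for them here. All function names, the `max_len` parameter, and the sample pages and dictionary are hypothetical.

```python
from collections import defaultdict


def forward_max_match(text, dictionary, max_len=4):
    """Plain forward maximum matching (assumed stand-in for the paper's
    optimized variant): at each position take the longest dictionary word."""
    tokens = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            word = text[i:i + size]
            if word in dictionary:
                tokens.append(word)
                i += size
                break
        else:
            i += 1  # no dictionary word starts here, so skip one character
    return tokens


def build_feature_table(pages, dictionary):
    """Map page id -> {keyword: count}. Only nonzero counts are stored,
    which is a simple (assumed) form of dimension reduction."""
    table = {}
    for page_id, text in pages.items():
        counts = defaultdict(int)
        for word in forward_max_match(text, dictionary):
            counts[word] += 1
        table[page_id] = dict(counts)
    return table


def build_inverted_file(feature_table):
    """Map keyword -> list of (page id, count), highest count first."""
    inverted = defaultdict(list)
    for page_id, counts in feature_table.items():
        for word, count in counts.items():
            inverted[word].append((page_id, count))
    for postings in inverted.values():
        postings.sort(key=lambda p: (-p[1], p[0]))
    return dict(inverted)


def search(inverted, keyword):
    """Return matching page ids with a single dictionary lookup."""
    return [page_id for page_id, _ in inverted.get(keyword, [])]


# Hypothetical toy corpus and keyword dictionary.
pages = {
    1: "中文网页检索",
    2: "网页检索与网页特征",
    3: "倒排文件",
}
dictionary = {"中文", "网页", "检索", "特征", "倒排", "文件"}

table = build_feature_table(pages, dictionary)
inverted = build_inverted_file(table)
print(search(inverted, "网页"))
```

In this sketch the segmentation and counting happen once, offline, and a query only reads one postings list, whereas searching the raw pages or the feature table means visiting every page. That difference is the intuition behind the speed gap the abstract reports, although the sketch makes no claim about the measured results.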

摘要: 研究了基于关键词倒排表的中文网页快速检索方法。在建立大量网页语料库的前提下,利用关键词词典和优化后的前向最大切词算法脱机生成网页关键词特征向量,然后对网页特征向量作维数压缩生成压缩格式的网页特征表,最后利用网页特征表根据关键词在所有网页中出现的频率统计生成关键词倒排文件。实验中,通过对比访问网页库、特征表和倒排文件三种不同的数据来源,分别实现了中文网页的关键词检索,比较了三种数据源检索的实时性。实验表明,基于关键词的倒排表检索算法大大优于其他两种方法,具有很好的实时性。

关键词: 检索, 网页特征表, 倒排文件, 实时性
