Computer Engineering and Applications, 2009, Vol. 45, Issue 12: 157-159. DOI: 10.3778/j.issn.1002-8331.2009.12.051

• Database, Signal and Information Processing •

A Chinese analyzer for the Lucene search engine

HU Chang-chun, LIU Gong-shen

  

  1. School of Information Security Engineering, Shanghai Jiao Tong University, Shanghai 200240, China
  • Received: 2008-03-28 Revised: 2008-04-28 Online: 2009-04-21 Published: 2009-04-21
  • Contact: HU Chang-chun


Abstract: Most Chinese analyzers currently used with the Lucene search engine segment text in ways that do not conform to Chinese usage. To address this, the paper implements a new Chinese analyzer based on the forward maximum matching segmentation algorithm and a dictionary of basic standard Chinese words. The experimental results show that its segmentation conforms better to Chinese usage, while its segmentation and indexing performance remains very close to that of analyzers based on mechanical segmentation. In addition, retrieval speed improves 2 to 4 times, and retrieval recall improves by 59%.
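
The forward maximum matching idea named in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `forward_max_match`, the toy dictionary, and the single-character fallback for text not covered by the dictionary are assumptions made here, and the abstract does not describe the paper's actual dictionary or its integration with Lucene.

```python
def forward_max_match(text, dictionary, max_word_len=None):
    """Segment text left to right, taking the longest dictionary word at each position.

    Hypothetical helper for illustration; characters that start no dictionary
    word are emitted as single-character tokens (an assumed fallback).
    """
    if max_word_len is None:
        # Longest entry bounds the window; default to 1 for an empty dictionary.
        max_word_len = max((len(word) for word in dictionary), default=1)

    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink the window by one
        # character at a time until a dictionary word (or one character) remains.
        for size in range(min(max_word_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += size
                break
    return tokens


# Toy dictionary, assumed for this example only.
toy_dictionary = {
    "上海", "交通", "大学", "上海交通大学",
    "信息", "安全", "信息安全", "工程", "学院",
}

print(forward_max_match("上海交通大学信息安全工程学院", toy_dictionary))
```

With this toy dictionary the greedy pass keeps 上海交通大学 as one token instead of splitting it into 上海, 交通 and 大学, which is the kind of word-level output a dictionary-based analyzer aims for.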