Computer Engineering and Applications, 2009, Vol. 45, Issue (19): 139-141. DOI: 10.3778/j.issn.1002-8331.2009.19.043

• Database and Information Processing •

Chinese word segmentation dictionary using two-level index

ZHANG Qing-yang, CHAI Sheng

  1. Department of Computer Science and Technology, Jilin University, Changchun 130062, China
  • Received: 2008-04-15 Revised: 2008-07-23 Online: 2009-07-01 Published: 2009-07-01
  • Contact: ZHANG Qing-yang

Abstract: Chinese word segmentation is the foundation of Chinese information processing and plays an important role in fields such as search engines and automatic translation. The segmentation dictionary is the basis of mechanical (dictionary-based) segmentation algorithms: it tells the algorithm what counts as a word. Because the algorithm repeatedly consults the dictionary to match strings in the text, the dictionary's storage structure largely determines which matching algorithm can be used and how well it performs. Building on a study of existing segmentation dictionaries and matching algorithms, this paper adds a multi-level index to the dictionary and proposes a new storage mechanism, a Chinese word segmentation dictionary based on a two-level index. On top of this dictionary, it also proposes an improved forward matching algorithm that greatly reduces the time complexity of the matching process and thereby increases the overall segmentation speed.
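
The abstract does not specify how the two index levels are laid out, so the following is only a minimal sketch under stated assumptions, not the authors' implementation. It assumes the first level is keyed by a word's first character and the second level by word length; the names `TwoLevelDict`, `longest_match`, and `forward_segment` are hypothetical. It illustrates why an index can speed up forward matching: only word lengths that actually occur for a given first character are tried.

```python
class TwoLevelDict:
    """Assumed layout: first character -> word length -> set of words."""

    def __init__(self, words):
        self.index = {}
        for word in words:
            if not word:
                continue  # skip empty entries
            by_length = self.index.setdefault(word[0], {})
            by_length.setdefault(len(word), set()).add(word)

    def longest_match(self, text, start):
        """Return the longest dictionary word beginning at text[start], or None."""
        by_length = self.index.get(text[start])
        if by_length is None:
            return None  # no word starts with this character
        remaining = len(text) - start
        # Try only the lengths present for this first character, longest first.
        for length in sorted(by_length, reverse=True):
            if length > remaining:
                continue
            candidate = text[start:start + length]
            if candidate in by_length[length]:
                return candidate
        return None


def forward_segment(text, dictionary):
    """Forward maximum matching; unmatched characters become one-character tokens."""
    tokens = []
    pos = 0
    while pos < len(text):
        word = dictionary.longest_match(text, pos)
        if word is None:
            word = text[pos]
        tokens.append(word)
        pos += len(word)
    return tokens
```

With a toy dictionary containing 中文, 分词, 词典, 中文分词, and 索引, calling `forward_segment("中文分词词典", TwoLevelDict(words))` tries length 4 before length 2 at the first position and never tests length 3, since no word of that length starts with 中.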