Computer Engineering and Applications ›› 2012, Vol. 48 ›› Issue (26): 127-130.

• Database, Signal and Information Processing •


A study of deep-processing methods for a Tibetan corpus

CAI Zangtai   

  1. School of Computer, Qinghai Normal University, Xining 810008, China
  • Online: 2012-09-11    Published: 2012-09-21



Abstract: With the continuing development and maturation of natural language information processing, large-scale corpus text processing has become a hot topic in computational linguistics, chiefly because the required linguistic knowledge can be extracted from large corpora. Drawing on development experience from the preliminary 973 Program project "Research on Word Segmentation and Tagging Specifications for a Tibetan Corpus", this article describes the construction of the large-scale Banzhida Tibetan corpus and the design and implementation of its segmentation-and-tagging dictionary and segmentation-and-tagging software. It focuses on the index structure and lookup algorithm of the dictionary, and on the case-particle block matching algorithm and restoration algorithm of the segmentation-and-tagging software.
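The abstract mentions a segmentation dictionary with an index structure and lookup algorithm. As a rough illustration of the general family of techniques involved (not the paper's actual case-particle block matching algorithm, whose details are not given here), the following is a hypothetical sketch of dictionary-based forward maximum matching, with a toy Latin-script lexicon standing in for a Tibetan one:

```python
# Hypothetical sketch: dictionary-based forward maximum-matching
# segmentation. This is NOT the paper's algorithm; it only illustrates
# the general idea of segmenting text against a word dictionary.

def max_match(text, lexicon, max_len=5):
    """Greedily take the longest lexicon entry at each position;
    fall back to a single character when nothing matches."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking down to one character.
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in lexicon or j == i + 1:
                tokens.append(text[i:j])
                i = j
                break
    return tokens

lexicon = {"the", "them", "table", "tab"}
print(max_match("tablethem", lexicon))  # ['table', 'them']
```

Greedy longest-match can mis-segment ambiguous strings, which is one motivation for the more refined block-matching and restoration steps the paper discusses for Tibetan case particles.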

Key words: Tibetan corpus, word segmentation and tagging, segmentation dictionary, restoration algorithm