Computer Engineering and Applications ›› 2012, Vol. 48 ›› Issue (26): 127-130.

• Database, Signal and Information Processing •


A study of deep-processing methods for a Tibetan corpus

CAI Zangtai   

  1. School of Computer, Qinghai Normal University, Xining 810008, China
  • Online: 2012-09-11    Published: 2012-09-21



Abstract: With the continuing development and maturation of natural language information processing, large-scale corpus text processing has become a hot topic in computational linguistics, chiefly because the required linguistic knowledge can be extracted from large corpora. Drawing on development experience from the preliminary 973 Program project "Research on Word Segmentation and Tagging Specifications for a Tibetan Corpus", this article describes the construction of the large-scale Banzhida Tibetan corpus and the design and implementation of its segmentation-and-tagging dictionary and segmentation-and-tagging software. It focuses on the index structure and lookup algorithm of the dictionary, and on the case-particle block matching algorithm and restoration algorithm of the segmentation-and-tagging software.
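The abstract mentions a segmentation dictionary with an index structure and lookup algorithm. As a rough illustration of the general family of techniques involved (not the paper's actual case-particle block matching algorithm, whose details are not given here), the following is a hypothetical sketch of dictionary-based forward maximum matching, with a toy Latin-script lexicon standing in for a Tibetan one:

```python
# Hypothetical sketch: dictionary-based forward maximum-matching
# segmentation. This is NOT the paper's algorithm; it only illustrates
# the general idea of segmenting text against a word dictionary.

def max_match(text, lexicon, max_len=5):
    """Greedily take the longest lexicon entry at each position;
    fall back to a single character when nothing matches."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking down to one character.
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in lexicon or j == i + 1:
                tokens.append(text[i:j])
                i = j
                break
    return tokens

lexicon = {"the", "them", "table", "tab"}
print(max_match("tablethem", lexicon))  # ['table', 'them']
```

Greedy longest-match can mis-segment ambiguous strings, which is one motivation for the more refined block-matching and restoration steps the paper discusses for Tibetan case particles.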

Key words: Tibetan corpus, word segmentation and tagging, segmentation dictionary, restoration algorithm