Computer Engineering and Applications ›› 2014, Vol. 50 ›› Issue (11): 218-222.


Segmentation of Tibetan abbreviated forms based on word position

KANG Caijun1, LONG Congjun2,3, JIANG Di1,2   

  1. College of Humanities and Communications, Shanghai Normal University, Shanghai 200234, China
  2. Institute of Ethnology & Anthropology, Chinese Academy of Social Sciences, Beijing 100081, China
  3. National Languages Resource Monitoring & Research Center of Minority Language Branch, Minzu University of China, Beijing 100081, China
  • Online: 2014-06-01 Published: 2015-04-08

基于词位的藏文黏写形式的切分

康才畯1,龙从军2,3,江  荻1,2   

  1. 上海师范大学 人文与传播学院,上海 200234
  2. 中国社科院 民族研究所,北京 100081
  3. 中央民族大学 民族语言监测分中心,北京 100081

Abstract: This article uses a word-position-based statistical method to identify and segment abbreviated forms in modern Tibetan text. The main advantage of the approach is that it reduces the negative effect of unknown (out-of-vocabulary) words on recognition. To fit the characteristics of Tibetan, the commonly used 4-tag word-position set is first extended to a 6-tag set. A Conditional Random Field (CRF) then serves as the tagging model for training and testing on corpus data, and a rule base is applied to post-process the results. The experimental results show that the method achieves a high recognition accuracy and merits further study. The next steps are to expand the training corpus and to optimize the feature template used by the model.

Key words: Tibetan abbreviated forms, word position, Conditional Random Field (CRF), feature template, post-processing
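
As an illustration of the word-position idea in the abstract, the following minimal Python sketch converts a segmented sentence into per-unit position tags and back. It assumes the conventional 4-tag scheme (B = begin, M = middle, E = end, S = single-unit word); the abstract does not list the article's 6-tag set, so that extension is not reproduced here. The function names, the choice of tagging unit, and the placeholder strings are all hypothetical and do not come from the article.

```python
from typing import List, Tuple

Tagged = List[Tuple[str, str]]


def tag_word(units: List[str]) -> Tagged:
    """Assign 4-tag word-position labels to the units of one word.

    Assumption: B/M/E/S is the conventional 4-tag set; the article's
    6-tag extension for Tibetan is not specified in the abstract.
    """
    if not units:
        return []
    if len(units) == 1:
        return [(units[0], "S")]
    tags = ["B"] + ["M"] * (len(units) - 2) + ["E"]
    return list(zip(units, tags))


def tag_sentence(words: List[List[str]]) -> Tagged:
    """Flatten a segmented sentence into (unit, tag) training pairs."""
    tagged: Tagged = []
    for word in words:
        tagged.extend(tag_word(word))
    return tagged


def tags_to_words(tagged: Tagged) -> List[List[str]]:
    """Rebuild word boundaries from a tag sequence, e.g. a tagger's output."""
    words: List[List[str]] = []
    current: List[str] = []
    for unit, tag in tagged:
        # B or S always opens a new word, even after a malformed sequence.
        if tag in ("B", "S") and current:
            words.append(current)
            current = []
        current.append(unit)
        # E or S closes the word that is currently open.
        if tag in ("E", "S"):
            words.append(current)
            current = []
    if current:
        words.append(current)
    return words


# Placeholder units stand in for real Tibetan text.
sentence = [["a"], ["b", "c"], ["d", "e", "f"]]
pairs = tag_sentence(sentence)
restored = tags_to_words(pairs)
```

In a pipeline like the one the abstract describes, pairs of this kind would be the training data for the sequence tagger, and a decoding step similar to `tags_to_words` would run on the predicted tags before any rule-based post-processing.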

摘要: 基于词位的统计分析方法识别并切分现代藏语文本中的黏写形式,其最大特点是减少了未登录词对识别效果的影响。首先根据藏文自身的特点,将常用的四词位扩充为六词位,再利用条件随机场模型作为标注建模工具来进行训练和测试,并根据规则对识别结果进行后处理。从实验结果来看,该方法有较高的识别正确率,具有进一步研究的价值。下一步的改进需要扩充训练语料,并对模型选用的特征集进行优化。

关键词: 藏文黏写形式, 词位, 条件随机场, 特征模板, 后处理