Computer Engineering and Applications ›› 2014, Vol. 50 ›› Issue (11): 218-222.

• Signal Processing •

基于词位的藏文黏写形式的切分

康才畯1,龙从军2,3,江  荻1,2   

  1. 上海师范大学 人文与传播学院,上海 200234
  2. 中国社科院 民族研究所,北京 100081
  3. 中央民族大学 民族语言监测分中心,北京 100081
  • 出版日期:2014-06-01 发布日期:2015-04-08

Segmentation of Tibetan abbreviated forms based on word position

KANG Caijun1, LONG Congjun2,3, JIANG Di1,2   

  1. College of Humanities and Communications, Shanghai Normal University, Shanghai 200234, China
  2. Institute of Ethnology & Anthropology, Chinese Academy of Social Sciences, Beijing 100081, China
  3. National Languages Resource Monitoring & Research Center of Minority Language Branch, Minzu University of China, Beijing 100081, China
  • Online: 2014-06-01 Published: 2015-04-08

摘要: 基于词位的统计分析方法识别并切分现代藏语文本中的黏写形式,其最大特点是减少了未登录词对识别效果的影响。首先根据藏文自身的特点,将常用的四词位扩充为六词位,再利用条件随机场模型作为标注建模工具来进行训练和测试,并根据规则对识别结果进行后处理。从实验结果来看,该方法有较高的识别正确率,具有进一步研究的价值。下一步的改进需要扩充训练语料,并对模型选用的特征集进行优化。

关键词: 藏文黏写形式, 词位, 条件随机场, 特征模板, 后处理

Abstract: This paper uses a word-position-based statistical method to identify and segment abbreviated forms in modern Tibetan text. Its main advantage is that it reduces the negative effect of unknown (out-of-vocabulary) words on recognition. To fit the characteristics of Tibetan, the commonly used 4-tag word-position set is extended to a 6-tag set. A conditional random field (CRF) serves as the tagging model for training and testing, and the recognition results are then post-processed with a rule base. The experimental results show that the method achieves a high recognition rate and merits further study. The next steps are to expand the training corpus and to optimize the feature template used by the model.

Key words: Tibetan abbreviated forms, word position, Conditional Random Field (CRF), feature template, post-processing
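
For readers unfamiliar with word-position tagging, the following is a minimal sketch of the widely used 4-tag scheme that the abstract takes as its starting point. It assumes the conventional tag names B (word-initial), M (word-medial), E (word-final), and S (single-unit word). The six-tag set proposed in the paper is not specified on this page and is not reproduced here. The function names and the placeholder syllables are hypothetical and serve only as illustration.

```python
from typing import List


def words_to_tags(words: List[List[str]]) -> List[str]:
    """Assign each syllable a B/M/E/S tag from its position in its word."""
    tags: List[str] = []
    for word in words:
        if len(word) == 1:
            tags.append("S")  # single-syllable word
        else:
            # first syllable B, last syllable E, anything between M
            tags.extend(["B"] + ["M"] * (len(word) - 2) + ["E"])
    return tags


def tags_to_words(syllables: List[str], tags: List[str]) -> List[List[str]]:
    """Rebuild the segmentation by closing a word at every E or S tag."""
    words: List[List[str]] = []
    current: List[str] = []
    for syllable, tag in zip(syllables, tags):
        current.append(syllable)
        if tag in ("E", "S"):
            words.append(current)
            current = []
    if current:  # tolerate a tag sequence that ends mid-word
        words.append(current)
    return words


# Placeholder syllables (hypothetical), grouped into three words.
segmented = [["a"], ["b", "c"], ["d", "e", "f"]]
syllables = [s for word in segmented for s in word]
tags = words_to_tags(segmented)
restored = tags_to_words(syllables, tags)
```

In a tagging-based segmenter, a sequence model such as a CRF predicts the tag sequence from the syllables, and a decoding step like `tags_to_words` turns the predicted tags back into word boundaries.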