Computer Engineering and Applications ›› 2015, Vol. 51 ›› Issue (3): 109-111.

• Database, Data Mining, and Machine Learning •

Tibetan names recognition research based on CRF
(基于条件随机场的藏文人名识别研究)

KANG Caijun (康才畯)1, LONG Congjun (龙从军)2, JIANG Di (江荻)1,2

  1. Humanities and Communications College, Shanghai Normal University, Shanghai 200234, China
  2. Institute of Ethnology & Anthropology, Chinese Academy of Social Sciences, Beijing 100081, China
  • Online: 2015-02-01  Published: 2015-01-28

Abstract: This paper uses a Conditional Random Field (CRF) model to recognize and segment Tibetan personal names at the character level. The advantage of this approach is that it makes good use of both the basic features of Tibetan personal names and their context features to determine the boundaries of names within a text sequence. A feature tag set is defined according to the characteristics of Tibetan personal names, and the CRF model serves as the tagging tool for training and testing. The experimental results show that the method achieves high recognition accuracy and merits further study. Future improvements include expanding the training corpus and optimizing the feature tag set to handle personal names that are homographs of ordinary words.

Key words: Tibetan personal names, Conditional Random Field (CRF), feature tag set
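
The character-level tagging idea described in the abstract can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration: the abstract does not list the paper's feature tag set or feature templates, so the B/I/E/S/O tags, the window size, and the helper names `window_features` and `decode_spans` are hypothetical stand-ins, and the placeholder units `"a"` to `"e"` are not real Tibetan text. The sketch shows only the two steps around the CRF itself: building context-window features for each unit, and turning a predicted tag sequence back into name boundaries.

```python
from typing import Dict, List, Tuple


def window_features(units: List[str], i: int, size: int = 2) -> Dict[str, str]:
    """Context features for position i: the unit itself plus its neighbours.

    Hypothetical feature template; positions outside the sequence are padded.
    """
    feats: Dict[str, str] = {}
    for offset in range(-size, size + 1):
        j = i + offset
        feats[f"u[{offset:+d}]"] = units[j] if 0 <= j < len(units) else "<PAD>"
    return feats


def decode_spans(tags: List[str]) -> List[Tuple[int, int]]:
    """Convert a tag sequence into half-open (start, end) name spans.

    Assumed tag set (not taken from the paper):
    B = name begins, I = inside a name, E = name ends,
    S = single-unit name, O = outside any name.
    """
    spans: List[Tuple[int, int]] = []
    start = None
    for i, tag in enumerate(tags):
        if tag == "S":
            spans.append((i, i + 1))
            start = None
        elif tag == "B":
            start = i
        elif tag == "E" and start is not None:
            spans.append((start, i + 1))
            start = None
        elif tag == "O":
            start = None
        # "I" simply continues the currently open name, if any.
    return spans


# Placeholder sequence standing in for a character-level Tibetan sentence.
units = ["a", "b", "c", "d", "e"]
tags = ["O", "B", "I", "E", "O"]  # as a trained tagger might predict
features = [window_features(units, i, size=1) for i in range(len(units))]
names = ["".join(units[s:e]) for s, e in decode_spans(tags)]
```

In an actual system the per-position feature dictionaries would be passed to a CRF toolkit for training and prediction; the choice of toolkit is left open here because the abstract does not name one.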