Segmentation of Tibetan abbreviated forms based on word position

Computer Engineering and Applications ›› 2014, Vol. 50 ›› Issue (11): 218-222.

KANG Caijun 1, LONG Congjun 2,3, JIANG Di 1,2

1. College of Humanities and Communications, Shanghai Normal University, Shanghai 200234, China
2. Institute of Ethnology & Anthropology, Chinese Academy of Social Sciences, Beijing 100081, China
3. National Languages Resource Monitoring & Research Center of Minority Language Branch, Minzu University of China, Beijing 100081, China

Online: 2014-06-01 Published: 2015-04-08

Abstract

This article identifies and segments abbreviated forms in modern Tibetan text using a statistical method based on word-position tagging. The main advantage of this approach is that it reduces the negative effect of unknown (out-of-vocabulary) words on recognition. To fit the characteristics of Tibetan, the commonly used 4-tag word-position set is first extended to a 6-tag set. A conditional random field (CRF) then serves as the tagging model for training and testing, and a rule base is applied to post-process the recognition results. The experimental results show that the method achieves high recognition accuracy and merits further study. Future work will expand the training corpus and optimize the feature template used by the model.

Key words: Tibetan abbreviated forms, word position, Conditional Random Field (CRF), feature template, post-processing
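
The word-position idea can be illustrated with a small sketch. The abstract does not list the six tags, so the labels below are assumptions: B, M, E and S are the usual 4-tag set (begin, middle, end, single), while `ES` and `SS` are hypothetical names for a word-final or single syllable that also carries a fused (abbreviated) particle. Syllables are written in Wylie transliteration, and the sample words were chosen only for illustration. In the article the tag sequence is predicted by a CRF; here the tags are derived from an already segmented sample, purely to show the encoding and how it can be decoded back into words.

```python
from typing import List, Tuple

# (syllables of one word, does the last syllable carry a fused particle?)
Word = Tuple[List[str], bool]


def tag_word(syllables: List[str], fused: bool) -> List[str]:
    """Assign one word-position tag per syllable of a single word.

    "ES" and "SS" are hypothetical labels for this sketch, not tags
    confirmed by the source.
    """
    if len(syllables) == 1:
        return ["SS" if fused else "S"]
    middle = ["M"] * (len(syllables) - 2)
    return ["B"] + middle + ["ES" if fused else "E"]


def tag_sentence(words: List[Word]) -> List[Tuple[str, str]]:
    """Flatten segmented words into (syllable, tag) pairs."""
    pairs: List[Tuple[str, str]] = []
    for syllables, fused in words:
        pairs.extend(zip(syllables, tag_word(syllables, fused)))
    return pairs


def decode(pairs: List[Tuple[str, str]]) -> List[Word]:
    """Regroup tagged syllables into words and flag fused final syllables."""
    words: List[Word] = []
    current: List[str] = []
    for syllable, tag in pairs:
        current.append(syllable)
        if tag in ("E", "S", "ES", "SS"):  # any tag that closes a word
            words.append((current, tag in ("ES", "SS")))
            current = []
    if current:  # tolerate a tag sequence that stops mid-word
        words.append((current, False))
    return words


# Illustrative words: "ngas" = nga + ergative -s, "slob grwa'i" = slob grwa + genitive 'i.
sample: List[Word] = [
    (["ngas"], True),
    (["slob", "grwa'i"], True),
    (["bkra", "shis"], False),
]
pairs = tag_sentence(sample)
restored = decode(pairs)
```

Because the fused cases get their own tags, a tagger trained on such pairs marks where an abbreviated form must be split without needing the word in a lexicon, which is consistent with the abstract's point about unknown words.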

KANG Caijun, LONG Congjun, JIANG Di. Segmentation of Tibetan abbreviated forms based on word position[J]. Computer Engineering and Applications, 2014, 50(11): 218-222.
