Research on machine translation based Uyghur morphological analysis

doi:10.3778/j.issn.1002-8331.1604-0119

Computer Engineering and Applications, 2017, Vol. 53, Issue (14): 138-142. DOI: 10.3778/j.issn.1002-8331.1604-0119

XU Chun1,2,3, YANG Yong4, JIANG Tonghai1

1. Xinjiang Technical Institute of Physics & Chemistry, Chinese Academy of Sciences, Urumqi 830011, China
2. University of Chinese Academy of Sciences, Beijing 100049, China
3. College of Computer Science and Engineering, Xinjiang University of Finance and Economics, Urumqi 830012, China
4. College of Computer Science and Technology, Xinjiang Normal University, Urumqi 830054, China

Online: 2017-07-15 Published: 2017-08-01

Abstract

Abstract: To alleviate data sparseness and reduce the complexity of model construction in Uyghur morphological analysis, this paper proposes a morphological analysis model based on Statistical Machine Translation (SMT). For word stemming (or Part-Of-Speech (POS) tagging), the sentences before stemming (before POS tagging) are treated as the source side of the SMT system, and the sentences after stemming (after POS tagging) as the target side. To optimize the model, an external information module (dictionaries) and a joint validation module are added. Experimental results show that the approach outperforms other models on both Uyghur word stemming and POS tagging. A comparison with experimental results for English (stemming, POS tagging) and Chinese (word segmentation, POS tagging) indicates that the approach is better suited to Uyghur morphological analysis.

Key words: Uyghur morphology analysis, machine translation based, word stemming, part-of-speech tagging, model optimization
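
The translation framing described in the abstract can be illustrated with a minimal sketch of how a parallel training pair might be derived from one annotated sentence. This is not the authors' implementation: the `Token` layout, the `to_parallel` helper, the POS tag names, and the toy transliterated sentence are all assumptions made for illustration, and the SMT training step itself is omitted.

```python
from typing import List, Tuple

# One annotated token: (surface form, stem, POS tag).
# The layout and the tag names are hypothetical, chosen for illustration.
Token = Tuple[str, str, str]


def to_parallel(sentence: List[Token], task: str) -> Tuple[str, str]:
    """Return (source_line, target_line) for one annotated sentence.

    The source side is always the unprocessed surface text; the target
    side is the stemmed text ("stem") or the tag sequence ("pos").
    """
    source = " ".join(surface for surface, _, _ in sentence)
    if task == "stem":
        target = " ".join(stem for _, stem, _ in sentence)
    elif task == "pos":
        target = " ".join(pos for _, _, pos in sentence)
    else:
        raise ValueError(f"unknown task: {task}")
    return source, target


# Toy sentence in Latin transliteration (illustrative only).
sentence = [("kitablar", "kitab", "N"), ("yaxshi", "yaxshi", "ADJ")]
stem_pair = to_parallel(sentence, "stem")
pos_pair = to_parallel(sentence, "pos")
```

Under these assumptions, each task yields its own source/target corpus, so the same translation toolkit could in principle be trained once per task without task-specific model design.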

XU Chun, YANG Yong, JIANG Tonghai. Research on machine translation based Uyghur morphological analysis[J]. Computer Engineering and Applications, 2017, 53(14): 138-142.
