Computer Engineering and Applications ›› 2013, Vol. 49 ›› Issue (1): 171-175.


Uyghur noun stemming system based on hybrid method

ZAOKERE Kadeer1,2, AISHAN Wumaier1,2, TUERGEN Yibulayin1,2, PARIDA Tursun2,3, WU Xiaochuan1,2   

  1. School of Information Science and Engineering, Xinjiang University, Urumqi 830046, China
  2. Xinjiang Laboratory of Multi-language Information Technology, Urumqi 830046, China
  3. School of Software, Xinjiang University, Urumqi 830046, China
  • Online:2013-01-01 Published:2013-01-16

混合策略的维吾尔语名词词干提取系统

早克热·卡德尔1,2,艾山·吾买尔1,2,吐尔根·依布拉音1,2,帕里旦·吐尔逊2,3,吴小川1,2   

  1. 新疆大学 信息科学与工程学院,乌鲁木齐 830046
  2. 新疆多语种信息技术重点实验室,乌鲁木齐 830046
  3. 新疆大学 软件学院,乌鲁木齐 830046

Abstract: This paper studies Uyghur noun stemming. Based on an analysis of Uyghur noun morphology, a Finite State Machine (FSM) is constructed. The errors made by the FSM are then analyzed and, according to their characteristics, a maximum entropy model is integrated into the FSM to disambiguate ambiguous suffixes. Finally, a noisy channel model is used to resolve vowel neutralization. Combining these three models yields a stemming method based on rules and statistics. To make effective use of existing resources and improve system performance, a dictionary-based approach is also integrated into the Uyghur noun stemming system. The resulting system is more robust and performs better, with precision remaining above 95%.

Key words: Uyghur, agglutinative, Finite State Machine (FSM), noisy channel, stemming
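
The pipeline summarized in the abstract (dictionary lookup, FSM suffix stripping, noisy channel restoration of the stem vowel) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the tiny stem list and its counts, the reduced suffix inventory, the simplified automaton STEM (PLURAL)? (CASE)?, the channel probabilities, and the function names `fsm_strip`, `restore_stem` and `stem`. The maximum entropy disambiguation step is omitted. The example words use Latin-script Uyghur, with the familiar alternation bala + lar → balilar standing in for the vowel change that the noisy channel model has to undo.

```python
# Toy resources: every entry and number below is illustrative, not from the paper.
STEM_COUNTS = {"kitab": 50, "bala": 40, "sheher": 30}  # hypothetical stem frequencies
PLURAL = ("lar", "ler")                   # simplified plural suffixes
CASE = ("ning", "din", "ni", "da", "de")  # simplified case suffixes
VOWELS = "aeiouöüé"


def fsm_strip(word):
    """Greedy pass over a simplified STEM (PLURAL)? (CASE)? automaton, read right to left."""
    suffixes = []
    for group in (CASE, PLURAL):  # the case suffix is outermost, so it is stripped first
        for suf in group:
            if word.endswith(suf) and len(word) > len(suf):
                word = word[: -len(suf)]
                suffixes.insert(0, suf)
                break
    return word, suffixes


def restore_stem(surface):
    """Noisy channel step: return the stem maximizing P(stem) * P(surface | stem)."""
    last = max((surface.rfind(v) for v in VOWELS), default=-1)
    candidates = [(surface, 1.0)]  # underlying form identical to the surface form
    if last >= 0 and surface[last] == "i":
        for v in ("a", "e"):  # assumed channel: an underlying a/e may surface as i
            candidates.append((surface[:last] + v + surface[last + 1:], 0.5))
    total = sum(STEM_COUNTS.values())

    def score(item):
        candidate, channel = item
        prior = (STEM_COUNTS.get(candidate, 0) + 0.01) / total  # crude smoothing
        return prior * channel

    return max(candidates, key=score)[0]


def stem(word):
    """Hybrid strategy: dictionary first, then FSM stripping plus vowel restoration."""
    if word in STEM_COUNTS:  # dictionary-based shortcut
        return word
    base, suffixes = fsm_strip(word)
    return restore_stem(base) if suffixes else base


print(stem("balilarni"))   # strips -ni and -lar, then restores the stem vowel
print(stem("kitablarni"))
```

Because the stripping here is greedy, a stem that happens to end in a suffix-like string would be over-stemmed; that is the kind of ambiguity the abstract assigns to the maximum entropy model, which this sketch does not attempt to reproduce.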

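Abstract: Through a study of the morphological structure of Uyghur nouns, a finite state machine (FSM) for nouns is constructed. To address the shortcomings of the FSM, a maximum entropy model is used to give it the ability to identify ambiguous suffixes. Based on the characteristics of Uyghur vowel harmony, a vowel harmony processing method using rules and the noisy channel model is established. These three methods are combined into a noun stemming method based on rules and statistics. To make effective use of existing resources and improve system performance, a dictionary-based stemming method is combined with the rule-and-statistics method, producing a Uyghur noun stemming system that integrates multiple strategies. The system is highly robust, and its accuracy remains above 95%.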

Key words: Uyghur, agglutinative language, finite state machine, noisy channel, stemming