Computer Engineering and Applications ›› 2008, Vol. 44 ›› Issue (21): 5-8. DOI: 10.3778/j.issn.1002-8331.2008.21.002

• Doctoral Forum •

Hybrid model for overlapping ambiguities resolution

LI Tian-xia, DAI Xin-yu, CHEN Jia-jun

  1. State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210093, China
    Department of Computer Science and Technology, Nanjing University, Nanjing 210093, China
  • Received: 2008-04-30 Revised: 2008-06-02 Online: 2008-07-21 Published: 2008-07-21
  • Corresponding author: LI Tian-xia

Abstract: Overlapping ambiguity is one of the key difficulties in Chinese word segmentation. This paper presents a hybrid disambiguation model that combines rule-based and statistical methods. First, a set of disambiguation rules is learned automatically from an annotated corpus through error-driven learning, and an N-gram statistical language model is built with statistical tools. Next, overlapping ambiguous strings are detected using forward and backward maximum matching (FMM and BMM) together with the rule set. Finally, the detected ambiguities are resolved by the rule set and a score function based on the N-gram language model. The hybrid approach can detect more overlapping ambiguous strings and combines the strengths of rule-based and statistical methods in handling them. Experiments show that it improves the precision of overlapping ambiguity resolution and offers a new approach to the problem.

Key words: overlapping ambiguity, disambiguation rules, statistical language model, score function, full segmentation
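
The detection and scoring steps named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: it covers only the statistical half (FMM/BMM disagreement as the ambiguity signal, then a bigram score to pick a segmentation) and omits the error-driven rule set entirely. The function names, the toy lexicon, the hand-set log-probabilities, and the `unseen` penalty are all assumptions made for illustration; the abstract does not specify the actual score function.

```python
def fmm(text, lexicon, max_len=4):
    """Forward maximum matching: take the longest lexicon word from the left."""
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in lexicon:  # single characters always match
                words.append(piece)
                i += size
                break
    return words


def bmm(text, lexicon, max_len=4):
    """Backward maximum matching: take the longest lexicon word from the right."""
    words, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in lexicon:
                words.append(piece)
                j -= size
                break
    words.reverse()
    return words


def bigram_score(words, bigram_logp, unseen=-10.0):
    """Sum of bigram log-probabilities; `unseen` is an assumed flat penalty."""
    padded = ["<s>"] + words + ["</s>"]
    return sum(bigram_logp.get(pair, unseen) for pair in zip(padded, padded[1:]))


def segment(text, lexicon, bigram_logp):
    """Return (segmentation, ambiguity_detected)."""
    forward, backward = fmm(text, lexicon), bmm(text, lexicon)
    if forward == backward:
        return forward, False
    # FMM and BMM disagree: treat as an ambiguity and keep the higher-scoring one.
    best = max((forward, backward), key=lambda w: bigram_score(w, bigram_logp))
    return best, True


# Toy data (hypothetical values, not taken from the paper).
lexicon = {"研究", "研究生", "生命"}
bigram_logp = {("<s>", "研究"): -1.0, ("研究", "生命"): -2.0, ("生命", "</s>"): -1.0}
print(segment("研究生命", lexicon, bigram_logp))  # (['研究', '生命'], True)
```

Here FMM yields 研究生 / 命 and BMM yields 研究 / 生命, so the string is flagged as ambiguous and the bigram score selects the BMM result. Cases where FMM and BMM agree on a wrong segmentation go undetected by this sketch, which is the gap the abstract says the rule set is meant to narrow.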