Computer Engineering and Applications, 2009, Vol. 45, Issue (17): 33-36. DOI: 10.3778/j.issn.1002-8331.2009.17.010

• Doctoral Forum •

Performance of smoothing algorithm in Chinese word segmentation by bigram

LIU Dan, FANG Wei-guo, ZHOU Hong

  1. School of Economy and Management, Beihang University, Beijing 100191, China
  • Received: 2009-02-02 Revised: 2009-03-09 Online: 2009-06-11 Published: 2009-06-11
  • Contact: LIU Dan

Abstract: Several smoothing algorithms are applied to bigram-based Chinese word segmentation. Using the People's Daily corpus of January 1998, this paper discusses the relationship between perplexity and actual segmentation performance, and compares the actual performance of the smoothing algorithms. The results show that simple additive smoothing performs best, with precision and recall of 99.68% and 99.7% in the closed test, and 98.64% and 98.74% in the open test.

Key words: smoothing, Chinese word segmentation, bigram
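
As an illustration of the two quantities the abstract relates, the following minimal Python sketch shows additive smoothing of a bigram model and the perplexity computed from it. It is not the authors' implementation: the function names, the `<s>`/`</s>` boundary tokens, the toy corpus, and the value `delta=1.0` are all assumptions made for this example, and the abstract does not state which additive constant or decoding procedure the paper used.

```python
import math
from collections import Counter

def train_bigram(sentences):
    """Count bigram contexts and bigrams over pre-segmented sentences."""
    context_counts = Counter()
    bigram_counts = Counter()
    vocab = set()
    for words in sentences:
        # Hypothetical boundary tokens, added so sentence edges are modeled.
        tokens = ["<s>"] + list(words) + ["</s>"]
        vocab.update(tokens)
        for prev, cur in zip(tokens, tokens[1:]):
            context_counts[prev] += 1
            bigram_counts[(prev, cur)] += 1
    return context_counts, bigram_counts, vocab

def additive_prob(prev, cur, context_counts, bigram_counts, vocab_size, delta=1.0):
    """Additive smoothing: P(cur | prev) = (c(prev, cur) + delta) / (c(prev) + delta * V)."""
    return (bigram_counts[(prev, cur)] + delta) / (context_counts[prev] + delta * vocab_size)

def perplexity(sentences, context_counts, bigram_counts, vocab_size, delta=1.0):
    """Perplexity = 2 ** (-(1/N) * sum of log2 P(cur | prev)) over all N bigrams."""
    log_sum = 0.0
    n = 0
    for words in sentences:
        tokens = ["<s>"] + list(words) + ["</s>"]
        for prev, cur in zip(tokens, tokens[1:]):
            p = additive_prob(prev, cur, context_counts, bigram_counts, vocab_size, delta)
            log_sum += math.log2(p)
            n += 1
    return 2 ** (-log_sum / n)

# Toy pre-segmented corpus (an assumption for illustration, not the People's Daily data).
corpus = [["我", "爱", "北京"], ["我", "爱", "中国"]]
ctx, bi, vocab = train_bigram(corpus)
V = len(vocab)  # 6: four words plus the two boundary tokens

print(additive_prob("我", "爱", ctx, bi, V))  # (2 + 1) / (2 + 1 * 6) = 0.375
print(perplexity(corpus, ctx, bi, V))
```

In this sketch every bigram, seen or unseen, receives a non-zero probability, so a segmenter that scores candidate segmentations by bigram probability never multiplies by zero. Words outside the training vocabulary are not handled here and would need a separate treatment.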