计算机工程与应用, 2011, Vol. 47, Issue (4): 117-120. DOI: 10.3778/j.issn.1002-8331.2011.04.032

• 数据库、信号与信息处理 •

汉语分词中上文和下文重要性比较

于江德1,王希杰1,樊孝忠2   

  1. 安阳师范学院 计算机与信息工程学院,河南 安阳 455002
  2. 北京理工大学 计算机科学技术学院,北京 100081
  • 收稿日期:2010-06-21 修回日期:2010-08-25 出版日期:2011-02-01 发布日期:2011-02-01
  • 通讯作者: 于江德

Comparing of importance of above-context versus below-context for Chinese word segmentation

YU Jiangde1, WANG Xijie1, FAN Xiaozhong2

  1. School of Computer and Information Engineering, Anyang Normal University, Anyang, Henan 455002, China
  2. School of Computer Science and Technology, Beijing Institute of Technology, Beijing 100081, China
  • Received:2010-06-21 Revised:2010-08-25 Online:2011-02-01 Published:2011-02-01
  • Contact: YU Jiangde

摘要: 上下文是统计语言学中获取语言知识和解决自然语言处理中多种实际应用问题必须依靠的资源和基础。近年来基于字的词位标注的方法极大地提高了汉语分词的性能,该方法将汉语分词转化为字的词位标注问题,当前字的词位标注需要借助于该字的上下文来确定。为克服仅凭主观经验给出猜测结果的不足,采用四词位标注集,使用条件随机场模型研究了词位标注汉语分词中上文和下文对分词性能的贡献情况,在国际汉语分词评测Bakeoff2005的PKU和MSRA两种语料上进行了封闭测试,采用分别表征上文和下文的特征模板集进行了对比实验,结果表明,下文对分词性能的贡献比上文的贡献高出13个百分点以上。

关键词: 汉语分词, 上下文, 条件随机场, 词位标注, 特征模板

Abstract: Context is an essential resource both for acquiring linguistic knowledge in statistical linguistics and for solving many practical problems in natural language processing. In recent years, character-based word-position tagging has greatly improved the performance of Chinese word segmentation. This approach recasts segmentation as the problem of tagging each character with its position in a word, and the tag of the current character is determined with the help of that character's context. To avoid relying on guesses based only on subjective experience, this paper uses a four-tag word-position set and conditional random fields to study how much the above-context and the below-context each contribute to segmentation performance. Closed tests are performed on the PKU and MSRA corpora from the Second International Chinese Word Segmentation Bakeoff (Bakeoff-2005), and comparative experiments are run with feature template sets that represent the above-context and the below-context separately. The results show that the below-context contributes more than 13 percentage points more to segmentation performance than the above-context.

Key words: Chinese word segmentation, context, conditional random fields, word-position tagging, feature template
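
The word-position tagging formulation described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The abstract does not name the four tags or list the feature templates, so the B/M/E/S tag names, the two-character window, the `<PAD>` symbol, and all function names below are assumptions made for illustration only.

```python
from typing import Dict, List


def words_to_tags(words: List[str]) -> List[str]:
    """Map a segmented sentence to one word-position tag per character.

    Assumed four-tag set: B (begin), M (middle), E (end), S (single-character word).
    """
    tags: List[str] = []
    for word in words:
        if len(word) == 1:
            tags.append("S")
        else:
            tags.extend(["B"] + ["M"] * (len(word) - 2) + ["E"])
    return tags


def tags_to_words(chars: str, tags: List[str]) -> List[str]:
    """Rebuild the segmentation from characters and their tags."""
    words: List[str] = []
    buffer = ""
    for char, tag in zip(chars, tags):
        buffer += char
        if tag in ("E", "S"):  # a word closes on E or S
            words.append(buffer)
            buffer = ""
    if buffer:  # tolerate a sequence that ends without E or S
        words.append(buffer)
    return words


def context_features(chars: str, i: int, side: str) -> Dict[str, str]:
    """Character unigram features drawn from one side of position i.

    side="above" uses the current character and the two before it;
    side="below" uses the current character and the two after it.
    The window size of 2 is an assumption, not taken from the paper.
    """
    offsets = {"above": (-2, -1, 0), "below": (0, 1, 2)}[side]
    features: Dict[str, str] = {}
    for k in offsets:
        j = i + k
        features[f"C{k:+d}"] = chars[j] if 0 <= j < len(chars) else "<PAD>"
    return features


words = ["汉语", "分词", "很", "重要"]
chars = "".join(words)
tags = words_to_tags(words)
above = context_features(chars, 0, "above")
below = context_features(chars, 0, "below")
```

Training one sequence labeller on the `above` features and another on the `below` features, then comparing their segmentation scores, mirrors the kind of comparison the abstract reports. The feature dictionaries would be passed to a conditional random field toolkit of the reader's choice.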
