Computer Engineering and Applications ›› 2011, Vol. 47 ›› Issue (31): 125-127.

• Database, Signal and Information Processing •

Research on combinational ambiguity strings in Chinese word segmentation

YOU Huili, YAN Li, YANG Xiaodong

  1. School of Computer Science and Telecommunication Engineering, Jiangsu University, Zhenjiang, Jiangsu 212013, China
  • Received: 1900-01-01 Revised: 1900-01-01 Online: 2011-11-01 Published: 2011-11-01

Abstract: Combinational ambiguity is one of the most difficult problems in automatic Chinese word segmentation. This paper proposes a new segmentation algorithm for disambiguating combinational ambiguity strings. The algorithm first automatically extracts the contextual information of ambiguous strings from a training corpus to build a rule base, and then combines these rules with a C-SVM model to resolve the ambiguity. Training and testing on the combinational ambiguity strings that occur in the People's Daily corpus of January 1998 show an average disambiguation accuracy of 89.33%.

Key words: Chinese word segmentation, combinational ambiguity, contextual information, C-Support Vector Machine (C-SVM)
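The first stage described in the abstract, extracting context around ambiguous strings to build a rule base, can be illustrated with a small sketch. The abstract does not specify the rule format, the context window, or the features passed to the C-SVM, so everything below is an assumption made for illustration: the toy corpus, the one-word left/right window, and the function names `extract_instances`, `build_rule_base`, and `disambiguate` are all hypothetical. The example string "才能" is a standard combinational ambiguity: it can be one word ("talent") or two words "才" + "能" ("only then" + "can").

```python
from collections import Counter, defaultdict

# Hypothetical toy corpus of pre-segmented sentences (not the paper's data).
CORPUS = [
    ["他", "的", "才能", "很", "高"],      # "才能" kept as one word
    ["只有", "努力", "才", "能", "成功"],  # split into "才" + "能"
    ["她", "的", "才能", "出众"],
    ["这样", "才", "能", "完成"],
]


def extract_instances(sentence, target, parts):
    """Return (left_word, right_word, label) for each occurrence of the
    ambiguous string, labelled "joined" or "split"."""
    instances = []
    n = len(parts)
    i = 0
    while i < len(sentence):
        if sentence[i] == target:
            width, label = 1, "joined"
        elif sentence[i:i + n] == parts:
            width, label = n, "split"
        else:
            i += 1
            continue
        left = sentence[i - 1] if i > 0 else "<s>"
        right = sentence[i + width] if i + width < len(sentence) else "</s>"
        instances.append((left, right, label))
        i += width
    return instances


def build_rule_base(corpus, target, parts):
    """Keep context words that only ever co-occur with a single label
    (an assumed rule format, not necessarily the paper's)."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        for left, right, label in extract_instances(sentence, target, parts):
            counts[("L", left)][label] += 1
            counts[("R", right)][label] += 1
    return {key: next(iter(c)) for key, c in counts.items() if len(c) == 1}


def disambiguate(rules, left, right):
    """Apply the rules; return None when no rule fires or the rules conflict,
    which is where a trained classifier such as a C-SVM would decide."""
    votes = {rules[k] for k in (("L", left), ("R", right)) if k in rules}
    return votes.pop() if len(votes) == 1 else None


rules = build_rule_base(CORPUS, "才能", ["才", "能"])
print(disambiguate(rules, "的", "很"))
print(disambiguate(rules, "的", "成功"))
```

In this sketch the rules settle only the unambiguous contexts, and a `None` result marks the cases that would be handed to the classifier. How the paper actually combines the rule base with the C-SVM is not stated in the abstract.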