Computer Engineering and Applications, 2016, Vol. 52, Issue (17): 140-145.

• Pattern Recognition and Artificial Intelligence •

A study of the optimal sample proportion for Chinese coreference resolution

颜  晗,刘  娟,周炫余   

  1. 武汉大学 计算机学院,武汉 430072
  • Publication date: 2016-09-01  Release date: 2016-09-14

Optimal proportion of training data for Chinese coreference resolution

YAN Han, LIU Juan, ZHOU Xuanyu   

  1. School of Computer, Wuhan University, Wuhan 430072, China
  • Online: 2016-09-01  Published: 2016-09-14

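Abstract: Most existing research on Chinese coreference resolution systems relies on supervised machine learning, where the ratio of positive to negative instances in the training set directly affects the classifier model and, in turn, the resolution results. To address the question of how to choose this ratio, the authors implement a Chinese coreference resolution system, propose a mathematical model relating the positive-to-negative ratio of the training data to the system's evaluation results, and introduce an improved genetic algorithm that computes the ratio at which the evaluation results are best. Experiments on the ACE 2005 Chinese corpus show that the improved genetic algorithm is better suited to the coreference resolution task and that appropriately increasing the share of negative instances improves the performance of the coreference resolution system.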

Keywords: coreference resolution, training data, genetic algorithm

Abstract: Most Chinese coreference resolution systems are based on supervised machine learning, and the proportion of positive to negative examples in the training set greatly affects classifier performance. To determine this proportion, a Chinese coreference resolution system is implemented, a mathematical model relating the proportion of training data to the system's evaluation results is proposed, and an improved genetic algorithm is applied to solve the resulting optimization model. Evaluation on the ACE 2005 Chinese corpus shows that the improved algorithm is more effective and that better performance can be achieved by increasing the proportion of negative examples.

Key words: coreference resolution, training data, genetic algorithm (GA)
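
The search described in the abstract, a genetic algorithm looking for the negative-to-positive ratio that maximizes the system's evaluation score, can be illustrated with a minimal sketch. This is a plain textbook GA, not the improved variant proposed in the paper, whose details are not given here. The names `toy_fitness` and `ga_search`, the search range, and every parameter value are assumptions made for illustration. In particular, `toy_fitness` is a hypothetical stand-in for "train the classifier at this ratio and score the resolver"; its peak at 3.0 is arbitrary and is not a result from the paper.

```python
import random


def toy_fitness(ratio):
    # Hypothetical stand-in for training a classifier with `ratio` negatives
    # per positive and scoring the resolver. The peak at 3.0 is an arbitrary
    # choice for this illustration, not a value reported by the paper.
    return -(ratio - 3.0) ** 2


def ga_search(fitness, low=1.0, high=10.0, pop_size=20, generations=40,
              mutation_sigma=0.3, seed=0):
    """Search [low, high] for the ratio with the highest fitness."""
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    population = [rng.uniform(low, high) for _ in range(pop_size)]
    for _ in range(generations):
        # Selection: keep the better half of the population unchanged.
        population.sort(key=fitness, reverse=True)
        parents = population[: pop_size // 2]
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            child = (a + b) / 2.0                    # arithmetic crossover
            child += rng.gauss(0.0, mutation_sigma)  # Gaussian mutation
            children.append(min(high, max(low, child)))  # clamp to range
        population = parents + children
    return max(population, key=fitness)


best_ratio = ga_search(toy_fitness)
print(f"best negative-to-positive ratio found: {best_ratio:.2f}")
```

In a real system the fitness call would be the expensive step, since each evaluation means resampling the training set, retraining the classifier, and rescoring the resolver, so caching scores per ratio would likely matter more than any GA parameter.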