Computer Engineering and Applications, 2022, Vol. 58, Issue (8): 147-155. DOI: 10.3778/j.issn.1002-8331.2009-0422

• Pattern Recognition and Artificial Intelligence •

Research on Multi-Dimensional End-to-End Phrase Recognition Algorithm Based on Background Knowledge

LIU Guang, TU Gang, LI Zheng, LIU Yijian, ZHAN Zhiqiang   

  1. School of Computer Science and Technology, Huazhong University of Science and Technology, Wuhan 430074, China
  • Online: 2022-04-15 Published: 2022-04-15

支持背景知识的多维端到端短语识别算法研究

刘广,涂刚,李政,刘译键,占志强   

  1. 华中科技大学 计算机科学与技术学院,武汉 430074

Abstract: Entity recognition and dependency analysis currently rely mainly on deep end-to-end methods based on supervised learning. These methods have two problems: first, they cannot incorporate background knowledge; second, they cannot recognize the multi-granularity and nested features of natural language. To address these problems, this paper proposes a dependency syntax annotation rule based on phrase windows, annotates the Chinese Phrase Window Dataset (CPWD), and designs a companion multi-dimensional end-to-end phrase recognition model (the MDM model). The annotation rule takes the phrase as its minimum unit, divides sentences into seven nestable phrase types, and marks the dependencies between phrases. The MDM model can incorporate background knowledge, recognize the various nested phrases in a sentence, and also recognize the dependencies between phrases. Experimental results show that the annotation rule is easy to use and unambiguous, and that the MDM model handles phrase nesting more effectively than traditional end-to-end algorithms. On the CPWD dataset, the MDM model improves the F1 score by more than 1 percentage point over the end-to-end method. The corresponding method was also applied in the CCL2018 Chinese Metaphorical Emotion Analysis Competition, where it improved on the existing results by more than 1 percentage point and won first place.

Key words: natural language processing, annotation system, phrase recognition, dependency analysis

摘要: 目前,实体识别与依存关系分析,采用的主要是基于监督学习的深度端到端方法。这种方法存在两个问题:不能引入背景知识;不能识别出自然语言的多粒度、嵌套特征。为了解决以上问题,提出了基于短语窗口的依存句法标注规则,并标注了中文短语窗口数据集(CPWD),同时设计了配套的多维端到端短语识别模型(MDM模型)。该标注规则以短语为最小单位,把句子分成7类可嵌套的短语类型,同时标示出短语之间的依存关系。MDM模型不仅可以引入背景知识,识别出句子中的各类嵌套短语,而且可以识别出短语之间的依存关系。实验结果表明,该标注规则方便易用。同时,MDM模型比传统端到端算法能更有效地处理短语嵌套的问题。在CPWD数据集上实验,MDM模型比端到端方法在[F1]值上提高1个百分点以上。相应的方法应用到了CCL2018的中文隐喻情感分析比赛中,在原有基础上提升了1个百分点以上,并取得第一名成绩。

关键词: 自然语言处理, 标注体系, 短语识别, 依存分析