Computer Engineering and Applications ›› 2010, Vol. 46 ›› Issue (18): 138-141. DOI: 10.3778/j.issn.1002-8331.2010.18.044

• Database, Signal and Information Processing •

A Study on Recognizing Maximal Noun Phrases with a “De-Phrase” (“的”字结构) Core

钱小飞   

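  1. School of Chinese Language and Literature, Communication University of China, Beijing 100024
  • Received: 2008-12-23 Revised: 2009-03-13 Published: 2010-06-21 Released: 2010-06-21
  • Corresponding author: QIAN Xiao-fei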

Recognition of MNP with “De-Phrase” core

QIAN Xiao-fei   

  1. School of Chinese Language and Literature, Communication University of China, Beijing 100024, China
  • Received: 2008-12-23 Revised: 2009-03-13 Online: 2010-06-21 Published: 2010-06-21
  • Contact: QIAN Xiao-fei

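Abstract (Chinese version): The maximal noun phrase (MNP) with a “De-Phrase” (“的”字结构) core is a special subclass of Chinese MNPs. Taking the automatic recognition of this phrase type as a basis, the paper re-partitions the Chinese MNP recognition task. After examining the structural and distributional features of the phrase, it proposes the strategy “identify the right boundary first, then let that result take part in identifying the left boundary”, and uses a boundary distribution probability model to handle the left and right boundaries separately. The model was trained on a news corpus of 850,000 characters and evaluated in an open test on a homogeneous corpus of 420,000 characters, achieving 80.63% precision and 75.68% recall.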

Keywords (Chinese version): maximal noun phrase, “De-Phrase” (“的”字结构), recognition, shallow parsing
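
The abstract gives only the outline of the strategy: find the right boundary first, then find the left boundary, scoring each with a boundary distribution probability. The Python sketch below illustrates that outline and is not the paper's implementation. The POS-context features, the relative-frequency estimates, the 0.5 threshold, the leftmost tie-break, the function names `train` and `recognize`, and the toy sentences are all assumptions made for illustration.

```python
from collections import Counter

DE = "的"

def _pad(tokens):
    # Sentinel tokens so every position has a left and right neighbour.
    return [("<s>", "BOS")] + list(tokens) + [("</s>", "EOS")]

def train(corpus):
    """Estimate boundary probabilities by relative frequency (assumed model).

    corpus: list of (tokens, spans); tokens are (word, pos) pairs and
    spans are (start, end) token indices, end inclusive and pointing at 的.
    """
    right_hit, right_all = Counter(), Counter()
    left_hit, left_all = Counter(), Counter()
    for tokens, spans in corpus:
        padded = _pad(tokens)
        starts = {s for s, _ in spans}
        ends = {e for _, e in spans}
        for i, (word, pos) in enumerate(tokens):
            # Left-boundary context: (POS of previous token, POS of this token).
            left_ctx = (padded[i][1], pos)
            left_all[left_ctx] += 1
            left_hit[left_ctx] += i in starts
            if word == DE:
                # Right-boundary context: POS of the token following 的.
                right_ctx = padded[i + 2][1]
                right_all[right_ctx] += 1
                right_hit[right_ctx] += i in ends
    return {
        "right": {c: right_hit[c] / n for c, n in right_all.items()},
        "left": {c: left_hit[c] / n for c, n in left_all.items()},
    }

def recognize(tokens, model, threshold=0.5):
    """Return (start, end) spans: right boundary first, then left boundary."""
    padded = _pad(tokens)
    spans, window_start = [], 0
    for i, (word, _) in enumerate(tokens):
        if word != DE or i == window_start:
            continue
        # Step 1: accept this 的 as a right boundary if it is probable enough.
        if model["right"].get(padded[i + 2][1], 0.0) < threshold:
            continue

        # Step 2: the accepted right boundary anchors the leftward search.
        def p_left(j):
            return model["left"].get((padded[j][1], tokens[j][1]), 0.0)

        # max() keeps the first maximum, so ties go to the leftmost
        # (longest) candidate.
        start = max(range(window_start, i), key=p_left)
        spans.append((start, i))
        window_start = i + 1
    return spans

# Toy annotated data (invented for this sketch, not from the paper's corpus).
corpus = [
    ([("他", "r"), ("是", "v"), ("卖", "v"), ("菜", "n"), ("的", "u"), ("。", "w")],
     [(2, 4)]),
    ([("我", "r"), ("的", "u"), ("书", "n"), ("丢", "v"), ("了", "u")],
     []),
]
model = train(corpus)
sentence = [("她", "r"), ("找", "v"), ("修", "v"), ("车", "n"), ("的", "u"), ("。", "w")]
found = recognize(sentence, model)  # [(2, 4)], i.e. 修 车 的
```

On the toy data, a 的 followed by punctuation is accepted as a right boundary, while a 的 followed by a noun (我 的 书) is rejected because it is a modifier marker there, not the phrase core.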

Abstract: The maximal noun phrase (MNP) with a “De-Phrase” core is a special subclass of Chinese MNPs. Recognizing this phrase type leads to a new subdivision of the Chinese MNP recognition task. The paper first analyzes the distribution and structural features of the phrase, then proposes the strategy “identify the right boundary first, then identify the left one”. It further adopts a “Boundary Distribution Probability” model to recognize the phrase. A news corpus of about 0.85 million Chinese characters is used for training and another of about 0.42 million Chinese characters for testing; the experiment achieves 80.63% precision and 75.68% recall.

Key words: Maximal Noun Phrase (MNP), De-Phrase, identification, shallow parsing
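
For readers who want to relate the two reported figures, the sketch below shows how span-level precision and recall are commonly computed and how a balanced F1 follows from them. It assumes exact-match scoring of (sentence, start, end) spans, which the abstract does not specify. The abstract reports no F-measure; `derived_f1` is simple arithmetic on the two published numbers, and the gold and predicted spans are invented toy data.

```python
def precision_recall(gold, predicted):
    """Exact-match precision and recall over sets of span identifiers."""
    gold, predicted = set(gold), set(predicted)
    correct = len(gold & predicted)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall

def f1(precision, recall):
    """Balanced F-measure: harmonic mean of precision and recall."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0

# Toy spans as (sentence_id, start, end); invented for illustration.
gold = {(0, 2, 4), (1, 0, 3), (2, 5, 7)}
pred = {(0, 2, 4), (1, 1, 3)}
p, r = precision_recall(gold, pred)  # 1 of 2 predictions correct, 1 of 3 gold spans found

# F1 derived from the reported 80.63% precision and 75.68% recall.
derived_f1 = f1(0.8063, 0.7568)  # about 0.7808
```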
