Computer Engineering and Applications ›› 2010, Vol. 46 ›› Issue (4): 139-141. DOI: 10.3778/j.issn.1002-8331.2010.04.045

• Database, Signal and Information Processing •

Recognition of parallel four-character idioms in word-segmented corpora

XU Run-hua, CHEN Xiao-he, LI Bin

  1. College of Liberal Arts, Nanjing Normal University, Nanjing 210097, China
  • Received: 2008-09-12 Revised: 2008-12-11 Online: 2010-02-01 Published: 2010-02-01
  • Contact: XU Run-hua

Abstract: The parallel four-character idiom (PFCI) is a distinctive yet very common type of Chinese four-character idiom. This paper first describes work on extracting four-character idioms from a POS-tagged corpus with a Conditional Random Fields (CRF) model. Building on that work, it analyzes the structural characteristics of PFCIs and proposes a method for recognizing them in word-segmented corpora. Comparative experiments across different corpora show that the method achieves relatively high precision and a certain degree of adaptability.
