Computer Engineering and Applications ›› 2010, Vol. 46 ›› Issue (16): 153-156. DOI: 10.3778/j.issn.1002-8331.2010.16.045

• Database, Signal and Information Processing •

Information extraction using tree automata inference technique

TAN Peng-xu, ZHANG Lai-shun

  1. Institute of Electronic Technology, the PLA Information Engineering University, Zhengzhou 450004, China
  • Received: 2008-11-19 Revised: 2009-02-18 Online: 2010-06-01 Published: 2010-06-01
  • Contact: TAN Peng-xu

Abstract: This paper proposes an information extraction technique based on an improved k-contextual tree automata inference algorithm. The key idea is to convert structured or semi-structured documents into trees and then use an improved k-contextual tree language, called the KLH tree language, to construct an unranked tree automaton that accepts the sample trees; whether information is extracted from a web page is decided by whether the automaton accepts or rejects it. The method makes full use of the tree structure of web documents and, by means of tree automata, combines traditional extraction methods that rely on structure alone with the principles of grammatical inference to obtain extraction rules. Experiments show that, compared with similar extraction methods, the approach shortens both the sample learning time and the extraction time.

Key words: tree automata inference algorithm, structured (semi-structured) documents, unranked tree automata, information extraction, KLH tree language
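
The accept/reject idea described in the abstract can be illustrated with a small sketch. It is not the KLH algorithm of the paper, whose details are not given here. It is a simplified local-pattern learner in the spirit of k-contextual inference, under these assumptions: documents are well-formed XML fragments, a "pattern" is a parent tag paired with a window of at most k consecutive child tags, and a tree is accepted when every one of its patterns was seen in the training samples. The function names patterns, learn and accepts are hypothetical.

```python
import xml.etree.ElementTree as ET


def patterns(node, k):
    """Yield (parent_tag, window_of_child_tags) for every node in the subtree.

    Assumption: a window is k consecutive child tags, or the whole child
    sequence when the node has at most k children (leaves give an empty tuple).
    """
    tags = tuple(child.tag for child in node)
    if len(tags) <= k:
        yield (node.tag, tags)
    else:
        for i in range(len(tags) - k + 1):
            yield (node.tag, tags[i:i + k])
    for child in node:
        yield from patterns(child, k)


def learn(samples, k=2):
    """Collect every pattern that occurs in the positive sample documents."""
    allowed = set()
    for doc in samples:
        allowed.update(patterns(ET.fromstring(doc), k))
    return allowed


def accepts(allowed, doc, k=2):
    """Accept a document only if all of its patterns were seen during learning."""
    return all(p in allowed for p in patterns(ET.fromstring(doc), k))


if __name__ == "__main__":
    allowed = learn(["<ul><li><b/><i/></li><li><b/><i/></li></ul>"])
    # A longer list with the same local structure is accepted (generalization).
    print(accepts(allowed, "<ul><li><b/><i/></li><li><b/><i/></li><li><b/><i/></li></ul>"))
    # Swapping the order of b and i inside an item produces an unseen pattern.
    print(accepts(allowed, "<ul><li><b/><i/></li><li><i/><b/></li></ul>"))
```

Because the learner only stores local windows, a single two-item sample already generalizes to lists of any greater length with the same item layout, while a document with an unseen arrangement is rejected. In an extraction setting, acceptance would mark the document (or subtree) as matching the learned rule.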
