Computer Engineering and Applications ›› 2009, Vol. 45 ›› Issue (6): 165-167. DOI: 10.3778/j.issn.1002-8331.2009.06.046

• Database, Signal and Information Processing •

Semi-structure page information extraction algorithm with automatic granularity selection

WANG Xiao-bin,WANG Peng-po,SHI Zhao-xiang   

  1. Department of Network Engineering, Electronic Engineering Institute, Hefei 230037, China
  • Received: 2008-01-14 Revised: 2008-04-15 Online: 2009-02-21 Published: 2009-02-21
  • Contact: WANG Xiao-bin

Semi-structured page information extraction with automatic granularity selection

王晓斌,王鹏坡,石昭祥   

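  1. Teaching and Research Section 602, Department of Network Engineering, PLA Electronic Engineering Institute, Hefei 230037
  • Corresponding author: WANG Xiao-bin (王晓斌)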

Abstract: Data records in a semi-structured Web page are similar in structure. This similarity appears as repeated tag strings in the tag sequence produced by a preorder traversal of the DOM tree, and such repeats can generally be mined by constructing a suffix tree. Because the tag sequence can be generated at either the block-tag level or the text-tag level, and it is uncertain which granularity yields the better extraction pattern, this paper introduces a semi-structured information extraction algorithm with automatic granularity selection. The algorithm first generates a candidate pattern set for each granularity by searching for maximal repeats and tandem repeats in the respective suffix trees, and then evaluates these patterns with statistical metrics. A new metric, the regularity of the extraction result, and a weighted evaluation approach are introduced to select the target pattern.

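Abstract (translated from Chinese): The data records of a semi-structured page are structurally similar. In the tag sequence generated by a preorder traversal of the DOM tree, this similarity shows up as repeated patterns, which can be mined with a suffix tree. Because the tag sequence can be expressed at two levels, block granularity and text granularity, and it is uncertain which granularity's best extraction pattern gives the better extraction result, a semi-structured page information extraction method with automatic granularity selection is proposed. From the repeated patterns obtained from the suffix tree, the algorithm selects maximal repeats and tandem repeats to form the candidate pattern set, and uses feature parameters to determine the best pattern set for each of the two granularities. Finally, it introduces a regularity parameter for the extraction result and performs a comprehensive evaluation to determine the extraction pattern, completing the automatic extraction of data records from the semi-structured page.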
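
The first two steps described in the abstract, generating a tag sequence at two granularities and finding repeated tag strings in it, might be sketched as follows. This is only an illustration under stated assumptions, not the authors' implementation: the `BLOCK_TAGS` and `TEXT_TAGS` sets are hypothetical (the abstract does not say which tags belong to each level), and a naive n-gram scan stands in for the suffix tree, so it finds all repeats rather than only maximal ones. The pattern evaluation metrics and the regularity measure are not specified in the abstract and are not modelled here.

```python
from collections import defaultdict
from html.parser import HTMLParser

# Hypothetical tag sets: the abstract does not list which tags count as
# block-level or text-level, so these are illustrative assumptions.
BLOCK_TAGS = {"div", "table", "tr", "td", "ul", "ol", "li", "p", "dl", "dt", "dd"}
TEXT_TAGS = {"a", "b", "i", "em", "strong", "span", "font"}


class TagSequencer(HTMLParser):
    """Collect start tags in document order. For well-formed markup this
    matches the order of a preorder traversal of the DOM tree."""

    def __init__(self, allowed):
        super().__init__()
        self.allowed = allowed
        self.sequence = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names before calling this method.
        if tag in self.allowed:
            self.sequence.append(tag)


def tag_sequence(html, granularity):
    """Return the tag sequence at 'block' or 'text' granularity."""
    allowed = BLOCK_TAGS if granularity == "block" else BLOCK_TAGS | TEXT_TAGS
    parser = TagSequencer(allowed)
    parser.feed(html)
    parser.close()
    return parser.sequence


def repeated_patterns(seq, min_len=2, min_count=2):
    """Naive stand-in for suffix-tree mining: map every tag n-gram that
    occurs at least min_count times to its sorted start positions."""
    found = {}
    for n in range(min_len, len(seq) // 2 + 1):
        positions = defaultdict(list)
        for i in range(len(seq) - n + 1):
            positions[tuple(seq[i:i + n])].append(i)
        for pattern, starts in positions.items():
            if len(starts) >= min_count:
                found[pattern] = starts
    return found


def is_tandem(pattern, starts):
    """True if some occurrence is immediately followed by another one."""
    start_set = set(starts)
    return any(s + len(pattern) in start_set for s in starts)


SAMPLE = (
    "<ul>"
    "<li><a>x</a><span>1</span></li>"
    "<li><a>y</a><span>2</span></li>"
    "<li><a>z</a><span>3</span></li>"
    "</ul>"
)

if __name__ == "__main__":
    for level in ("block", "text"):
        seq = tag_sequence(SAMPLE, level)
        print(level, seq)
        for pattern, starts in repeated_patterns(seq).items():
            print("  ", pattern, starts, "tandem" if is_tandem(pattern, starts) else "")
```

On the sample list, the text-level sequence exposes the record pattern `li, a, span` as a tandem repeat, while the block-level sequence only shows runs of `li`. This is the kind of difference between granularities that the abstract's selection step has to weigh.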