Computer Engineering and Applications, 2009, Vol. 45, Issue (25): 125-128. DOI: 10.3778/j.issn.1002-8331.2009.25.038

• Database and Information Processing •

Web content extraction method based on logic lines and maximum admitting distances

ZHANG Xia-liang1, CHEN Jia-jun2

  1. Software Institute, Nanjing University, Nanjing 210000, China
  2. State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210093, China
  • Received: 2008-10-22 Revised: 2008-12-29 Online: 2009-09-01 Published: 2009-09-01
  • Contact: ZHANG Xia-liang

基于逻辑行和最大接纳距离的网页正文抽取

张霞亮1,陈家骏2   


Abstract: Content extraction from Web pages is a basic task underlying many Web applications and must be solved well. The mainstream methods are based on DOM trees and need to parse the DOM tree structure of each page. Because Web pages on the current Internet come from many sources and vary widely in structure, DOM-tree-based methods may suffer from low extraction precision and insufficient performance. To address these problems, this paper proposes a new method for extracting the content of Web pages. The method does not rely on DOM trees. Instead, it applies heuristic rules derived from people's habits when writing Web pages, combined with relevant statistical regularities. It takes logic lines as the basic processing units and uses maximum admitting distances to decide the final content of a page. Experiments show that the method extracts Web content quickly and accurately.
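
The abstract does not define logic lines or maximum admitting distances in detail, so the following Python sketch is only one possible reading of the idea and not the authors' algorithm. It assumes that a logic line is the text between two block-level tags, that the longest logic line belongs to the main content, and that a neighbouring line is admitted when it is long enough and lies within a fixed number of lines of the nearest line already admitted. The names `logic_lines` and `extract_content`, the tag list, and the parameters `min_len` and `max_distance` are all hypothetical.

```python
import re

# Assumption: these tags end a "logic line"; the paper may use a different set.
BLOCK_BREAK = re.compile(
    r"</?(?:p|div|br|tr|td|li|h[1-6]|table|ul|ol)\b[^>]*>", re.I
)
# Script and style bodies are never visible text, so drop them entirely.
NOISE = re.compile(r"<(script|style)\b.*?</\1\s*>", re.I | re.S)
TAG = re.compile(r"<[^>]+>")


def logic_lines(html):
    """Split raw HTML into non-empty text runs between block-level tags."""
    html = NOISE.sub("", html)
    lines = []
    for part in BLOCK_BREAK.split(html):
        # Remove remaining inline tags and collapse whitespace.
        text = re.sub(r"\s+", " ", TAG.sub("", part)).strip()
        if text:
            lines.append(text)
    return lines


def extract_content(html, min_len=20, max_distance=2):
    """Grow a content region outward from the longest logic line.

    min_len and max_distance are illustrative values, not from the paper.
    """
    lines = logic_lines(html)
    if not lines:
        return ""
    # Seed: assume the longest logic line is part of the main content.
    seed = max(range(len(lines)), key=lambda i: len(lines[i]))
    keep = {seed}
    # Scan upward, then downward. Scanning stops once the next candidate is
    # more than max_distance lines from the last admitted line.
    for step in (-1, 1):
        last = seed
        i = seed + step
        while 0 <= i < len(lines) and abs(i - last) <= max_distance:
            if len(lines[i]) >= min_len:
                keep.add(i)
                last = i
            i += step
    return "\n".join(lines[i] for i in sorted(keep))
```

Because the sketch works on the raw markup with regular expressions, it never builds a DOM tree, which matches the motivation stated in the abstract. Short lines that fall between two admitted lines are skipped here; whether the published method keeps them is not stated in the abstract.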

Key words: information extraction, Web content, logic lines, heuristic rules, maximum admitting distances

