Web content extraction method based on logic lines and maximum admitting distances

doi:10.3778/j.issn.1002-8331.2009.25.038

Computer Engineering and Applications, 2009, Vol. 45, Issue (25): 125-128. DOI: 10.3778/j.issn.1002-8331.2009.25.038

• Database and Information Processing •


ZHANG Xia-liang¹, CHEN Jia-jun²

1. Software Institute, Nanjing University, Nanjing 210000, China
2. State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210093, China

Received: 2008-10-22 Revised: 2008-12-29 Online: 2009-09-01 Published: 2009-09-01
Contact: ZHANG Xia-liang


Abstract

Abstract: Extracting the main content of Web pages is a basic task underlying many Web applications and must be solved well. The mainstream methods are based on DOM trees and therefore need to parse each page into its DOM tree structure. Because Web pages on today's Internet come from many sources and vary widely in structure, DOM-tree-based methods can suffer from low extraction precision as well as insufficient performance. To address these problems, this paper proposes a new method for extracting the content of Web pages. The method does not rely on DOM trees. Instead, it applies heuristic rules derived from the way people habitually write Web pages, combined with relevant statistical regularities. It takes logic lines as the basic processing units and uses maximum admitting distances to decide the final content of a page. Experiments show that the method extracts Web content quickly and accurately.

Key words: information extraction, Web content, logic lines, heuristic rules, maximum admitting distances
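
The abstract describes the method only at a high level, so the following Python sketch is one possible reading of it and not the authors' algorithm. It splits the raw HTML into logic lines at block-level tags without building a DOM tree, picks the longest line as a seed, and then admits further text-rich lines as long as they lie within a maximum distance of the last admitted line. The tag list, the seed rule, the function names `logical_lines` and `extract_content`, and the parameters `min_len` and `max_distance` are all assumptions made for illustration.

```python
import re
from html import unescape

# Assumed set of block-level tags whose boundaries end a logic line.
BLOCK_TAGS = r"(?:p|div|br|li|tr|td|h[1-6]|table|ul|ol)"
BLOCK_BREAK = re.compile(r"</?" + BLOCK_TAGS + r"\b[^>]*>", re.IGNORECASE)
SCRIPT_STYLE = re.compile(r"<(script|style)\b.*?</\1\s*>",
                          re.IGNORECASE | re.DOTALL)
ANY_TAG = re.compile(r"<[^>]+>")


def logical_lines(html):
    """Split raw HTML into non-empty logic lines of visible text (no DOM)."""
    html = SCRIPT_STYLE.sub("", html)        # drop script and style bodies
    html = re.sub(r"\s+", " ", html)         # ignore source line breaks
    html = BLOCK_BREAK.sub("\n", html)       # block tags delimit logic lines
    lines = []
    for raw in html.split("\n"):
        text = " ".join(unescape(ANY_TAG.sub("", raw)).split())
        if text:
            lines.append(text)
    return lines


def extract_content(html, min_len=20, max_distance=2):
    """Grow a content region around the longest logic line.

    A line is admitted if it has at least min_len characters and lies at
    most max_distance logic lines from the last admitted line (hypothetical
    thresholds, not values taken from the paper).
    """
    lines = logical_lines(html)
    if not lines:
        return ""
    seed = max(range(len(lines)), key=lambda i: len(lines[i]))
    keep = {seed}
    for step in (1, -1):                     # scan downward, then upward
        last = seed
        i = seed + step
        while 0 <= i < len(lines) and abs(i - last) <= max_distance:
            if len(lines[i]) >= min_len:
                keep.add(i)
                last = i
            i += step
    return "\n".join(lines[i] for i in sorted(keep))


if __name__ == "__main__":
    page = (
        "<html><body><div><a href='/'>Home</a> | <a href='/n'>News</a></div>"
        "<p>First paragraph of the article body, long enough to count as "
        "content.</p><p>Short.</p>"
        "<p>Second paragraph of the article, also carrying real text.</p>"
        "<div>Ad</div><div>Ad</div><div>Ad</div>"
        "<div>Copyright 2009 Example Site. All rights reserved.</div>"
        "</body></html>"
    )
    print(extract_content(page))
```

In this sketch the navigation bar and the footer are rejected for different reasons: the navigation line is too short to be admitted, while the footer is long enough but lies more than `max_distance` lines beyond the last admitted paragraph.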


CLC Number: TP391

ZHANG Xia-liang, CHEN Jia-jun. Web content extraction method based on logic lines and maximum admitting distances[J]. Computer Engineering and Applications, 2009, 45(25): 125-128.


