Web content extraction method based on logic lines and maximum admitting distances

doi:10.3778/j.issn.1002-8331.2009.25.038

Computer Engineering and Applications, 2009, Vol. 45, Issue (25): 125-128.

ZHANG Xia-liang¹, CHEN Jia-jun²

1. Software Institute, Nanjing University, Nanjing 210093, China
2. State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210093, China

Received: 2008-10-22 Revised: 2008-12-29 Online: 2009-09-01 Published: 2009-09-01
Contact: ZHANG Xia-liang

Abstract

Abstract: Web page content extraction is a basic task underlying many Internet applications and a problem that must be solved well. The mainstream methods are based on the DOM tree structure and therefore require parsing each page's DOM tree. Because the pages on today's Internet come from many sources and vary widely in structure, DOM-tree-based methods suffer not only from insufficient performance but also from problems with extraction precision. To address these problems, this paper proposes a new method for Web page content extraction. The method does not rely on the DOM tree. Instead, it derives heuristic rules from the way people write Web pages, combines them with relevant statistical regularities, takes the logic line as the basic processing unit, and extracts the page content based on maximum admitting distances. Experiments show that the method extracts Web page content efficiently and with high precision.

Key words: information extraction, Web content, logic lines, heuristic rules, maximum admitting distances
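
The abstract names the building blocks (logic lines, heuristic rules, maximum admitting distances) but does not define them, so the following Python sketch is only one possible reading and not the authors' algorithm. It assumes that a logic line is the text between block-level HTML tags, that link text does not count toward a line's length, that the longest line is a reasonable seed for the content, and that a line is admitted when it lies within a fixed number of logic lines of the last admitted one. The names `logic_lines`, `extract_content`, `min_len` and `max_admit_distance`, the tag list, and the default thresholds are all hypothetical.

```python
import re
from html import unescape

# Assumption: these block-level tags end a logic line; inline tags do not.
BLOCK_TAGS = r"(?:p|div|br|li|tr|td|th|table|ul|ol|h[1-6]|blockquote|pre|hr)"
BLOCK_BREAK = re.compile(r"</?" + BLOCK_TAGS + r"\b[^>]*>", re.IGNORECASE)
DROP = re.compile(r"<(script|style)\b.*?</\1\s*>|<!--.*?-->",
                  re.IGNORECASE | re.DOTALL)
ANCHOR = re.compile(r"<a\b[^>]*>.*?</a\s*>", re.IGNORECASE | re.DOTALL)
ANY_TAG = re.compile(r"<[^>]+>")


def _clean(fragment):
    """Strip remaining tags, decode entities, and collapse whitespace."""
    return " ".join(unescape(ANY_TAG.sub("", fragment)).split())


def logic_lines(html):
    """Split raw HTML into (text, non_link_length) pairs without a DOM tree."""
    html = DROP.sub("", html)              # scripts, styles, comments
    html = re.sub(r"\s+", " ", html)       # source line breaks carry no meaning
    html = BLOCK_BREAK.sub("\n", html)     # block tags delimit logic lines
    lines = []
    for raw in html.split("\n"):
        text = _clean(raw)
        if text:
            # Heuristic (assumed): link text is navigation, not content.
            lines.append((text, len(_clean(ANCHOR.sub("", raw)))))
    return lines


def extract_content(html, min_len=20, max_admit_distance=2):
    """Grow a content region around the longest logic line."""
    lines = logic_lines(html)
    is_text = [n >= min_len for _, n in lines]
    if not any(is_text):
        return ""
    seed = max(range(len(lines)), key=lambda i: lines[i][1])

    def grow(step):
        # Walk away from the seed; stop once the next line is farther than
        # max_admit_distance logic lines from the last admitted one.
        last, i = seed, seed + step
        while 0 <= i < len(lines) and abs(i - last) <= max_admit_distance:
            if is_text[i]:
                last = i
            i += step
        return last

    lo, hi = grow(-1), grow(+1)
    # Short lines enclosed by admitted lines (e.g. subheadings) are kept.
    return "\n".join(text for text, _ in lines[lo:hi + 1])
```

Calling `extract_content(page_source)` on a page's raw HTML returns the admitted logic lines joined by newlines. Working on tag-delimited lines with regular expressions avoids building a DOM tree, which is the property the abstract emphasizes; the thresholds would need tuning on real pages.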

CLC number: TP391

ZHANG Xia-liang, CHEN Jia-jun. Web content extraction method based on logic lines and maximum admitting distances[J]. Computer Engineering and Applications, 2009, 45(25): 125-128.

