Tags extraction for Web information based on structure consistency and feature learning

doi:10.3778/j.issn.1002-8331.1509-0226

Computer Engineering and Applications ›› 2017, Vol. 53 ›› Issue (7): 74-78. DOI: 10.3778/j.issn.1002-8331.1509-0226

Tags extraction for Web information based on structure consistency and feature learning

DU Boyuan¹, WANG Meiqing¹, CHEN Changfu², CHEN Fei¹

1. College of Mathematics and Computer Science, Fuzhou University, Fuzhou 350000, China
2. Fujian Ecallcen Information Technology Co., Ltd., Fuzhou 350000, China

Online: 2017-04-01 Published: 2017-04-01

Abstract

Abstract: Web information refers to the key content of a Web page, such as its main body, title, release date and publishing media. Each item resides in specific tags of the HTML document. Extracting these tags automatically makes it possible to extract the same information from every page built on the same template, which greatly helps large-scale crawling of Web content. Because pages that share a template have a consistent structure, and Web information exhibits certain statistical features, this paper proposes an algorithm that automatically extracts Web information tags based on structure comparison and feature learning. The algorithm consists of three steps: Web page comparison, content identification and tag extraction. In tests on 1,620 Web pages from 51 templates, the experimental results show that obtaining Web information through extracted tags is not only fast but also yields more accurate content.

Key words: Web page tags, information extraction, feature learning, structure consistency
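
The abstract names the three steps but gives no implementation details, so the following is only a minimal Python sketch of how such a pipeline might look, not the authors' method. Everything in it is an assumption made for illustration: the tag-path format, the helper names (`collect`, `varying_paths`, `body_score`, `extract_body_tag`), and above all the hand-made scoring rule, which merely stands in for the feature learning described in the paper.

```python
from html.parser import HTMLParser

# Tags without a closing tag; skipping them keeps the path stack balanced.
VOID_TAGS = {"br", "img", "meta", "link", "input", "hr"}


class PathTextCollector(HTMLParser):
    """Map each tag path (e.g. 'html/body/div.content') to the text under it."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.texts = {}

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        cls = dict(attrs).get("class")
        self.stack.append(f"{tag}.{cls}" if cls else tag)

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.stack and self.stack[-1].split(".")[0] == tag:
            self.stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if text and self.stack:
            path = "/".join(self.stack)
            self.texts[path] = (self.texts.get(path, "") + " " + text).strip()


def collect(html):
    parser = PathTextCollector()
    parser.feed(html)
    parser.close()
    return parser.texts


def varying_paths(page_a, page_b):
    """Step 1 (page comparison): shared paths whose text differs between pages."""
    a, b = collect(page_a), collect(page_b)
    return {p: (a[p], b[p]) for p in a.keys() & b.keys() if a[p] != b[p]}


def body_score(text):
    """Step 2 (content identification): hypothetical stand-in for learned features."""
    punctuation = sum(text.count(c) for c in ",.;!?，。；！？")
    return len(text) + 20 * punctuation


def extract_body_tag(page_a, page_b):
    """Step 3 (tag extraction): the varying path that looks most like body text."""
    candidates = varying_paths(page_a, page_b)
    if not candidates:
        return None
    return max(candidates, key=lambda p: sum(body_score(t) for t in candidates[p]))


# Two made-up pages that share one template.
PAGE_A = """<html><body><div class="nav">Home News</div>
<h1 class="title">First headline</h1>
<div class="content">Rain fell all day. Roads were closed, and trains ran late.</div>
<div class="footer">Copyright</div></body></html>"""

PAGE_B = """<html><body><div class="nav">Home News</div>
<h1 class="title">Second headline</h1>
<div class="content">The market opened higher. Analysts, however, urged caution.</div>
<div class="footer">Copyright</div></body></html>"""

print(extract_body_tag(PAGE_A, PAGE_B))
```

In this sketch the navigation bar and footer are discarded because their text is identical on both pages, while the title and body survive as candidates; the scoring rule then prefers the longer, more punctuated text. Once a path has been found for one template, it could be reused to pull the same field from other pages of that template without repeating the comparison.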

DU Boyuan, WANG Meiqing, CHEN Changfu, CHEN Fei. Tags extraction for Web information based on structure consistency and feature learning[J]. Computer Engineering and Applications, 2017, 53(7): 74-78.
