计算机工程与应用 (Computer Engineering and Applications) ›› 2017, Vol. 53 ›› Issue (7): 74-78. DOI: 10.3778/j.issn.1002-8331.1509-0226

• Big Data and Cloud Computing •

Web page information tag extraction based on structural consistency and feature learning

杜博远1,王美清1,陈长福2,陈  飞1   

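  1. College of Mathematics and Computer Science, Fuzhou University, Fuzhou 350000
  2. Fujian Ecallcen Information Technology Co., Ltd., Fuzhou 350000
  • Publication date: 2017-04-01 Release date: 2017-04-01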

Tags extraction for Web information based on structure consistency and feature learning

DU Boyuan1, WANG Meiqing1, CHEN Changfu2, CHEN Fei1   

  1. College of Mathematics and Computer Science, Fuzhou University, Fuzhou 350000, China
  2. Fujian Ecallcen Information Technology Co., Ltd., Fuzhou 350000, China
  • Online: 2017-04-01 Published: 2017-04-01

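Abstract: Web page information refers to a page's main text, title, publication time, source media and similar items. Each item resides in specific tags of the HTML document. Obtaining these tags automatically makes it possible to extract the information automatically from every page built on the same template, which greatly helps large-scale crawling of Web content. Because pages under the same template share a consistent structure, and Web page information has certain statistical features, this paper proposes an algorithm that automatically extracts Web page information tags based on structure comparison and feature learning. The algorithm comprises three steps: page comparison, content identification and tag extraction. In tests on 1 620 Web pages across 51 templates, the experimental results show that obtaining Web page information through extracted tags is not only fast but also captures the content more accurately.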

Key words: Web page tags, information extraction, feature learning, structural consistency

Abstract: Web information refers to the key content of a Web page, which usually includes the main body, title, release date and release media. Each item is held in particular HTML tags. Identifying these tags automatically makes it possible to extract Web information from all pages that share the same Web template, which is a great help for crawling content from a large number of Web pages. Because pages built on the same template have a consistent structure, and Web information exhibits certain statistical features, this paper proposes an algorithm that automatically extracts the tags of Web information based on structure consistency and feature learning. The algorithm consists of three steps: Web page comparison, content identification and tag extraction. Experimental results on 1 620 Web pages from 51 Web templates show that the proposed algorithm extracts Web information with both high speed and high accuracy.

Key words: Website tags, information extraction, feature learning, structure consistency
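
The three steps named in the abstract (page comparison, content identification, tag extraction) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract gives no algorithmic detail, so the sketch assumes that two pages from one template are compared by exact tag path, and it replaces the learned statistical features with a few hand-written rules. All names (PathTextParser, varying_paths, label_fields, extract) and the sample template are hypothetical.

```python
import re
from html.parser import HTMLParser

VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}
DATE_RE = re.compile(r"\d{4}-\d{1,2}-\d{1,2}")


class PathTextParser(HTMLParser):
    """Collects visible text keyed by the tag path that contains it."""

    def __init__(self):
        super().__init__()
        self.stack = []   # open elements as (tag, label), outermost first
        self.texts = {}   # tag path -> concatenated text

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return  # void elements never receive a closing tag
        attrs = dict(attrs)
        label = tag
        if attrs.get("id"):
            label += "#" + attrs["id"]
        if attrs.get("class"):
            label += "." + attrs["class"].replace(" ", ".")
        self.stack.append((tag, label))

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open tag; ignore stray closers.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        if not text or not self.stack:
            return
        if self.stack[-1][0] in ("script", "style"):
            return  # not visible content
        path = "/".join(label for _, label in self.stack)
        self.texts[path] = (self.texts.get(path, "") + " " + text).strip()


def text_by_path(html):
    parser = PathTextParser()
    parser.feed(html)
    parser.close()
    return parser.texts


def varying_paths(page_a, page_b):
    """Step 1 (assumed): paths present in both pages whose text differs."""
    a, b = text_by_path(page_a), text_by_path(page_b)
    return {p: a[p] for p in a if p in b and a[p] != b[p]}


def label_fields(candidates):
    """Step 2 (assumed): hand-written rules standing in for learned features."""
    fields = {}
    rest = dict(candidates)
    for path, text in candidates.items():
        if DATE_RE.search(text) and len(text) < 40:
            fields["time"] = path      # short text containing a date
            del rest[path]
            break
    if rest:
        body = max(rest, key=lambda p: len(rest[p]))
        fields["body"] = body          # longest remaining text
        del rest[body]
    if rest:
        fields["title"] = min(rest, key=lambda p: len(rest[p]))
    return fields


def extract(page, fields):
    """Step 3 (assumed): reuse the tag paths on another page of the template."""
    texts = text_by_path(page)
    return {name: texts.get(path, "") for name, path in fields.items()}


# Hypothetical template and pages, used only to exercise the sketch.
TEMPLATE = (
    "<html><body><div id='nav'>Home | News</div>"
    "<h1 class='title'>{title}</h1>"
    "<span class='time'>{time}</span>"
    "<div id='content'><p>{body}</p></div>"
    "<div id='footer'>Contact us</div></body></html>"
)
page_a = TEMPLATE.format(title="First headline", time="2017-04-01",
                         body="Body text of the first article, long enough to stand out.")
page_b = TEMPLATE.format(title="Second headline", time="2017-04-02",
                         body="Body text of the second article, also fairly long.")
page_c = TEMPLATE.format(title="Third headline", time="2017-04-03",
                         body="Body text of the third article.")

fields = label_fields(varying_paths(page_a, page_b))
result = extract(page_c, fields)
print(fields)
print(result)
```

The navigation bar and footer are identical across the two sample pages, so the comparison step discards them and only the title, time and body paths survive. A real crawler would need more tolerant path matching (for example, repeated sibling elements) and features learned from labeled pages rather than the fixed rules used here.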