Computer Engineering and Applications ›› 2017, Vol. 53 ›› Issue (7): 109-114. DOI: 10.3778/j.issn.1002-8331.1510-0030

Research on a method for obtaining Deep Web interfaces

YANG Yonghong 1, GAO Lei 1, YU Hang 2, XU Xinchen 2

  1. Exploration & Development Research Institute, Shengli Oilfield Branch Company, SINOPEC, Dongying, Shandong 257000, China
  2. College of Computer Engineering and Science, Shanghai University, Shanghai 200444, China
  • Online: 2017-04-01 Published: 2017-04-01

Research on automatic identification technology for Deep Web interfaces

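YANG Yonghong 1, GAO Lei 1, YU Hang 2, XU Xinchen 2

  1. Exploration and Development Research Institute, Sinopec Shengli Oilfield Company, Dongying, Shandong 257000, China
  2. School of Computer Engineering and Science, Shanghai University, Shanghai 200444, China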

Abstract: Most existing methods for obtaining Deep Web interfaces extract the <form></form> tags in a page and then judge whether each form is a Deep Web query interface. This paper proposes the concept of an interface block and locates the interface block within a page on the basis of visual information. By extracting appropriate structural features of the form and applying a classification algorithm that combines a C4.5 decision tree with an SVM, the query interface within the interface block is identified. Experiments on the UIUC TEL-8 data set indicate that the method reaches an accuracy of 97.30% and is feasible and practical.
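The feature-extraction step mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not list the features the authors use, so the feature set below (form method plus counts of a few control types) is an assumption chosen only for illustration, and the names `FormFeatureExtractor` and `extract_form_features` are hypothetical. The sketch works on raw <form></form> tags rather than on interface blocks, and it stops at the feature vectors; the combined C4.5 and SVM classifier is not reproduced here.

```python
from html.parser import HTMLParser

# Input types counted as separate features (an illustrative choice, not the paper's list).
INPUT_KINDS = {"text", "password", "checkbox", "radio", "submit"}


class FormFeatureExtractor(HTMLParser):
    """Collect simple structural counts for each <form> element in a page."""

    def __init__(self):
        super().__init__()
        self.forms = []       # one feature dict per completed <form>
        self._current = None  # features of the form currently being parsed

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            # Start a new feature record; the HTML default method is GET.
            self._current = {"method": (attrs.get("method") or "get").lower(),
                             "select": 0}
            for kind in INPUT_KINDS:
                self._current[kind] = 0
        elif self._current is not None:
            if tag == "select":
                self._current["select"] += 1
            elif tag == "input":
                # An <input> without a type attribute behaves as a text field.
                kind = (attrs.get("type") or "text").lower()
                if kind in INPUT_KINDS:
                    self._current[kind] += 1

    def handle_endtag(self, tag):
        if tag == "form" and self._current is not None:
            self.forms.append(self._current)
            self._current = None


def extract_form_features(html):
    """Return a list of feature dicts, one per form found in the HTML string."""
    parser = FormFeatureExtractor()
    parser.feed(html)
    parser.close()
    return parser.forms


SAMPLE = """
<form action="/search" method="get">
  <input type="text" name="title">
  <select name="genre"><option>Any</option></select>
  <input type="submit" value="Search">
</form>
<form action="/login" method="post">
  <input type="text" name="user">
  <input type="password" name="pw">
  <input type="submit" value="Log in">
</form>
"""

features = extract_form_features(SAMPLE)
for f in features:
    print(f)
```

In this toy page the first form looks like a search interface (a text box plus a select list), while the second has a password field and is more likely a login form. Vectors of this kind are what a trained classifier would take as input.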

Key words: Deep Web interface, Document Object Model (DOM) tree, interface block, multi-class classification

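Abstract (Chinese version): Information in the Deep Web is obtained mainly by submitting queries through the query interfaces it provides. Most current research uses the <form></form> tags to obtain the content structure of a form and then judges whether the form is a Deep Web query interface. This paper proposes the concept of an interface block and designs a method for locating interface blocks based on page information and visual information. Deciding whether an interface block is a Deep Web interface is then treated as a pattern-recognition classification problem: appropriate structural features of the form are extracted, and a classification algorithm combining a C4.5 decision tree with an SVM is used to judge each interface block, yielding the Deep Web query interfaces contained in the page. Experiments on the UIUC TEL-8 data set show that the method reaches an accuracy of 97.30% and is feasible and practical.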

Key words (Chinese version): Deep Web interface, Document Object Model tree, interface block, multi-class classification