Computer Engineering and Applications ›› 2007, Vol. 43 ›› Issue (2): 190-190.

• Database and Information Processing •

Automatic Conversion of HTML Tables into XML

ZHANG Rui, LI Shijun

  1. School of Computer Science, Wuhan University
  • Received: 2006-05-10 Online: 2007-01-11 Published: 2007-01-11
  • Corresponding author: LI Shijun (shjli)

Abstract: A large amount of information on the Web is presented in HTML tables. Because HTML does not describe the content of the data, machines cannot readily understand or query these tables. In this paper, we normalize HTML tables by inserting redundant cells according to the attributes of the tables. For HTML tables whose headings are not explicitly marked, we recognize the headings using a quantitative measure of formatting information. On this basis, we capture the semantic hierarchy that links attributes to their values, pairing the headings with their corresponding data cells in the normalized table, and present a new approach that automatically converts HTML table data into XML documents.

Key words: information extraction, XML, HTML table, Web
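
The abstract outlines two steps that a short example can make concrete: normalizing a table by inserting redundant cells, then pairing headings with data cells to emit XML. The Python sketch below is only an illustration under stated assumptions, not the authors' algorithm. It assumes that the table attributes in question are `rowspan` and `colspan`, that the first row is the heading row (the paper's heading recognition from formatting information is not attempted here), and that the output vocabulary (`table`, `record`, `field`) is acceptable; the names `TableParser`, `normalize`, `table_to_xml`, and `SAMPLE` are all hypothetical.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET


class TableParser(HTMLParser):
    """Collect every cell as (text, rowspan, colspan), grouped by row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # one list of cells per <tr>
        self._cell = None   # [text_parts, rowspan, colspan] while inside a cell

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th") and self.rows:
            a = dict(attrs)
            self._cell = [[], int(a.get("rowspan") or 1), int(a.get("colspan") or 1)]

    def handle_data(self, data):
        if self._cell is not None:
            self._cell[0].append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            parts, rowspan, colspan = self._cell
            self.rows[-1].append(("".join(parts).strip(), rowspan, colspan))
            self._cell = None


def normalize(rows):
    """Insert redundant cells: copy a spanning cell into every slot it covers."""
    grid = {}  # (row, col) -> text
    for r, row in enumerate(rows):
        c = 0
        for text, rowspan, colspan in row:
            while (r, c) in grid:  # slot already filled by a rowspan from above
                c += 1
            for dr in range(rowspan):
                for dc in range(colspan):
                    grid[(r + dr, c + dc)] = text
            c += colspan
    if not grid:
        return []
    n_rows = max(r for r, _ in grid) + 1
    n_cols = max(c for _, c in grid) + 1
    return [[grid.get((r, c), "") for c in range(n_cols)] for r in range(n_rows)]


def table_to_xml(html):
    """Pair each data cell with its heading (assumed: first row) and emit XML."""
    parser = TableParser()
    parser.feed(html)
    grid = normalize(parser.rows)
    root = ET.Element("table")
    if grid:
        header, body = grid[0], grid[1:]
        for values in body:
            record = ET.SubElement(root, "record")
            for name, value in zip(header, values):
                field = ET.SubElement(record, "field", name=name)
                field.text = value
    return ET.tostring(root, encoding="unicode")


# Made-up sample table: the "Alice" cell spans two rows.
SAMPLE = """
<table>
  <tr><th>Student</th><th>Term</th><th>Course</th></tr>
  <tr><td rowspan="2">Alice</td><td>Spring</td><td>Databases</td></tr>
  <tr><td>Fall</td><td>Networks</td></tr>
</table>
"""

if __name__ == "__main__":
    print(table_to_xml(SAMPLE))
```

After normalization the spanning cell is repeated in the second data row, so every record carries a complete set of attribute-value pairs. Heading names are stored in a `name` attribute rather than used as element names, since arbitrary heading text is not always a valid XML tag.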