Computer Engineering and Applications ›› 2015, Vol. 51 ›› Issue (6): 227-230.

• Engineering and Applications •

Discovery method of webpage subject area based on structural analysis

YI Zheng, XU Wuping, XU Aiping   

  1. Computer School of Wuhan University, Wuhan 430072, China
  • Online: 2015-03-15 Published: 2015-03-13

Abstract: As the Internet grows, Web data mining is becoming increasingly important for helping people obtain topic-specific information. This paper parses a webpage into a tag tree. Building on a tree matching algorithm, it proposes algorithms for data region mining and semantic link block identification, which together serve as a link-removal preprocessing step. It then introduces the concept of text structure weight, uses the computed weights to locate the subject area, and obtains the topic information after denoising. Experiments show that the method identifies the subject area well on news and blog webpages.

Key words: information extraction, subject area, text structure weight, denoising
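
The abstract names a tag tree and a tree matching algorithm but does not say which variant is used. As an illustration only, the following Python sketch parses HTML into a tag tree with the standard-library `html.parser` module and compares sibling subtrees with the classic Simple Tree Matching recurrence, which is an assumption here and not necessarily the algorithm of the paper. The names `Node`, `parse_tag_tree`, `simple_tree_matching`, and `similarity` are hypothetical, and the text structure weight itself is not reproduced because the abstract does not define it.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not stay on the open stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class Node:
    """One element of the tag tree (text content is ignored in this sketch)."""

    def __init__(self, tag):
        self.tag = tag
        self.children = []


class TagTreeBuilder(HTMLParser):
    """Builds a tag tree under a synthetic '#root' node."""

    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Node(tag)
        self.stack[-1].children.append(node)
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag; ignore stray end tags.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break


def parse_tag_tree(html):
    builder = TagTreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def simple_tree_matching(a, b):
    """Size of the maximum order-preserving matching between two tag trees."""
    if a.tag != b.tag:
        return 0
    m, n = len(a.children), len(b.children)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            dp[i][j] = max(
                dp[i - 1][j],
                dp[i][j - 1],
                dp[i - 1][j - 1]
                + simple_tree_matching(a.children[i - 1], b.children[j - 1]),
            )
    return dp[m][n] + 1  # +1 for the matched roots


def tree_size(node):
    return 1 + sum(tree_size(child) for child in node.children)


def similarity(a, b):
    """Matching size normalized to [0, 1] by the combined tree sizes."""
    return 2 * simple_tree_matching(a, b) / (tree_size(a) + tree_size(b))


if __name__ == "__main__":
    page = "<ul><li><a>x</a></li><li><a>y</a></li><li><p>z</p></li></ul>"
    ul = parse_tag_tree(page).children[0]
    first, second, third = ul.children
    print(similarity(first, second), similarity(first, third))
```

Under these assumptions, sibling subtrees with high mutual similarity (such as the repeated link items in a navigation list) would be candidates for a data region or link block, while how such regions are actually filtered and weighted is left to the full paper.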