Computer Engineering and Applications ›› 2012, Vol. 48 ›› Issue (30): 151-156.

Content extraction of theme pages based on body feature and page structure

DUAN Xiaoli1, WANG Yu1, GU Jing2, LIU Weinan1   

  1. School of Management, Dalian University of Technology, Dalian, Liaoning 116024, China
  2. Department of Economics, Environmental Management College of China, Qinhuangdao, Hebei 066004, China
  • Online: 2012-10-21 Published: 2012-10-22

基于正文特征及网页结构的主题网页信息抽取

段晓丽1,王  宇1,谷  静2,刘玮楠1   

  1. 大连理工大学 管理科学与工程学院,辽宁 大连 116024
  2. 中国环境管理干部学院 经济学系,河北 秦皇岛 066004

Abstract: Web body-text extraction is the foundation of Web information processing tasks such as information retrieval and text mining. Based on a statistical analysis of the body-text features and structural characteristics of theme pages, this paper proposes a body-text extraction method for theme pages that combines the features of Web page body text with the characteristics of HTML tags. Each Web page is parsed into a DOM tree, and the body-text block is located from the structure of that tree. The characteristics of the noise inside the block are then analysed so that the noise can be removed. Experiments show that the method achieves good precision and recall.
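
The two stages named in the abstract (locate the body-text block in the DOM tree, then remove noise inside that block) can be illustrated with a minimal Python sketch. The abstract does not give the paper's actual features or thresholds, so everything specific here is an assumption made for illustration: the block score (non-link text length minus link text length), the link-density noise test, the 0.5 threshold, and all names such as `find_content_block` and `clean_text`. It is not the authors' implementation.

```python
from html.parser import HTMLParser

SKIP_TAGS = {"script", "style"}                       # content never shown to readers
VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}
BLOCK_TAGS = {"div", "td", "article", "section", "body"}  # assumed candidate containers


class Node:
    """A minimal DOM node: children are Node objects or text strings."""

    def __init__(self, tag, parent=None):
        self.tag, self.parent, self.children = tag, parent, []

    def text(self):
        return "".join(c if isinstance(c, str) else c.text() for c in self.children)

    def link_text_len(self):
        if self.tag == "a":
            return len(self.text())
        return sum(c.link_text_len() for c in self.children if isinstance(c, Node))

    def walk(self):
        yield self
        for c in self.children:
            if isinstance(c, Node):
                yield from c.walk()


class TreeBuilder(HTMLParser):
    """Builds a simplified DOM tree, skipping script and style content."""

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.current = self.root
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif not self.skip_depth and tag not in VOID_TAGS:
            node = Node(tag, self.current)
            self.current.children.append(node)
            self.current = node

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif not self.skip_depth and tag not in VOID_TAGS:
            node = self.current          # close up to the nearest matching open tag
            while node is not self.root and node.tag != tag:
                node = node.parent
            if node is not self.root:
                self.current = node.parent

    def handle_data(self, data):
        if not self.skip_depth and data.strip():
            self.current.children.append(data.strip() + " ")


def find_content_block(root):
    """Stage 1 (assumed heuristic): pick the block with most non-link text."""
    best, best_score = None, float("-inf")
    for node in root.walk():
        if node.tag in BLOCK_TAGS:
            score = len(node.text()) - 2 * node.link_text_len()
            if score >= best_score:      # ties go to the later, deeper node
                best, best_score = node, score
    return best


def clean_text(node, max_link_density=0.5):
    """Stage 2 (assumed heuristic): drop link-dominated child elements."""
    parts = []
    for c in node.children:
        if isinstance(c, str):
            parts.append(c)
        elif c.tag == "a":               # keep links that sit directly in kept text
            parts.append(c.text())
        else:
            total = len(c.text())
            density = c.link_text_len() / total if total else 1.0
            if density <= max_link_density:
                parts.append(clean_text(c, max_link_density))
    return "".join(parts)


def extract_main_text(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    block = find_content_block(builder.root)
    return " ".join(clean_text(block).split()) if block else ""
```

Calling `extract_main_text(page_source)` on a page whose article sits in one `div`, next to a navigation bar and a list of related links, should return the article paragraphs without the navigation and related-link text. Real pages would need tuned thresholds and richer features than link density alone.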

Key words: body feature, tag information, content extraction

摘要: Web正文信息抽取是信息检索、文本挖掘等Web信息处理工作的基础。在统计分析了主题网页的正文特征及结构特征的基础上,提出了一种结合网页正文信息特征及HTML标签特点的主题网页正文信息抽取方法。在将Web页面解析成DOM树的基础上,根据页面DOM树结构获取正文信息块,分析正文信息块块内噪音信息的特点,去除块内噪音信息。实验证明,这种方法具有很好的准确率及召回率。

关键词: 正文特征, 标签信息, 正文抽取