Content extraction of theme pages based on body feature and page structure

Computer Engineering and Applications ›› 2012, Vol. 48 ›› Issue (30): 151-156.

DUAN Xiaoli¹, WANG Yu¹, GU Jing², LIU Weinan¹

1. School of Management, Dalian University of Technology, Dalian, Liaoning 116024, China
2. Department of Economics, Environmental Management College of China, Qinhuangdao, Hebei 066004, China

Online: 2012-10-21 Published: 2012-10-22

Abstract

Web text extraction is the foundation of Web information processing tasks such as information retrieval and text mining. Based on a statistical analysis of the body features and structural characteristics of theme pages, this paper proposes a text extraction method for theme pages that combines the features of Web page body text with the characteristics of HTML tags. The Web page is first parsed into a DOM tree, and the text content block is located from the structure of that tree. The characteristics of the noise inside the text content block are then analysed so that this in-block noise can be removed. Experiments show that the method achieves good accuracy and recall.

Key words: body feature, tag information, content extraction
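
The abstract names three steps (parse the page into a DOM tree, locate the text content block, remove noise inside that block) but gives no concrete rules. The following is only a minimal sketch of that pipeline under stated assumptions, not the authors' method: the scoring rule (non-link characters owned directly by a block), the link-density noise test, the 0.5 threshold, the candidate tag set, and all names such as `extract` and `TreeBuilder` are hypothetical choices made for illustration.

```python
from html.parser import HTMLParser

SKIP = {"script", "style", "noscript"}           # never counted as visible text
VOID = {"br", "img", "hr", "input", "meta", "link"}  # tags with no closing tag
BLOCKS = {"div", "td", "article", "section"}     # assumed candidate containers

class Node:
    def __init__(self, tag, parent=None):
        self.tag, self.parent, self.children, self.text = tag, parent, [], ""
        if parent is not None:
            parent.children.append(self)

class TreeBuilder(HTMLParser):
    """Builds a minimal DOM-like tree (step 1)."""
    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.cur = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.cur)
        if tag not in VOID:
            self.cur = node

    def handle_endtag(self, tag):
        node = self.cur
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:        # ignore stray closing tags
            self.cur = node.parent

    def handle_data(self, data):
        if data.strip():
            Node("#text", self.cur).text = data.strip()

def walk(node):
    yield node
    for child in node.children:
        yield from walk(child)

def texts(node, in_link=False, top=True, nested_blocks=True):
    """Yield (text, in_link) pairs under node, skipping scripts and styles."""
    if node.tag in SKIP or (not top and not nested_blocks and node.tag in BLOCKS):
        return
    in_link = in_link or node.tag == "a"
    if node.tag == "#text":
        yield node.text, in_link
    for child in node.children:
        yield from texts(child, in_link, False, nested_blocks)

def score(block):
    """Assumed rule: non-link characters not inside a nested candidate block."""
    return sum(len(t) for t, linked in texts(block, nested_blocks=False) if not linked)

def link_density(node):
    pairs = list(texts(node))
    total = sum(len(t) for t, _ in pairs)
    return sum(len(t) for t, linked in pairs if linked) / total if total else 1.0

def extract(html, max_link_density=0.5):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    # Step 2: pick the text content block from the tree structure.
    blocks = [n for n in walk(builder.root) if n.tag in BLOCKS]
    best = max(blocks, key=score, default=builder.root)
    # Step 3: drop link-heavy children (assumed to be in-block noise).
    kept = [c for c in best.children
            if c.tag == "#text" or link_density(c) <= max_link_density]
    return " ".join(t for c in kept for t, _ in texts(c))

page = """
<html><body>
<div id="nav"><a href="/">Home</a> <a href="/news">News</a></div>
<div id="content">
  <h1>Title of the story</h1>
  <p>The first paragraph of the story has plenty of plain text.</p>
  <p>The second paragraph mentions <a href="/x">a source</a> in passing.</p>
  <ul><li><a href="/a">Related story A</a></li><li><a href="/b">Related story B</a></li></ul>
</div>
<div id="footer">Copyright <a href="/about">About us</a></div>
</body></html>
"""
print(extract(page))
```

On this sample page the navigation and footer blocks lose to the content block, and the list of related links inside the content block is discarded because nearly all of its text is link text. A link that is a direct child of the chosen block would also be dropped by this simple test, which is one reason the thresholds and rules here should be treated as placeholders.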

DUAN Xiaoli, WANG Yu, GU Jing, LIU Weinan. Content extraction of theme pages based on body feature and page structure[J]. Computer Engineering and Applications, 2012, 48(30): 151-156.
