Computer Engineering and Applications ›› 2010, Vol. 46 ›› Issue (20): 1-3. DOI: 10.3778/j.issn.1002-8331.2010.20.001

• Doctoral Forum •

Web content information extraction using density of feature text

WANG Shao-kang¹,², DONG Ke-jun¹, YAN Bao-ping¹

  1. Computer Network Information Center, Chinese Academy of Sciences, Beijing 100190, China
  2. Graduate School of Chinese Academy of Sciences, Beijing 100049, China
  • Received: 2010-03-17  Revised: 2010-05-17  Online: 2010-07-11  Published: 2010-07-11
  • Contact: WANG Shao-kang

Abstract: Web pages are becoming increasingly diverse, complex, and non-standardized, which makes information extraction more difficult. This paper proposes a web content extraction method based on the density of feature text. The method classifies the text on a page according to its usage and features, then constructs mathematical models to analyze text proportion and density, and thereby identifies the main content accurately. Its time and space complexity are both low. Experiments show that it extracts content effectively from complex web pages and from pages with multiple topic segments, and that it is widely applicable.

Key words: text density, text feature, information extraction, web page
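
The abstract does not give the paper's classification rules or its mathematical model, so the following is only a rough sketch of the general idea of separating main content from navigation by text proportion and density. It assumes a much simpler heuristic than the authors': each block of a page is scored by its length and by the share of its characters that sit inside links. The names `BlockCollector` and `extract_main_text`, the tag sets, and both thresholds are hypothetical choices made for illustration.

```python
from html.parser import HTMLParser

# Assumption: these tags delimit text blocks; the paper may segment differently.
BLOCK_TAGS = {"p", "div", "td", "li", "h1", "h2", "h3", "article", "section"}
SKIP_TAGS = {"script", "style"}


class BlockCollector(HTMLParser):
    """Split a page into text blocks and count link characters per block."""

    def __init__(self):
        super().__init__()
        self.blocks = []        # finished blocks as (text, link_chars)
        self._parts = []        # raw text pieces of the block being built
        self._link_chars = 0    # characters seen inside <a> in this block
        self._in_link = 0
        self._skip = 0

    def flush(self):
        # Close the current block, normalizing whitespace.
        text = " ".join("".join(self._parts).split())
        if text:
            self.blocks.append((text, self._link_chars))
        self._parts, self._link_chars = [], 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip += 1
        elif tag == "a":
            self._in_link += 1
        elif tag in BLOCK_TAGS:
            self.flush()

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self._skip = max(0, self._skip - 1)
        elif tag == "a":
            self._in_link = max(0, self._in_link - 1)
        elif tag in BLOCK_TAGS:
            self.flush()

    def handle_data(self, data):
        if self._skip:
            return
        self._parts.append(data)
        if self._in_link:
            self._link_chars += len(" ".join(data.split()))


def extract_main_text(html, min_chars=40, max_link_ratio=0.3):
    """Keep blocks that are long enough and mostly plain (non-link) text."""
    collector = BlockCollector()
    collector.feed(html)
    collector.close()
    collector.flush()
    kept = []
    for text, link_chars in collector.blocks:
        link_ratio = link_chars / len(text)
        if len(text) >= min_chars and link_ratio <= max_link_ratio:
            kept.append(text)
    return "\n".join(kept)
```

Under these assumptions, a navigation bar made almost entirely of links is dropped for its high link ratio, and a short footer is dropped for its length, while long paragraphs of plain text are kept. The thresholds would need tuning on real pages.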
