Computer Engineering and Applications ›› 2008, Vol. 44 ›› Issue (9): 156-158.

• Database, Signal and Information Processing •

Blog information extraction based on template

SHI Da-ming, LIN Hong-fei, ZHAO Jing

  1. Department of Computer Science and Engineering, Dalian University of Technology, Dalian, Liaoning 116024, China
  • Received: 2007-06-05 Revised: 2007-11-22 Online: 2008-03-21 Published: 2008-03-21
  • Contact: SHI Da-ming

Abstract: A blog can be described as an online personal diary. As an emerging medium, the blog has become a very popular way to express personal opinions and emotions on the Web. Extracting useful information from blogs (topic posting time, topic title, topic content, comments, etc.) quickly and accurately has therefore become an important step in blog applications. This paper presents a template-based approach to blog information extraction: it analyzes the HTML source code of a blog web site, derives the site's template, and then extracts information from that site's blog pages according to the template. Templates were extracted for 10 well-known blog web sites in China, and experiments on 7,374 blog pages from these 10 sites show that the approach extracts information from blog pages quickly and accurately using the extracted templates.

Key words: Blog, information extraction, template
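
The abstract does not give the template format, so the following is only a minimal sketch of the general idea of template-driven extraction, not the authors' method. It assumes a template can be written as a pair of literal HTML markers around each field. The names `EXAMPLE_TEMPLATE`, `strip_tags`, and `extract`, and the markers themselves, are hypothetical.

```python
import html
import re

# Hypothetical template for one blog site: each field maps to the literal
# HTML that precedes and follows it in that site's page layout.
# These markers are illustrative and are not taken from the paper.
EXAMPLE_TEMPLATE = {
    "post_time": ('<span class="date">', "</span>"),
    "title": ('<h2 class="title">', "</h2>"),
    "content": ('<div class="post">', "</div>"),
    "comments": ('<li class="comment">', "</li>"),
}


def strip_tags(fragment):
    """Remove nested tags, decode HTML entities, and trim whitespace."""
    return html.unescape(re.sub(r"<[^>]+>", "", fragment)).strip()


def extract(page, template):
    """Return every match for each template field, in page order."""
    result = {}
    for field, (start, end) in template.items():
        # Non-greedy match of whatever lies between the two markers.
        pattern = re.escape(start) + r"(.*?)" + re.escape(end)
        matches = re.findall(pattern, page, re.DOTALL)
        result[field] = [strip_tags(m) for m in matches]
    return result


if __name__ == "__main__":
    sample = (
        '<h2 class="title">Hello &amp; welcome</h2>'
        '<span class="date">2007-06-05</span>'
        '<div class="post"><p>First post.</p></div>'
        '<ul><li class="comment">Nice!</li>'
        '<li class="comment">Thanks</li></ul>'
    )
    for field, values in extract(sample, EXAMPLE_TEMPLATE).items():
        print(field, values)
```

Under this assumption, one template per site is enough to process all pages that share the site's layout. The sketch matches the first closing marker it finds, so a field that contains a nested element of the same kind (for example a `<div>` inside the post body) would be cut short; a real system would need a more robust notion of template.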