Computer Engineering and Applications ›› 2008, Vol. 44 ›› Issue (20): 165-168. DOI: 10.3778/j.issn.1002-8331.2008.20.050

• Database, Signal and Information Processing •

Study of topic sentiment sentences auto-extraction in Chinese blogs

SUN Hong-gang, LU Yu-liang

  1. No. 604 Lab, Hefei Electronic Engineering Institute, Hefei 230037, China
  • Received: 2007-09-26 Revised: 2007-12-21 Online: 2008-07-11 Published: 2008-07-11
  • Contact: SUN Hong-gang

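Chinese title: 中文博客主题情感句自动抽取研究

孙宏纲 (SUN Hong-gang), 陆余良 (LU Yu-liang)

  1. Lab 604, Hefei Electronic Engineering Institute (合肥电子工程学院 604实验室), Hefei 230037
  • Corresponding author: 孙宏纲 (SUN Hong-gang)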

Abstract: In Chinese blog sentiment analysis, previous research has concentrated on the polarity of individual words. However, not every analyzed word is relevant to the topic, and the word is too fine a granularity for sentiment analysis. We instead use sentiment sentences, a sentence-level model, and this paper focuses solely on the automatic extraction of topic sentiment sentences. To extract them, we design a novel Bi-segment method to obtain the main topic words and use TFIDF to obtain additional topic words. Using these words, we recombine the original sentences that contain topic words, so that any topic sentiment sentence that exists must lie in the set of recombined sentences. Then, based on an analysis of Chinese blogs, we convert the extraction problem into a Chinese chunking problem solved with CRFs. The method performs well in the extraction experiment.

Key words: Chinese blogs, sentiment analysis, Conditional Random Fields (CRFs)
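
The topic-word and sentence-selection steps mentioned in the abstract can be illustrated with a minimal sketch. It makes several assumptions that are not from the paper: documents arrive already segmented into words, the Bi-segment method (whose details the abstract does not give) is not reproduced, a plain TF-IDF score stands in for topic-word ranking, and keeping sentences that contain a topic word is only a simplified stand-in for the recombination step. The function names `tfidf_topic_words` and `select_topic_sentences` are hypothetical.

```python
import math
from collections import Counter


def tfidf_topic_words(docs, target, k=3):
    """Rank the words of docs[target] by TF-IDF and return the top k.

    docs is a list of documents, each a list of already-segmented words.
    This is plain TF-IDF, not the paper's Bi-segment method.
    """
    n_docs = len(docs)
    # Document frequency: in how many documents each word occurs.
    df = Counter(word for doc in docs for word in set(doc))
    tf = Counter(docs[target])
    total = len(docs[target])
    scores = {
        word: (count / total) * math.log(n_docs / df[word])
        for word, count in tf.items()
    }
    # Sort by descending score; break ties alphabetically for determinism.
    ranked = sorted(scores, key=lambda w: (-scores[w], w))
    return ranked[:k]


def select_topic_sentences(sentences, topic_words):
    """Keep only the sentences that contain at least one topic word."""
    topics = set(topic_words)
    return [s for s in sentences if topics & set(s)]


# Toy data (English tokens for readability; real input would be segmented Chinese).
docs = [
    ["phone", "screen", "great", "phone", "battery", "weak"],
    ["weather", "great", "today"],
    ["movie", "great", "plot", "weak"],
]
topic_words = tfidf_topic_words(docs, target=0, k=2)
sentences = [
    ["the", "phone", "is", "great"],
    ["weather", "today"],
    ["battery", "is", "weak"],
]
candidates = select_topic_sentences(sentences, topic_words)
```

In this toy run, a word such as "great" that appears in every document receives a score of zero, which is why it is not picked as a topic word even though it carries sentiment.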

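Abstract (translated from the Chinese): As a popular carrier of information and culture, blogs are being accepted by more and more people, and sentiment analysis of blog content has gradually become a hot topic in information mining. At present, sentiment analysis is mostly carried out by computing the polarity of words. Because not all sentiment-bearing words are topic-relevant, sentiment analysis at word granularity has certain shortcomings. To address this problem, we attempt to analyze at the sentence level, and mainly study the related problem of automatically extracting topic sentiment sentences. To extract topic-relevant sentiment sentences effectively, we design a novel extraction algorithm based on binary segmentation to obtain the topic words, then use the TFIDF algorithm to obtain additional secondary topic words, and use these topic words to recombine the original sentences that contain them. Therefore, if a topic sentiment sentence exists, it must be in the set of recombined topic sentences, and analyzing and extracting from this recombined set yields the topic sentiment sentences. Finally, CRFs are used to convert the topic-sentence extraction problem into a Chinese chunking problem, and good results are obtained in the extraction experiments.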

Keywords (translated from the Chinese): Chinese blogs, sentiment analysis, CRFs
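
Casting extraction as chunking usually means giving every token a label that marks whether it begins, continues, or lies outside a target span, so that a sequence model such as a CRF can predict the labels. The sketch below shows one common encoding, BIO, under assumptions that are not from the abstract: the tag name `TS` (for "topic sentiment"), the half-open span convention, and the helper names `spans_to_bio` and `bio_to_spans` are all hypothetical, and no CRF is trained here.

```python
def spans_to_bio(n_tokens, spans, tag="TS"):
    """Encode non-empty, half-open token spans [start, end) as BIO labels."""
    labels = ["O"] * n_tokens
    for start, end in spans:
        labels[start] = f"B-{tag}"
        for i in range(start + 1, end):
            labels[i] = f"I-{tag}"
    return labels


def bio_to_spans(labels):
    """Decode BIO labels back into half-open token spans [start, end)."""
    spans = []
    start = None
    for i, label in enumerate(labels):
        if label.startswith("B-"):
            # A new span begins; close any span that is still open.
            if start is not None:
                spans.append((start, i))
            start = i
        elif label == "O":
            if start is not None:
                spans.append((start, i))
            start = None
        # An "I-" label continues the open span; a stray "I-" is ignored.
    if start is not None:
        spans.append((start, len(labels)))
    return spans


# Tokens 1..3 of a six-token sequence form one target chunk.
labels = spans_to_bio(6, [(1, 4)])
recovered = bio_to_spans(labels)
```

The `B-` prefix is what lets two adjacent chunks be told apart after decoding, which is the usual reason for preferring BIO over a plain inside/outside labeling.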