Computer Engineering and Applications ›› 2013, Vol. 49 ›› Issue (5): 123-126.

• Database, Data Mining, Machine Learning •

Structural characteristics and content analysis fusion for blog post classification

ZHANG Yong (张永), WANG Fang (王芳), ZHANG Yiyun (张译匀)

  1. School of Computer and Communication, Lanzhou University of Technology, Lanzhou 730050, China
  • Online: 2013-03-01 Published: 2013-03-14

Abstract: Common text classification methods perform poorly when applied directly to blog posts. In content, a blog post often covers multiple themes, has no obvious category, and largely expresses the author's subjective views. In structure, it carries tags that differ from ordinary text. To address these problems, a blog post classification method that fuses structural characteristics and content analysis is proposed. On the content side, two different feature selection methods are iterated to make the feature set more representative, and posts are then classified by main body and by title. On the structural side, posts are classified by their blog-specific tags. The results from these three aspects (main body, title, and tags) are then fused. Experimental results show that the improved method effectively raises blog post classification performance and is clearly better than common text classification methods.

Key words: text classification, blog post classification, structural characteristics, content analysis
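
The abstract says the main body, title, and tag classifications are fused but does not state the fusion rule. As a minimal sketch only, the Python below assumes a weighted sum of per-aspect class scores. The function names, category labels, scores, and weights are all hypothetical and are not taken from the paper.

```python
from collections import defaultdict


def fuse_scores(aspect_scores, weights):
    """Combine per-aspect class scores with a weighted sum.

    The weighted-sum rule is an assumption made for illustration;
    the abstract does not specify how the three aspects are fused.
    """
    fused = defaultdict(float)
    for aspect, scores in aspect_scores.items():
        w = weights.get(aspect, 0.0)  # aspects without a weight are ignored
        for label, score in scores.items():
            fused[label] += w * score
    return dict(fused)


def classify(aspect_scores, weights):
    """Return the label with the highest fused score."""
    fused = fuse_scores(aspect_scores, weights)
    return max(fused, key=fused.get)


# Hypothetical scores for one post from three separate classifiers.
example_scores = {
    "body":  {"sports": 0.40, "travel": 0.35, "tech": 0.25},
    "title": {"sports": 0.20, "travel": 0.60, "tech": 0.20},
    "tags":  {"sports": 0.10, "travel": 0.80, "tech": 0.10},
}
# Hypothetical aspect weights; in practice these would need tuning.
example_weights = {"body": 0.5, "title": 0.2, "tags": 0.3}

fused = fuse_scores(example_scores, example_weights)
predicted = classify(example_scores, example_weights)
```

In this made-up case the main body alone slightly favors one category, while the title and tags favor another. That is the multi-theme ambiguity the abstract describes, and it is why evidence from all three aspects is combined before choosing a label.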