Computer Engineering and Applications ›› 2011, Vol. 47 ›› Issue (29): 124-126.

• Database, Signal and Information Processing •

Blog posts classification method based on analysis of article elements

LU Mengping1,HUANG Han1,CAI Zhaoquan2,ZHU Yifan1,HE Yiyu1,XU Zhenyu1   

  1. School of Software Engineering, South China University of Technology, Guangzhou 510006, China
  2. Educational Technology Center, Huizhou University, Huizhou, Guangdong 516007, China
  • Online: 2011-10-11 Published: 2011-10-11

基于文章要素影响分析的博客文章分类方法 (Blog post classification method based on analysis of the influence of article elements)

鲁梦平1,黄 翰1,蔡昭权2,朱一帆1,何翊宇1,徐震宇1   

  1. 华南理工大学 软件学院,广州 510006
  2. 惠州学院 教育技术中心,广东 惠州 516007

Abstract: Existing work on blog post classification typically applies traditional text classification methods directly, without taking the characteristics of blog posts into account. This paper improves classification results by analyzing the impact of article elements. A simple method for filtering noisy blog text is proposed to ensure the reliability of the blog data. Blog tags are used to extend the word lexicon, which improves word segmentation and thereby the accuracy of blog classification. The G1 method from the comprehensive evaluation model is used to calculate the weights of the article elements (title, tags, category, first paragraph, last paragraph, and remaining body text), and their influence on blog classification is analyzed. Experimental results show that the proposed method achieves better classification performance than the traditional TF-IDF method.

Key words: blog posts classification, blog text filtering, blog tags, article element, G1 method

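Abstract (Chinese version): Existing research on blog post classification usually adopts traditional text classification methods directly and does not take the characteristics of blogs into account. This study improves classification performance through an analysis of the influence of article elements. A simple blog text denoising method is proposed to ensure the reliability of blog data; a Chinese lexicon extension method based on blog tags is proposed to improve Chinese word segmentation and thereby the accuracy of blog classification; and the G1 method from the comprehensive evaluation model is used to compute the weights of the article elements in a blog post, namely the title, tags, category, first paragraph, last paragraph, and body, in order to analyze their influence on blog classification. Experimental results show that the proposed method achieves better classification results than the traditional TFIDF method.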

Key words (Chinese version): blog post classification, blog text denoising, blog tags, article elements, G1 method
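
As an illustration of the element weighting mentioned in the abstract, the following Python sketch computes weights with the G1 (order relation analysis) method as it is commonly formulated, and then applies them to per-element term counts. The importance ranking of the elements, the ratio values, the function names `g1_weights` and `weighted_term_counts`, and the sample post are all assumptions made for illustration; the abstract does not report the authors' actual rankings, ratios, or implementation.

```python
from collections import Counter
from math import prod


def g1_weights(ratios):
    """Compute G1 weights for elements ranked from most to least important.

    ratios[k] is the expert-assigned importance ratio w_k / w_{k+1}
    between adjacent elements in the ranking (hypothetical inputs here).
    Returns a list of len(ratios) + 1 weights that sum to 1.
    """
    n = len(ratios) + 1
    # Weight of the least important element:
    # 1 / (1 + sum over k of the product ratios[k] * ... * ratios[n-2]).
    tail_products = [prod(ratios[k:]) for k in range(len(ratios))]
    weights = [0.0] * n
    weights[-1] = 1.0 / (1.0 + sum(tail_products))
    # Walk back up the ranking: w_k = ratios[k] * w_{k+1}.
    for k in range(n - 2, -1, -1):
        weights[k] = ratios[k] * weights[k + 1]
    return weights


def weighted_term_counts(post, element_weights):
    """Sum term occurrences, scaling each one by its element's weight.

    post maps an element name to a list of tokens found in that element.
    """
    counts = Counter()
    for element, tokens in post.items():
        for token in tokens:
            counts[token] += element_weights[element]
    return counts


# Assumed ranking and ratios, for illustration only (not from the paper).
elements = ["title", "tag", "category",
            "first_paragraph", "last_paragraph", "body"]
ratios = [1.2, 1.0, 1.4, 1.2, 1.6]
element_weights = dict(zip(elements, g1_weights(ratios)))

# A made-up, already segmented post.
sample_post = {
    "title": ["travel", "guide"],
    "tag": ["travel"],
    "category": ["life"],
    "first_paragraph": ["travel", "city"],
    "last_paragraph": ["city"],
    "body": ["food", "city", "travel"],
}
term_scores = weighted_term_counts(sample_post, element_weights)
```

The resulting weighted counts could stand in for raw term frequencies in a TF-IDF style feature vector, so that a term in the title contributes more than the same term in the body. Whether the paper combines the weights with TF-IDF in exactly this way is not stated in the abstract.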