Computer Engineering and Applications, 2014, Vol. 50, Issue (19): 113-117.

• Database, Data Mining, Machine Learning •

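Microblog feature extraction method based on an improved chi-square statistic

XU Ming1, GAO Xiang2, XU Zhigang2, LIU Lei2

  1. Modern Technological Center in Education, Beijing University of Technology, Beijing 100124
  2. College of Applied Sciences, Beijing University of Technology, Beijing 100124
  • Publication date: 2014-10-01 Release date: 2014-09-29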

Feature selection methods of microblogging based on improved CHI-square statistics

XU Ming1, GAO Xiang2, XU Zhigang2, LIU Lei2   

  1. Modern Technological Center in Education, Beijing University of Technology, Beijing 100124, China
  2. College of Applied Sciences, Beijing University of Technology, Beijing 100124, China
  • Published: 2014-10-01 Online: 2014-09-29

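Abstract: Through analysis and study of the feature information of microblog text, this paper proposes a microblog feature extraction method based on an improved chi-square statistic. The classification features of microblog information are expanded, and factors such as frequency are introduced on top of the traditional chi-square statistic to improve feature selection; building on the traditional calculation of feature-term weights, a new method based on the improved chi-square statistic is proposed to improve the weighting. The method is tested with the classical KNN and SVM algorithms, and the experimental results show that it improves the accuracy of microblog information classification.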

Key words: microblog classification, chi-square statistic, feature selection, weight calculation

Abstract: This paper analyzes the feature information of microblog text and proposes a microblog feature extraction method based on an improved chi-square statistic. First, the classification features of microblog information are expanded, and frequency and other factors are introduced into the traditional chi-square statistic to improve feature selection. Then, building on the traditional calculation of feature-term weights, a new method based on the improved chi-square statistic is proposed to improve the weighting. Finally, the method is tested with the classical KNN and SVM algorithms; the experimental results show that it improves the accuracy of microblog information classification.
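
For orientation, the traditional chi-square statistic that the abstract takes as its starting point can be sketched as below. This is only the classical textbook form, computed from a 2x2 term/category contingency table. The frequency-based modification and the new weighting scheme proposed in the paper are not specified in the abstract and are not reproduced here. The function name `chi_square` and the parameter names `a`, `b`, `c`, `d` are illustrative assumptions, not taken from the paper.

```python
def chi_square(a: int, b: int, c: int, d: int) -> float:
    """Classical chi-square score of a term with respect to a category.

    a: documents in the category that contain the term
    b: documents outside the category that contain the term
    c: documents in the category that lack the term
    d: documents outside the category that lack the term

    Note: this is the traditional statistic only; the paper's improved
    variant (with frequency and other factors) is not shown here.
    """
    n = a + b + c + d  # total number of documents
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        # A degenerate table (an empty row or column) carries no evidence.
        return 0.0
    return n * (a * d - c * b) ** 2 / denom


if __name__ == "__main__":
    # A term concentrated in one category scores high ...
    print(chi_square(40, 10, 10, 40))
    # ... while a term spread evenly across categories scores zero.
    print(chi_square(25, 25, 25, 25))
```

In a typical feature-selection pipeline, terms would be ranked by this score per category and the top-scoring ones kept as features for a classifier such as KNN or SVM. Because the classical score depends only on document counts, it ignores how often a term occurs within a document, which is the kind of gap a frequency factor is meant to address.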

Key words: microblog classification, chi-square statistic, feature selection, weight calculation