Computer Engineering and Applications, 2011, Vol. 47, Issue (4): 128-130. DOI: 10.3778/j.issn.1002-8331.2011.04.035

• Database, Signal and Information Processing •

Study on improved CHI for feature selection in Chinese text categorization

PEI Yingbo1, LIU Xiaoxia2

  1. College of Information Science & Technology, Northwest University, Xi’an 710127, China
  • Received: 2009-05-18  Revised: 2009-07-06  Online: 2011-02-01  Published: 2011-02-01
  • Contact: PEI Yingbo

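Research on an improved CHI feature selection method in text categorization (original title: 文本分类中改进型CHI特征选择方法的研究)

PEI Yingbo (裴英博)1, LIU Xiaoxia (刘晓霞)2

  1. College of Information Science and Technology, Northwest University, Xi’an 710127
  • Corresponding author: PEI Yingbo (裴英博)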

Abstract: This paper analyzes the factors that affect the classification accuracy of the traditional CHI statistical approach and removes the cases in which a feature term is negatively correlated with a category. The improved approach is also applied to adjusting feature-term weights, which noticeably improves categorization quality. Furthermore, concentration, distribution (dispersion), and frequency information are introduced into the improved approach, which increases categorization accuracy on corpora with unevenly distributed categories. Experimental results verify the effectiveness and feasibility of the improved CHI approach.

Key words: text classification, feature selection, CHI statistical approach, weight adjustment techniques, distribution information, concentration information, frequency information

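Abstract (from the Chinese original): The factors that affect the classification accuracy of the traditional CHI statistical method are analyzed, and the cases in which a feature term is negatively correlated with a category are removed. The improved method is also used to adjust the weights of feature terms, which noticeably improves classification results. Dispersion, concentration, and frequency are then introduced into the improved method, raising its classification accuracy on corpora with unevenly distributed categories. Finally, experiments demonstrate the effectiveness and feasibility of the method.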

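Key words (from the Chinese original): text classification, feature selection, CHI statistic, weight adjustment, dispersion, concentration, frequency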
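
The abstract does not give the formulas, so the following is only a minimal sketch of the general idea it describes: the standard chi-square (CHI) statistic for a term and a category, plus one common way to discard negatively correlated terms by checking the sign of AD - BC. The function names and the zeroing rule are assumptions for illustration, not necessarily the authors' formulation, and the concentration, dispersion, and frequency factors are omitted because the abstract does not define them.

```python
def chi_square(a, b, c, d):
    """Standard chi-square statistic for a term/category 2x2 contingency table.

    a: documents in the category that contain the term
    b: documents outside the category that contain the term
    c: documents in the category that lack the term
    d: documents outside the category that lack the term
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        # Degenerate table (an empty row or column): no evidence either way.
        return 0.0
    return n * (a * d - b * c) ** 2 / denom


def positive_chi_square(a, b, c, d):
    """Hypothetical variant: score is zero unless the correlation is positive.

    The plain statistic squares (a*d - b*c), so it cannot tell a term that
    signals a category from one that signals its absence. Here a term with
    a*d - b*c <= 0 (not positively correlated) is given a score of zero.
    """
    if a * d - b * c <= 0:
        return 0.0
    return chi_square(a, b, c, d)


if __name__ == "__main__":
    # Term concentrated inside the category: positive correlation.
    print(positive_chi_square(40, 10, 10, 40))
    # Same counts mirrored: the term mostly appears outside the category.
    print(chi_square(10, 40, 40, 10), positive_chi_square(10, 40, 40, 10))
```

In the mirrored case the plain statistic gives the same score as in the positive case, while the sign check ranks the term at zero, which is the kind of distortion the abstract says the improved approach removes.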
