Computer Engineering and Applications ›› 2011, Vol. 47 ›› Issue (28): 124-127.

• Database, Signal and Information Processing •

Feature selection combined category concentration with minimal set covering

ZHANG Wenpeng1, LI Hongchan2, WANG Xing1

  1. School of Software, Nanyang Normal University, Nanyang, Henan 473061, China
  2. School of Computer and Communication Engineering, Zhengzhou University of Light Industry, Zhengzhou 450002, China
  • Received: 1900-01-01 Revised: 1900-01-01 Online: 2011-10-01 Published: 2011-10-01

Abstract: Feature selection is one of the core research topics in text categorization. After a brief analysis of word frequency and document frequency, a category concentration measure based on both is presented. The idea of set covering is introduced into rough sets, and an attribute reduction algorithm based on minimal set covering is proposed. Combining this attribute reduction algorithm with category concentration yields a new feature selection method. The method first uses category concentration for a preliminary selection that filters out some terms and reduces the sparsity of the feature space, and then applies the proposed attribute reduction algorithm to eliminate redundancy, so that a more representative feature subset is obtained. Experimental results show that the method performs well.

Key words: feature selection, text categorization, word frequency, document frequency, rough sets, attribute reduction
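
The abstract does not spell out the reduction algorithm, so the following is only a minimal sketch of the general idea of treating rough-set attribute reduction as a set covering problem, not the authors' method. It assumes a decision table given as rows of discrete feature values plus one class label per row: every pair of rows with different labels must be discerned, each attribute "covers" the pairs on which its values differ, and a standard greedy set cover heuristic picks attributes until all pairs are covered. The function name `greedy_cover_reduct` and the toy data are hypothetical, and the category concentration filter is left out because its formula is not given here.

```python
from itertools import combinations


def greedy_cover_reduct(rows, labels):
    """Greedy set-cover sketch of attribute reduction (an assumption,
    not the paper's algorithm).

    rows   -- list of equal-length tuples of discrete attribute values
    labels -- list of class labels, one per row
    Returns a sorted list of attribute indices that discern every pair of
    rows with different labels that can be discerned at all.
    """
    n_attrs = len(rows[0]) if rows else 0

    # Universe to cover: pairs of rows with different labels.
    # Pairs with identical attribute values cannot be discerned, so skip them.
    pairs = [
        (i, j)
        for i, j in combinations(range(len(rows)), 2)
        if labels[i] != labels[j] and rows[i] != rows[j]
    ]

    # Each attribute covers the pairs on which its values differ.
    covers = {
        a: {(i, j) for (i, j) in pairs if rows[i][a] != rows[j][a]}
        for a in range(n_attrs)
    }

    uncovered = set(pairs)
    chosen = []
    while uncovered:
        # Pick the attribute that discerns the most still-uncovered pairs.
        best = max(range(n_attrs), key=lambda a: len(covers[a] & uncovered))
        chosen.append(best)
        uncovered -= covers[best]
    return sorted(chosen)


if __name__ == "__main__":
    # Toy decision table: three binary term features, two classes.
    rows = [(1, 0, 1), (1, 1, 0), (0, 1, 1), (0, 0, 0)]
    labels = ["sports", "sports", "finance", "finance"]
    print(greedy_cover_reduct(rows, labels))  # prints [0]
```

The greedy heuristic is used here because exact minimal set covering is NP-hard; it returns a covering attribute subset that is usually small but is not guaranteed to be minimal.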