Computer Engineering and Applications ›› 2017, Vol. 53 ›› Issue (22): 105-110. DOI: 10.3778/j.issn.1002-8331.1605-0293

• Pattern Recognition and Artificial Intelligence •


Feature selection using normalized fuzzy joint mutual information maximum

DONG Zemin1, SHI Qiang2   

  1. Research and Training Center of City College, Wuhan University of Science and Technology, Wuhan 430083, China
    2. School of Software Engineering, Huazhong University of Science & Technology, Wuhan 430000, China
  • Online: 2017-11-15  Published: 2017-11-29


Abstract: Feature selection chooses, from the full feature set, a subset of features that is strongly relevant to the classification label and minimally redundant among the selected features. This improves the classifier's computational efficiency and its generalization ability, and therefore increases classification accuracy. However, relevance and redundancy criteria based on mutual information suffer from the following problems in practice: (1) the probability distribution of a variable, and hence its information entropy, is difficult to estimate; (2) mutual information tends to favour features with many distinct values; (3) redundancy measures between a candidate feature and the selected subset that rely on cumulative summation tend to fail on high-dimensional data sets. To address these problems, a feature evaluation criterion based on Normalized Fuzzy Joint Mutual Information Maximum (NFJMIM) is proposed. First, the entropy, conditional entropy, and joint entropy of a variable are computed from fuzzy equivalence relations. Second, the cumulative-summation redundancy measure is replaced by the maximum of the joint mutual information, and feature importance is evaluated with the normalized joint mutual information. Finally, a forward greedy search built on this criterion is used to select the feature subset. Experiments on UCI machine learning repository data sets show that the proposed algorithm selects feature subsets that are effective for classification and significantly improves classification accuracy.
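For orientation, the following is a minimal sketch of the quantities the abstract refers to, written with the fuzzy-equivalence-relation entropies that are standard in this line of work; the paper's exact definitions and normalization may differ. Here $R$ and $S$ are fuzzy equivalence relations induced by features (or the class label) on the sample set $U=\{x_1,\dots,x_n\}$, $[x_i]_R$ is the fuzzy equivalence class of $x_i$, and $|\cdot|$ denotes fuzzy cardinality (the sum of membership degrees).

```latex
% Sketch of fuzzy-equivalence-relation entropies and the normalized joint MI score
% (assumed standard forms; not necessarily the paper's exact formulation).
\begin{align}
H(R)       &= -\frac{1}{n}\sum_{i=1}^{n}\log\frac{|[x_i]_R|}{n}, &
H(R,S)     &= -\frac{1}{n}\sum_{i=1}^{n}\log\frac{|[x_i]_R \cap [x_i]_S|}{n},\\
H(S\mid R) &= H(R,S) - H(R), &
I(R;S)     &= H(R) + H(S) - H(R,S),\\
I(f,f_s;C) &= H(f,f_s) + H(C) - H(f,f_s,C), &
\mathrm{NI}(f,f_s;C) &= \frac{I(f,f_s;C)}{H(f,f_s,C)},
\end{align}
% and the candidate added at each step of the greedy search is
%   f^{*} = \arg\max_{f \notin S} \ \min_{f_s \in S} \ \mathrm{NI}(f, f_s; C).
```

Under this max-min rule, a candidate scores well only if it still adds information about the class when paired with its least complementary already-selected feature, which is what removes the need for a cumulative redundancy sum.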

Key words: fuzzy equivalence relation, joint mutual information, max-min criterion, feature selection
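To make the forward greedy search concrete, here is a short, self-contained sketch in Python. It is an illustration only: the names (entropy, normalized_joint_mi, nfjmim_select) are hypothetical, and plain discrete (count-based) entropy estimates are used in place of the paper's fuzzy-equivalence-relation entropies, so it reproduces the search strategy rather than the exact NFJMIM scores.

```python
# Illustrative forward greedy search with a max-min normalized joint MI score.
# Discrete (count-based) entropies stand in for the paper's fuzzy entropies.
from collections import Counter
import math


def entropy(*columns):
    """Joint Shannon entropy (in bits) of one or more equal-length discrete columns."""
    n = len(columns[0])
    counts = Counter(zip(*columns))
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def normalized_joint_mi(f, g, y):
    """I(f, g; y) / H(f, g, y): joint MI of the feature pair with the class,
    normalized by their joint entropy so the score lies in [0, 1]."""
    joint = entropy(f, g, y)
    mi = entropy(f, g) + entropy(y) - joint
    return mi / joint if joint > 0 else 0.0


def nfjmim_select(features, y, k):
    """Greedy forward selection of k features from a dict {name: column}.

    The first feature maximizes I(f; y); each later step adds the candidate
    whose minimum normalized joint MI with the already selected features
    and the class is largest (the max-min criterion)."""
    remaining = dict(features)
    first = max(remaining,
                key=lambda name: entropy(remaining[name]) + entropy(y)
                - entropy(remaining[name], y))
    selected = [first]
    del remaining[first]
    while remaining and len(selected) < k:
        best = max(remaining,
                   key=lambda name: min(normalized_joint_mi(remaining[name],
                                                            features[s], y)
                                        for s in selected))
        selected.append(best)
        del remaining[best]
    return selected


if __name__ == "__main__":
    # Tiny toy example: x1 and x2 together determine y (y = x1 XOR x2); x3 is noise.
    data = {"x1": [0, 0, 1, 1, 0, 1, 0, 1],
            "x2": [0, 1, 0, 1, 0, 1, 1, 0],
            "x3": [1, 0, 1, 1, 0, 0, 1, 0]}
    label = [0, 1, 1, 0, 0, 0, 1, 1]
    print(nfjmim_select(data, label, k=2))  # -> ['x1', 'x2']
```

Because the score of a candidate is its worst pairing with the already selected features, the toy example picks x1 and x2 even though neither is individually informative about the XOR label, which is the behaviour the joint-MI-maximum criterion is designed to capture.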