Computer Engineering and Applications (计算机工程与应用), 2019, Vol. 55, Issue 6: 145-150. DOI: 10.3778/j.issn.1002-8331.1712-0265

• Pattern Recognition and Artificial Intelligence •

序列信息融合与两阶段特征选择的膜蛋白预测

郭  磊,王顺芳   

  1. 云南大学 信息学院 计算机科学与工程系,昆明 650504
  • 出版日期:2019-03-15 发布日期:2019-03-14

Prediction of Membrane Protein Based on Sequence Information Fusion and Two-Stage Feature Selection

GUO Lei, WANG Shunfang   

  1. Department of Computer Science and Engineering, School of Information Science and Engineering, Yunnan University, Kunming 650504, China
  • Online: 2019-03-15 Published: 2019-03-14

摘要: 膜蛋白的功能与其类型密切相关,因此膜蛋白类型的预测具有重要意义。针对膜蛋白特征表达过程中出现的特征维数高的问题,结合最大信息系数与遗传算法提出一种两阶段特征选择(MIC-GA)。抽取膜蛋白序列信息中的伪氨基酸组成、二肽组成和位置特异性分数矩阵等特征融合后作为特征参数,并在融合过程中提出一种改进的ReliefF算法(FReliefF)得到更有效的特征分数。基于Stacking集成学习框架,两次使用极端随机树对膜蛋白类型进行合理化预测。结果表明该方法能够有效提高膜蛋白预测的准确率。

关键词: 膜蛋白预测, 最大信息系数, 遗传算法, 特征选择, 特征融合, 极端随机树

Abstract: Predicting the type of a membrane protein is of great significance because a membrane protein's function is closely related to its type. To address the high dimensionality that arises when representing membrane protein features, this study proposes a two-stage feature selection method (MIC-GA) that combines the Maximum Information Coefficient (MIC) with a Genetic Algorithm (GA). Three kinds of features are extracted from each membrane protein sequence and fused into the feature parameters: pseudo-amino acid composition (PseAAC), dipeptide composition (DC), and the position-specific scoring matrix (PSSM). During feature fusion, an improved ReliefF algorithm (FReliefF) is proposed to obtain more effective feature scores. Finally, the extremely randomized trees classifier is applied twice within a Stacking ensemble learning framework to predict membrane protein types. The results show that the proposed method effectively improves the accuracy of membrane protein prediction.

Key words: membrane protein type prediction, maximum information coefficient, genetic algorithm, feature selection, feature fusion, extremely randomized tree
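
Of the three sequence features named in the abstract, dipeptide composition (DC) is the simplest to illustrate. The sketch below assumes the common textbook definition, in which each of the 400 ordered pairs of the 20 standard amino acids is counted over adjacent residues and normalized by the number of pairs. The abstract does not give the authors' exact formulation, so the function name `dipeptide_composition`, the residue ordering, and the handling of non-standard residues are illustrative assumptions rather than the paper's implementation.

```python
from itertools import product
from typing import List

# The 20 standard amino acids (one-letter codes); the ordering is an assumption.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# All 400 ordered residue pairs, e.g. "AA", "AC", ..., "YY".
DIPEPTIDES = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]


def dipeptide_composition(sequence: str) -> List[float]:
    """Return a 400-dimensional dipeptide composition vector.

    Each entry is the count of one adjacent residue pair divided by the
    number of valid pairs in the sequence. Pairs that contain a
    non-standard residue (for example 'X') are skipped, which is an
    assumption of this sketch.
    """
    seq = sequence.upper()
    counts = dict.fromkeys(DIPEPTIDES, 0)
    total = 0
    for i in range(len(seq) - 1):
        pair = seq[i:i + 2]
        if pair in counts:
            counts[pair] += 1
            total += 1
    if total == 0:
        # Sequence too short or without valid pairs: return an all-zero vector.
        return [0.0] * len(DIPEPTIDES)
    return [counts[d] / total for d in DIPEPTIDES]


if __name__ == "__main__":
    # Hypothetical short sequence used only for illustration.
    features = dipeptide_composition("MKTAYIAKQR")
    print(len(features))
```

A vector like this would be concatenated with the PseAAC and PSSM features before any feature selection step; with 400 dimensions from DC alone, the high dimensionality mentioned in the abstract follows directly.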