Computer Engineering and Applications, 2019, Vol. 55, Issue (11): 136-141. DOI: 10.3778/j.issn.1002-8331.1802-0070

Protein Subcellular Localization Prediction Based on SVM

LIU Qinghua1, LAI Yuping2, DING Hongwei1, YANG Zhijun1, CUI Xiaolong3

  1. School of Information Science and Engineering, Yunnan University, Kunming 650500, China
  2. College of Computer Science and Technology, North China University of Technology, Beijing 100144, China
  3. Institute of Microbiology, Yunnan University, Kunming 650500, China
  • Online: 2019-06-01  Published: 2019-05-30

基于SVM的蛋白质亚细胞定位预测

刘清华1,赖裕平2,丁洪伟1,杨志军1,崔晓龙3   

Abstract: Following a feature-fusion approach, amino acid composition, entropy density, and autocorrelation coefficients are combined into a 190-dimensional feature vector. Compared with the traditional method, which considers only amino acid composition, this representation better expresses protein structure information. Linear Discriminant Analysis (LDA) is then applied for dimensionality reduction, which lowers the computational complexity and strengthens the correlation between samples of the same class. A support vector machine (SVM) is selected as the classifier for localization prediction, and the jackknife (leave-one-out) test is used for cross-validation on the Gram-negative and Gram-positive data sets. The experimental results show that the multi-feature combination method outperforms both the traditional amino acid composition method and the simple autocorrelation coefficient method, which demonstrates the validity of the new method.
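The feature-fusion step can be illustrated with a minimal sketch. It is not the paper's implementation: the abstract does not give the exact formulas or the breakdown of the 190 dimensions, so the entropy-density definition (each residue's share of the Shannon entropy), the use of the Kyte-Doolittle hydrophobicity scale for the autocorrelation, the lag count `max_lag`, and all function names are assumptions chosen for illustration. The resulting vector is therefore shorter than 190 dimensions.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Kyte-Doolittle hydrophobicity scale. Assumption: the abstract does not say
# which physicochemical property the autocorrelation is computed on.
HYDROPHOBICITY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def amino_acid_composition(seq):
    """Relative frequency of each of the 20 standard residues."""
    counts = Counter(seq)
    return [counts[a] / len(seq) for a in AMINO_ACIDS]


def entropy_density(seq):
    """Each residue's share of the sequence's Shannon entropy (assumed definition)."""
    comp = amino_acid_composition(seq)
    terms = [-p * math.log2(p) if p > 0 else 0.0 for p in comp]
    total = sum(terms)
    return [t / total if total > 0 else 0.0 for t in terms]


def autocorrelation(seq, max_lag):
    """Mean product of hydrophobicity values at positions i and i + k, for k = 1..max_lag."""
    h = [HYDROPHOBICITY[a] for a in seq]
    n = len(h)
    return [
        sum(h[i] * h[i + k] for i in range(n - k)) / (n - k)
        for k in range(1, max_lag + 1)
    ]


def fused_features(seq, max_lag=5):
    """Concatenate the three feature groups into a single vector."""
    seq = seq.upper()
    if len(seq) <= max_lag:
        raise ValueError("sequence must be longer than max_lag")
    if any(a not in HYDROPHOBICITY for a in seq):
        raise ValueError("sequence contains a non-standard residue")
    return amino_acid_composition(seq) + entropy_density(seq) + autocorrelation(seq, max_lag)
```

With these assumptions, `fused_features(seq)` returns 20 composition values, 20 entropy-density values, and `max_lag` autocorrelation values. Such a vector would then be passed to the dimensionality-reduction and classification stages.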

Key words: feature fusion, entropy density, autocorrelation coefficient, Linear Discriminant Analysis (LDA), support vector machine
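The jackknife test named in the abstract holds out each sample in turn, trains on the remaining samples, and reports the fraction of held-out samples predicted correctly. The sketch below shows that protocol only. To stay dependency-free it uses a nearest-centroid classifier as a stand-in, which is not the SVM used in the paper, and it omits the LDA step. The function names are hypothetical.

```python
def nearest_centroid_predict(train_X, train_y, x):
    """Stand-in classifier (not an SVM): return the label of the closest class centroid."""
    sums, counts = {}, {}
    for features, label in zip(train_X, train_y):
        acc = sums.setdefault(label, [0.0] * len(features))
        for j, value in enumerate(features):
            acc[j] += value
        counts[label] = counts.get(label, 0) + 1

    best_label, best_dist = None, float("inf")
    for label, acc in sums.items():
        centroid = [value / counts[label] for value in acc]
        dist = sum((a - b) ** 2 for a, b in zip(x, centroid))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label


def jackknife_accuracy(X, y, predict=nearest_centroid_predict):
    """Leave-one-out accuracy: each sample is predicted by a model trained on all others."""
    correct = 0
    for i in range(len(X)):
        train_X = X[:i] + X[i + 1:]
        train_y = y[:i] + y[i + 1:]
        if predict(train_X, train_y, X[i]) == y[i]:
            correct += 1
    return correct / len(X)
```

Any classifier with the same `predict(train_X, train_y, x)` shape can be substituted for the stand-in, so an SVM trained on LDA-reduced features would plug into `jackknife_accuracy` unchanged.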
