Computer Engineering and Applications ›› 2019, Vol. 55 ›› Issue (5): 143-148. DOI: 10.3778/j.issn.1002-8331.1710-0297

• Pattern Recognition and Artificial Intelligence •

Identification of Palmitoylation Sites of Proteins Using Modified CKSAAP Combined with RFE Method

TANG Yadong1, XIE Lu2, CHEN Lanming1

  1. College of Food Science and Technology, Shanghai Ocean University, Shanghai 201306, China
  2. Shanghai Center for Bioinformation Technology, Shanghai 201203, China
  • Online: 2019-03-01 Published: 2019-03-06

Abstract: Protein palmitoylation is a reversible post-translational modification that plays important roles in protein stability, subcellular localization, and other functions. This study constructs a new model, designated PSSM-CKSAAP-RFE, for predicting protein palmitoylation sites. Protein sequences are represented with a Composition of k-Spaced Amino Acid Pairs (CKSAAP) encoding that incorporates evolutionary information, and features are selected by Recursive Feature Elimination (RFE). A Support Vector Machine (SVM) classifier is trained on the selected features, and the performance of the model is evaluated with a Jackknife Cross Validation Test (JCVT). The accuracy, Matthews correlation coefficient, specificity, sensitivity, and area under the receiver operating characteristic curve (AUC) are 98.44%, 0.94, 98.95%, 95.65%, and 0.990 on the training dataset and 98.41%, 0.93, 99.39%, 92.31%, and 0.994 on the independent test dataset, respectively. These values are superior to those of related methods reported in the literature.

Keywords: protein palmitoylation sites, composition of k-spaced amino acid pairs, position-specific scoring matrix, support vector machine, recursive feature elimination
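
For readers unfamiliar with the encoding and the reported metrics, the following is a minimal sketch of the standard, count-based CKSAAP feature vector and of the sensitivity, specificity, accuracy, and Matthews correlation coefficient computed from a confusion matrix. It is not the authors' implementation: the abstract does not specify how evolutionary information is folded into CKSAAP, so that modification is omitted here, and the function names, the window string, and the `k_max` default are illustrative assumptions.

```python
from itertools import product
from math import sqrt

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# All 400 ordered residue pairs, in a fixed order that defines the feature layout.
PAIRS = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]


def cksaap(window, k_max=2):
    """Standard CKSAAP: for each gap k in 0..k_max, the frequency of every
    ordered residue pair separated by exactly k residues in `window`.
    Assumption: plain counts only, without the evolutionary-information
    modification mentioned in the abstract."""
    features = []
    for k in range(k_max + 1):
        counts = dict.fromkeys(PAIRS, 0)
        n_pairs = len(window) - k - 1  # number of k-spaced positions
        for i in range(max(n_pairs, 0)):
            pair = window[i] + window[i + k + 1]
            if pair in counts:  # ignore padding or non-standard residues
                counts[pair] += 1
        total = max(n_pairs, 1)  # avoid division by zero on short windows
        features.extend(counts[p] / total for p in PAIRS)
    return features


def binary_metrics(tp, tn, fp, fn):
    """Sensitivity, specificity, accuracy, and MCC from confusion-matrix counts."""
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "mcc": mcc,
    }
```

With `k_max=2` each window yields 3 × 400 = 1200 features. A vector of this kind would then be passed to feature selection and a classifier; in a jackknife (leave-one-out) test, each sample is held out once and the confusion-matrix counts are accumulated over all held-out predictions before calling `binary_metrics`.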