Computer Engineering and Applications ›› 2012, Vol. 48 ›› Issue (36): 238-244.

• Engineering and Applications •

Predicting signal peptides with a method based on data partitioning and ensembling

王  怡1,2,郭躬德1,2,孔祥增1,2   

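  1. School of Mathematics and Computer Science, Fujian Normal University, Fuzhou 350007
  2. Key Lab of Network Security and Cryptography, Fujian Normal University, Fuzhou 350007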
  • Publication date: 2012-12-21 Release date: 2012-12-21

Method based on data dividing and integration for predicting signal peptides

WANG Yi1,2, GUO Gongde1,2, KONG Xiangzeng1,2   

  1. School of Mathematics and Computer Science, Fujian Normal University, Fuzhou 350007, China
  2. Key Lab of Network Security and Cryptography, Fujian Normal University, Fuzhou 350007, China
  • Online:2012-12-21 Published:2012-12-21

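Abstract: In signal peptide prediction, signal peptide sequences vary in length and their amino acid composition is diverse, so earlier methods usually process them with a sliding window, which causes information loss and data imbalance. To improve prediction of the minority class, the training data are preprocessed: the majority-class samples are partitioned, and each resulting group is merged with the minority-class samples to form several data subsets. Probabilistic neural networks are trained on these subsets under two protein encoding schemes to build multiple classifiers, and the classifiers are combined by weighted voting to predict signal peptides. Experiments on the widely used Neilsen dataset indicate that the method is reasonably effective.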

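Keywords: signal peptide prediction, imbalanced data sets, clustering-based partitioning, probabilistic neural network, multiple classifier fusion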

Abstract: Signal peptide sequences vary in length and their amino acid composition is diverse, so most existing methods for signal peptide prediction process the sequences with sliding windows, which can discard useful information and produce imbalanced data. To improve prediction performance on the minority class, the training data are preprocessed before probabilistic neural network classifiers are built: the majority class is divided into several groups, and each group is combined with the minority-class samples to form a data subset on which a probabilistic neural network classifier is trained. The ensemble then combines, by weighted voting, the outputs of the classifiers built under two different protein sequence encodings. Experiments on the widely used Neilsen dataset indicate that the proposed method is effective.

Key words: signal peptide prediction, imbalanced data sets, clustering-based partitioning, probabilistic neural networks, multiple classifier combination
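
The pipeline summarized in the abstract (partition the majority class, pair each group with the minority class, train one probabilistic neural network per subset, combine by weighted voting) can be illustrated with a minimal sketch. It is not the authors' implementation. Several choices are assumptions made only for illustration: k-means as the clustering step, training-set accuracy as the vote weight, a single numeric feature encoding instead of the two protein encodings, the smoothing parameter `sigma`, and all function and class names (`kmeans`, `PNN`, `train_ensemble`, `predict_ensemble`), which are hypothetical.

```python
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    """Plain k-means (assumed clustering step); returns a group label per row."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):  # keep the old center if a group empties
                centers[j] = X[labels == j].mean(axis=0)
    return labels

class PNN:
    """Probabilistic neural network: one Gaussian Parzen estimate per class."""
    def __init__(self, sigma=1.0):
        self.sigma = sigma  # smoothing parameter (assumed value)

    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.patterns_ = [X[y == c] for c in self.classes_]
        return self

    def predict(self, X):
        scores = []
        for P in self.patterns_:
            d2 = ((X[:, None, :] - P[None, :, :]) ** 2).sum(axis=2)
            scores.append(np.exp(-d2 / (2 * self.sigma ** 2)).mean(axis=1))
        return self.classes_[np.stack(scores, axis=1).argmax(axis=1)]

def train_ensemble(X, y, k=3, sigma=1.0, minority=1):
    """Train one PNN per (majority group + all minority samples) subset."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    X_min, X_maj, y_maj = X[y == minority], X[y != minority], y[y != minority]
    groups = kmeans(X_maj, k)
    models, weights = [], []
    for g in range(k):
        mask = groups == g
        if not mask.any():
            continue  # skip empty groups
        Xs = np.vstack([X_maj[mask], X_min])
        ys = np.concatenate([y_maj[mask], np.full(len(X_min), minority)])
        model = PNN(sigma).fit(Xs, ys)
        # Assumption: the vote weight is accuracy on the full training set.
        weights.append((model.predict(X) == y).mean())
        models.append(model)
    return models, np.array(weights)

def predict_ensemble(models, weights, X, classes=(0, 1)):
    """Weighted vote: each model adds its weight to the class it predicts."""
    X = np.asarray(X, dtype=float)
    votes = np.zeros((len(X), len(classes)))
    for model, w in zip(models, weights):
        pred = model.predict(X)
        for j, c in enumerate(classes):
            votes[:, j] += w * (pred == c)
    return np.asarray(classes)[votes.argmax(axis=1)]
```

Calling `train_ensemble(X, y)` on a feature matrix with labels 0 (majority) and 1 (minority) returns the trained models and their weights, which `predict_ensemble` then uses to label new rows. Each subset sees all minority samples but only one majority group, so every individual classifier is trained on a less skewed class ratio than the full data set.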