Computer Engineering and Applications, 2020, Vol. 56, Issue (4): 247-255. DOI: 10.3778/j.issn.1002-8331.1811-0045

• Engineering and Applications •

Application of Deep Belief Network in Recognition of Protein Coding Regions

HU Qingyu, LIU Guangchen

  1. School of Mathematics & Statistics Science, Ludong University, Yantai, Shandong 264025, China
  2. School of Mathematics & Statistics, Chongqing University, Chongqing 401331, China
  • Online: 2020-02-15 Published: 2020-03-06

Abstract:

To identify protein coding regions in eukaryotic DNA sequences, a combined model based on the Deep Belief Network (DBN) is proposed. First, eukaryotic DNA sequences are converted to numerical form using signal processing techniques, and numerical features of the converted sequences are extracted with statistical methods. Second, random forest is used to reduce the dimensionality of the extracted feature variables. Then, the DNA sequences are classified by the deep belief network model. Finally, the Short Time Fourier Transform (STFT) is used to locate the exon regions accurately. The combined model is compared with the traditional logistic regression model and the Bayes discriminant model on three standard test sets; the results show that the accuracy, specificity, and other metrics of the DBN combined model are clearly better than those of the logistic regression and Bayes discriminant models.

Key words: coding region identification, signal processing, random forest, Deep Belief Network (DBN), Short Time Fourier Transform (STFT)
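
The numerical conversion and STFT steps named in the abstract can be illustrated with a minimal sketch. The abstract does not specify which numerical mapping or window settings the authors use, so the following are assumptions made only for illustration: the Voss binary indicator mapping, a rectangular window of 351 bases, and evaluation of a single frequency bin at 1/3 (the period-3 signal commonly associated with coding regions). The function names `voss_indicators` and `period3_power` are hypothetical and are not taken from the paper.

```python
import cmath

BASES = "ACGT"


def voss_indicators(seq):
    """Map a DNA string to four binary indicator sequences (Voss mapping, assumed here)."""
    seq = seq.upper()
    return {b: [1.0 if ch == b else 0.0 for ch in seq] for b in BASES}


def period3_power(seq, window=351, step=3):
    """Sliding-window spectral power at frequency 1/3.

    This is one bin (k = window / 3) of a short-time Fourier transform with a
    rectangular window, summed over the four indicator sequences.
    Returns a list of (window_start, power) pairs.
    """
    if window % 3 != 0:
        raise ValueError("window must be a multiple of 3 so that bin window/3 exists")
    indicators = voss_indicators(seq)
    # Complex exponentials for DFT bin k = window/3: exp(-2*pi*i*n/3)
    twiddle = [cmath.exp(-2j * cmath.pi * n / 3) for n in range(window)]
    out = []
    for start in range(0, len(seq) - window + 1, step):
        power = 0.0
        for b in BASES:
            segment = indicators[b][start:start + window]
            coeff = sum(v * w for v, w in zip(segment, twiddle))
            power += abs(coeff) ** 2
        out.append((start, power))
    return out
```

Windows whose power stands well above the background level would be candidate exon regions under this sketch; the threshold, and how this step interacts with the DBN classifier, are left open because the abstract does not describe them.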