Computer Engineering and Applications ›› 2009, Vol. 45 ›› Issue (26): 216-219. DOI: 10.3778/j.issn.1002-8331.2009.26.065

• Engineering and Applications •

Prediction of protein coding regions by Takagi-Sugeno model

GUO Shuo 1,2, ZHU Yi-sheng 1

  1. College of Information Engineering, Dalian Maritime University, Dalian, Liaoning 116026, China
  2. College of Information Engineering, Shenyang Institute of Chemical Technology, Shenyang 110142, China
  • Received:2009-03-03 Revised:2009-04-10 Online:2009-09-11 Published:2009-09-11
  • Contact: GUO Shuo

Takagi-Sugeno fuzzy model identification of protein coding regions

郭 烁1,2,朱义胜1   


Abstract: Predicting coding regions in a DNA sequence is an important step in gene identification. Because of the large volume of gene data, many coding-region prediction algorithms generalize poorly and run slowly. In this paper, a Takagi-Sugeno model of the DNA sequence is built on the differing nucleotide composition of coding and non-coding regions. First, the system is quickly partitioned into several fuzzy subsets by a clustering algorithm based on the fuzzy likelihood function. Then, taking the number of clusters as the number of rules, a Takagi-Sugeno fuzzy model is built. Finally, the consequent parameters of the model are identified by least squares (LS). The algorithm not only predicts coding regions but also identifies the first nucleotide of each codon within them, which is significant for accurate translation into a protein sequence. The algorithm is simple, and simulation results show that the proposed method is more effective for coding-region prediction than existing coding-region discovery tools.

Key words: coding region in DNA sequence, codon, Takagi-Sugeno model, clustering algorithm, least squares (LS)

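Abstract (Chinese version): Identifying coding regions in DNA sequences is an important part of gene identification. Because gene sequence data are voluminous, many statistical identification algorithms generalize poorly and run slowly. Since coding-region sequences differ from non-coding-region sequences in base composition, the Takagi-Sugeno model is proposed for identifying coding regions in DNA sequences. First, a fuzzy clustering algorithm based on the fuzzy likelihood function determines the number of fuzzy partitions of the system; then a corresponding Takagi-Sugeno locally linearized model is built according to the number of clusters; finally, the consequent parameters of the model are identified by the least-squares method. The algorithm not only locates coding regions but also identifies the position of the first base of each codon, which is very important for research on protein structure. The algorithm is simple and efficient. Simulation results show that it is well suited to coding-region identification and is comparable to other coding-region identification algorithms.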

Key words (Chinese version): coding region in DNA sequence, codon, Takagi-Sugeno fuzzy model, fuzzy clustering, least-squares method
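
The last two steps named in the abstract (one rule per cluster, then least-squares identification of the consequent parameters) can be illustrated with a minimal sketch. It is not the authors' implementation. The normalized Gaussian memberships are an assumption that stands in for the fuzzy-likelihood clustering step, the function names `gaussian_memberships`, `fit_ts_consequents` and `predict_ts` are hypothetical, and the inputs are generic numeric features rather than the paper's DNA encoding, which the abstract does not specify.

```python
import numpy as np

def gaussian_memberships(X, centers, width):
    """Normalized Gaussian membership of each sample in each cluster.

    Assumption: a stand-in for memberships produced by a fuzzy clustering step.
    """
    # Squared distance between every sample and every center: shape (n, c).
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    w = np.exp(-d2 / (2.0 * width ** 2))
    return w / w.sum(axis=1, keepdims=True)

def fit_ts_consequents(X, y, memberships):
    """Least-squares estimate of the linear consequent of every rule."""
    n, d = X.shape
    c = memberships.shape[1]
    Xe = np.hstack([X, np.ones((n, 1))])  # append a bias column
    # Global regressor: block i is Xe weighted by the membership in rule i.
    Phi = np.hstack([memberships[:, [i]] * Xe for i in range(c)])
    theta, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    return theta.reshape(c, d + 1)  # row i = [slopes..., bias] of rule i

def predict_ts(X, memberships, theta):
    """Model output: membership-weighted sum of the local linear models."""
    Xe = np.hstack([X, np.ones((X.shape[0], 1))])
    local = Xe @ theta.T  # (n, c) output of each local model
    return (memberships * local).sum(axis=1)

# Toy usage with two rules on a one-dimensional input (illustrative data only).
X_demo = np.linspace(-2.0, 2.0, 50).reshape(-1, 1)
centers_demo = np.array([[-1.0], [1.0]])
mu_demo = gaussian_memberships(X_demo, centers_demo, width=0.8)
y_demo = np.where(X_demo[:, 0] < 0, 2 * X_demo[:, 0] + 1, -X_demo[:, 0] + 3)
theta_demo = fit_ts_consequents(X_demo, y_demo, mu_demo)
y_hat_demo = predict_ts(X_demo, mu_demo, theta_demo)
```

Solving for all rules in one global least-squares problem is one common choice; fitting each rule separately with weighted least squares is an alternative that gives more locally interpretable consequents.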
