Computer Engineering and Applications, 2025, Vol. 61, Issue 3: 155-165. DOI: 10.3778/j.issn.1002-8331.2404-0088

• Theory, Research and Development •

Gene Association Analysis Algorithm for Gene Regulatory Network

LI Zhijie, LIAO Sha, LIU Anfeng, LI Qinglan   

  1. School of Information Science and Engineering, Hunan Institute of Science and Technology, Yueyang, Hunan 414006, China
  2. School of Computer, Central South University, Changsha 410083, China
  3. Medical College, University of Pennsylvania, Philadelphia, Pennsylvania 19019, USA
  • Published: 2025-02-01 Online: 2025-01-24

Abstract: A gene regulatory network is a simulation or reconstruction, based on microarray gene expression data, of the degree of dependence among gene expression relationships. Mining the causal relationships that exist between genes from gene expression data is therefore of great significance for reconstructing gene regulatory networks. This paper proposes a gene association analysis algorithm based on the association entropy of frequent atomic sequences. The algorithm identifies causal relationships between genes through gene association entropy and constructs a gene association based Bayesian regulatory (GABR) network using a heuristic search strategy. Unlike gene Bayesian networks, which describe dependencies between gene expression level values, GABR is a gene sequence Bayesian network: the objects of gene association analysis are the sequences formed by sorting the gene expression values of each biological tissue sample and replacing them with gene column indices. The advantage of the algorithm is that gene variables take atomic sequences as values, with the gene being the outcome of the atomic sequence, so that the calculation of gene association entropy and of conditional probability distributions is more consistent with the essential biological characteristics of gene expression data analysis. Experimental results on simulated data from the ALARM network show that the gene association analysis algorithm clearly outperforms comparable algorithms. Experiments on GEO datasets, including the yeast microarray gene dataset GDS2267 and the mouse embryo gene dataset GSE76118, show that the gene regulatory networks reconstructed by the GABR method have high effectiveness and robustness.

Key words: gene expression data, gene regulation, frequent atomic sequence, association entropy, gene sequence Bayesian network
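
The abstract describes the input to gene association analysis as sequences obtained by sorting each sample's gene expression values and replacing them with gene column indices. A minimal sketch of that preprocessing step is given below. The function name `expression_to_index_sequences`, the toy data, and the choice of descending order are assumptions made for illustration; the abstract does not specify the sort direction or how ties are handled, and this is not the authors' implementation.

```python
def expression_to_index_sequences(matrix, descending=True):
    """Convert each sample (row) of expression values into a sequence of
    gene column indices ordered by expression value.

    Assumption: descending order by default; the source does not state
    the direction. Ties keep their original column order because
    Python's sort is stable, even with reverse=True.
    """
    sequences = []
    for row in matrix:
        # Sort the column indices by the expression value found in each column.
        order = sorted(range(len(row)), key=lambda j: row[j], reverse=descending)
        sequences.append(order)
    return sequences


# Hypothetical toy data: 2 tissue samples (rows) x 3 genes (columns).
samples = [
    [2.1, 0.5, 3.3],
    [0.2, 1.9, 1.1],
]
index_sequences = expression_to_index_sequences(samples)
print(index_sequences)
```

Each resulting row is a permutation of the gene column indices, so later steps operate on the relative ordering of genes within a sample and not on raw expression levels.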