Computer Engineering and Applications ›› 2023, Vol. 59 ›› Issue (5): 70-77.DOI: 10.3778/j.issn.1002-8331.2206-0472

• Theory, Research and Development •

Prediction of Disease Gene Based on Fusion Features and GA-SVM Algorithm

TAN Zhuokun, LUO Longfei, WANG Shunfang   

  1. School of Information Science and Engineering, Yunnan University, Kunming 650500, China
  • Online: 2023-03-01 Published: 2023-03-01




Abstract: A single biological data network provides only limited feature information. To address this, a multi-network feature fusion method based on a semi-supervised autoencoder is proposed to enrich the feature information. In addition, because manually set model hyperparameters tend to yield low model performance and solutions trapped in local optima, a support vector machine optimized by a genetic algorithm (the GA-SVM algorithm) is proposed to improve the prediction of brain disease genes. First, similarity data networks are constructed from different data sources. Next, features are extracted from the four data networks with the random walk with restart algorithm, then processed and fused by the semi-supervised autoencoder. Finally, under 10-fold cross-validation, the GA-SVM model is used to predict disease genes and is compared with other algorithms. Experimental results show AUC and AUPR values of 0.805 and 0.792 on the PD dataset and of 0.825 and 0.823 on the MDD dataset, all higher than those of existing models, indicating that the method effectively improves the prediction of brain disease genes.
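
The feature-extraction step named in the abstract, random walk with restart, can be illustrated with a minimal sketch. This is not the authors' implementation: the toy network, the restart probability of 0.5, the convergence tolerance, and the function names are all assumptions chosen for illustration. The walk iterates p(t+1) = (1 - r) W p(t) + r p(0), where W is the column-normalized adjacency matrix, r is the restart probability, and p(0) is the one-hot vector of the seed gene; the converged vector serves as that gene's feature vector for one network.

```python
def column_normalize(adj):
    """Scale each column of a square adjacency matrix so it sums to 1."""
    n = len(adj)
    col_sums = [sum(adj[i][j] for i in range(n)) for j in range(n)]
    return [
        [adj[i][j] / col_sums[j] if col_sums[j] else 0.0 for j in range(n)]
        for i in range(n)
    ]


def random_walk_with_restart(adj, seed, restart=0.5, tol=1e-10, max_iter=1000):
    """Return the steady-state visiting probabilities for a walk from `seed`.

    `restart`, `tol`, and `max_iter` are illustrative defaults, not values
    taken from the paper.
    """
    n = len(adj)
    w = column_normalize(adj)
    p0 = [1.0 if i == seed else 0.0 for i in range(n)]
    p = p0[:]
    for _ in range(max_iter):
        # One step: follow an edge with prob. (1 - restart), else jump to seed.
        nxt = [
            (1 - restart) * sum(w[i][j] * p[j] for j in range(n)) + restart * p0[i]
            for i in range(n)
        ]
        # Stop once the L1 change between iterations is negligible.
        if sum(abs(a - b) for a, b in zip(nxt, p)) < tol:
            return nxt
        p = nxt
    return p


# Hypothetical 4-gene similarity network with edges 0-1, 0-2, 1-2, 2-3.
adj = [
    [0, 1, 1, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
]

# One feature vector per gene: row i is the walk seeded at gene i.
features = [random_walk_with_restart(adj, seed=i) for i in range(len(adj))]
```

In a multi-network setting such as the one the abstract describes, a vector like `features[i]` would be computed per network and the per-network vectors passed on to the fusion step. How the paper combines them beyond "processed and fused by the semi-supervised autoencoder" is not specified here.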

Key words: GA-SVM algorithm, multi-network fusion, semi-supervised autoencoder, brain disease genes, 10-fold cross validation
