Computer Engineering and Applications, 2022, Vol. 58, Issue 12: 149-154. DOI: 10.3778/j.issn.1002-8331.2109-0135

• Pattern Recognition and Artificial Intelligence •

Data Imputation for Methylation by Variational Auto-Encoder

WANG Xinfeng, HUANG Wei   

  1. Software College, Jishou University, Jishou, Hunan 416000, China
  2. School of Data and Computer Science, Sun Yat-sen University, Guangzhou 510000, China
  3. School of Computer Science and Engineering, Central South University, Changsha 410000, China
  • Online: 2022-06-15 Published: 2022-06-15

变分自编码器对甲基化缺失数据的填补

王新峰,黄伟   

  1. 吉首大学 软件学院,湖南 吉首 416000
  2. 中山大学 计算机学院,广州 510000
  3. 中南大学 计算机学院,长沙 410000

Abstract: High-throughput sequencing is an important method for studying DNA methylation. Owing to experimental and technical limitations, DNA methylation sequencing data contain missing values. To address this problem, VAE-MethImp, a model based on the variational auto-encoder (VAE) for imputing missing DNA methylation data, is proposed. VAE-MethImp is a deep latent-space generative model with a strong ability to reconstruct its input data, and it consists of an encoder layer, a hidden layer and a decoder layer. The encoder layer infers a mean and a variance; the hidden layer is the normal distribution specific to each input, computed from the mean and variance output by the encoder layer; the decoder layer decodes the information contained in the hidden variables to generate the reconstructed data. Imputation experiments on lung cancer and breast cancer data show that the features extracted by VAE-MethImp are more informative. Among the four traditional methods used for comparison, mean imputation (Mean), K-nearest neighbor (KNN), principal component analysis (PCA) and singular value decomposition (SVD), SVD performs best, and the imputation accuracy of VAE-MethImp is 4.8% higher than that of SVD. Survival analysis results show that the data imputed by VAE-MethImp have better predictive power, and also show that DNA methylation is directly related to cancer survival.

Key words: deep learning, variational auto-encoder, DNA methylation, data imputation, survival analysis
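
The encoder, hidden and decoder layers described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' VAE-MethImp implementation: the layer sizes, the tanh and sigmoid activations, the initial column-mean fill, the masked squared-error loss and all function names (`init_params`, `encode`, `sample_latent`, `decode`, `vae_loss`, `impute`) are assumptions made for illustration, and the weights are random rather than trained.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_params(n_features, n_hidden, n_latent):
    """Random (untrained) weights; all sizes are illustrative assumptions."""
    def dense(n_in, n_out):
        return rng.normal(0.0, 0.1, size=(n_in, n_out)), np.zeros(n_out)
    return {
        "enc": dense(n_features, n_hidden),
        "mu": dense(n_hidden, n_latent),
        "log_var": dense(n_hidden, n_latent),
        "dec_h": dense(n_latent, n_hidden),
        "dec_out": dense(n_hidden, n_features),
    }

def affine(x, layer):
    w, b = layer
    return x @ w + b

def encode(x, p):
    """Encoder layer: infer a mean and a log-variance for each sample."""
    h = np.tanh(affine(x, p["enc"]))
    return affine(h, p["mu"]), affine(h, p["log_var"])

def sample_latent(mu, log_var):
    """Hidden layer: draw z from N(mu, sigma^2) via the reparameterization trick."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def decode(z, p):
    """Decoder layer: map z back to feature space; sigmoid keeps values in (0, 1)."""
    h = np.tanh(affine(z, p["dec_h"]))
    return 1.0 / (1.0 + np.exp(-affine(h, p["dec_out"])))

def vae_loss(x_filled, x_hat, mu, log_var, observed):
    """Squared error on observed entries plus KL(N(mu, sigma^2) || N(0, 1))."""
    recon = np.sum(observed * (x_filled - x_hat) ** 2)
    kl = -0.5 * np.sum(1.0 + log_var - mu ** 2 - np.exp(log_var))
    return recon + kl

def impute(x, p):
    """Fill NaN entries with the reconstruction; observed entries are kept."""
    observed = ~np.isnan(x)
    # Assumed starting point: column means (each column needs at least one observed value).
    x_filled = np.where(observed, x, np.nanmean(x, axis=0))
    mu, log_var = encode(x_filled, p)
    x_hat = decode(sample_latent(mu, log_var), p)
    loss = vae_loss(x_filled, x_hat, mu, log_var, observed)
    return np.where(observed, x, x_hat), loss

# Toy matrix of methylation-like values in (0, 1) with three missing entries.
x = rng.uniform(0.05, 0.95, size=(6, 8))
x[0, 1] = x[2, 5] = x[4, 3] = np.nan
params = init_params(n_features=8, n_hidden=16, n_latent=4)
completed, loss = impute(x, params)
```

Because the weights are untrained, the imputed values here carry no biological meaning; in practice the parameters would first be fitted by minimizing a loss such as `vae_loss` over the observed entries.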

摘要: 针对高通量测序技术因各种原因导致的DNA甲基化测序数据中包含部分缺失值的问题。提出一种基于变分自编码器的DNA甲基化缺失数据填补模型VAE-MethImp。VAE-MethImp是一种深度隐含空间生成模型,由编码层、隐含层和解码层组成,拥有强大的重构输入数据能力。编码层进行均值和方差的推断;隐含层是通过编码层输出的均值和方差计算出的输入数据的专属正态分布;解码层对隐含层包含的特征进行解码生成重构后的数据。通过在肺癌和乳腺癌上的填补实验证明,VAE-MethImp提取的特征更具信息性。在填补精度上,VAE-MethImp比对照方法(均值(Mean)、最近邻(KNN)、主成分分析(PCA)和奇异值分解(SVD))中最优的SVD提升了4.8%。生存分析实验结果显示VAE-MethImp填补的数据具有更好的预测性,同时也证明DNA甲基化与癌症的生存存在直接关联。

关键词: 深度学习, 变分自编码器, DNA甲基化, 数据填补, 生存分析