Computer Engineering and Applications ›› 2020, Vol. 56 ›› Issue (13): 189-193. DOI: 10.3778/j.issn.1002-8331.1902-0179

Classification of Cancer Based on Deep Forest and DNA Methylation

LIU Chao, WU Shen, ZHENG Yichao, HOU Weiyan   

  1. School of Information Engineering, Zhengzhou University, Zhengzhou 450001, China
  • Online: 2020-07-01 Published: 2020-07-02

基于深度森林和DNA甲基化的癌症分类研究

刘超,吴申,郑一超,侯维岩   

  1. 郑州大学 信息工程学院,郑州 450001

Abstract:

As an important epigenetic phenomenon in the human genome, DNA methylation plays an important regulatory role in gene expression and is closely related to cancer. The huge data sets of The Cancer Genome Atlas (TCGA) are class-imbalanced and high-dimensional, which greatly increases the false negative rate. To address this problem, an ensemble classification algorithm for imbalanced data based on mixed sampling is proposed. The Synthetic Minority Over-sampling Technique (SMOTE) algorithm generates new minority-class samples to produce an extended data set, and the Tomek Link algorithm then removes the noise introduced during sample expansion, yielding a relatively balanced data set. On this basis, the cascade forest structure of the deep forest (gcForest) algorithm is used, with two types of random forest chosen for each layer to enhance the generalization ability of the model, to obtain the final classification model. Experiments on DNA methylation data of 6 kinds of cancer show that the proposed algorithm effectively improves the sensitivity to the minority classes while maintaining the classification accuracy of the majority classes.

Key words: DNA methylation, The Cancer Genome Atlas (TCGA), Synthetic Minority Over-sampling Technique (SMOTE), Tomek Link algorithm, gcForest algorithm
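The mixed sampling step described in the abstract can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the function names (`nearest`, `smote`, `remove_tomek_links`), the parameters `k`, `n_new` and `seed`, and the toy two-dimensional data are all assumptions made for illustration. The sketch also assumes that both members of each Tomek link are removed and that the minority class has at least two samples; the abstract does not specify either point. In practice a library such as imbalanced-learn, which provides `SMOTETomek` in `imblearn.combine`, would normally be used instead.

```python
import math
import random


def nearest(idx, points, candidates, k=1):
    """Return the k indices in `candidates` closest to points[idx], excluding idx."""
    others = [j for j in candidates if j != idx]
    others.sort(key=lambda j: math.dist(points[idx], points[j]))
    return others[:k]


def smote(X, y, minority, n_new, k=3, seed=0):
    """Append n_new synthetic minority samples (hypothetical helper).

    Each synthetic sample lies on the segment between a randomly chosen
    minority sample and one of its k nearest minority neighbours.
    """
    rng = random.Random(seed)
    minority_idx = [i for i, label in enumerate(y) if label == minority]
    X_out, y_out = [list(p) for p in X], list(y)
    for _ in range(n_new):
        i = rng.choice(minority_idx)
        j = rng.choice(nearest(i, X, minority_idx, k))
        gap = rng.random()  # position along the segment from X[i] to X[j]
        X_out.append([a + gap * (b - a) for a, b in zip(X[i], X[j])])
        y_out.append(minority)
    return X_out, y_out


def remove_tomek_links(X, y):
    """Drop both members of every Tomek link (hypothetical helper).

    A Tomek link is a pair of samples with different labels that are
    each other's nearest neighbour.
    """
    everyone = range(len(X))
    nn = [nearest(i, X, everyone)[0] for i in everyone]
    linked = {i for i in everyone if nn[nn[i]] == i and y[i] != y[nn[i]]}
    keep = [i for i in everyone if i not in linked]
    return [X[i] for i in keep], [y[i] for i in keep]


# Toy data (assumed): 6 majority samples (label 0), 3 minority samples (label 1).
X = [[0.0, 0.0], [0.2, 0.1], [0.1, 0.3], [0.3, 0.3], [0.4, 0.0], [0.0, 0.4],
     [2.0, 2.0], [2.2, 2.1], [2.1, 2.3]]
y = [0, 0, 0, 0, 0, 0, 1, 1, 1]

X_bal, y_bal = smote(X, y, minority=1, n_new=3)       # oversample the minority class
X_clean, y_clean = remove_tomek_links(X_bal, y_bal)   # then clean boundary noise
```

Because the two toy classes are well separated, the cleaning step removes nothing here. Tomek links appear only where samples of different classes are mutual nearest neighbours, which is where oversampling tends to introduce noise.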

摘要:

作为人类基因组重要的表观遗传现象,DNA甲基化对基因的表达发挥着重要的调控作用,与癌症的关系密切。针对癌症基因组图谱(TCGA)庞大数据的类不平衡和高维度,致使假阴率大幅增加的问题,提出了一种混合采样的不平衡数据集成分类算法,使用合成少数过采样(SMOTE)算法生成新的少数类样本,得到扩充后的数据集,通过Tomek Link算法剔除样本扩充过程中引入的噪声,得到相对平衡的数据集。在此基础上,利用深度森林(gcForest)算法的级联森林结构,每一层选取两种随机森林结构,以增强模型的泛化能力,得到最终的分类模型。对6种癌症的DNA甲基化数据实验表明混合采样的不平衡数据集成分类算法在保证多数类分类精度的前提下,有效地提高了对于少数类的灵敏度。

关键词: DNA甲基化, 癌症基因组图谱(TCGA), 合成少数类采样技术(SMOTE), Tomek Link算法, gcForest算法