Computer Engineering and Applications ›› 2018, Vol. 54 ›› Issue (16): 235-240. DOI: 10.3778/j.issn.1002-8331.1704-0408

• Engineering and Applications •

Combined use of PCA and random forests to identify high-informative SNP loci: application in sheep population identification

LIU Yueli1, QIN Xizhong1, HE Sangang2, LI Wenrong2, WANG Yue1, JIA Zhenhong1, LIU Mingjun2   

  1. College of Information Science and Engineering, Xinjiang University, Urumqi 830046, China
  2. The Key Laboratory of Livestock Reproduction & Breed Biotechnology of MOA, The Key Laboratory of Animal Biotechnology of Xinjiang, Xinjiang Academy of Biotechnological Center, Urumqi 830046, China
  • Online: 2018-08-15 Published: 2018-08-09

Abstract: SNP (single nucleotide polymorphism) data used for breed identification are high-dimensional while sample sizes are small. To identify breeds correctly from a small number of highly informative SNP loci, this paper proposes a new method for screening SNP loci. First, PCA is used to extract the principal SNP loci. A random forest is then trained as a classification model, and the importance of these loci is assessed by the mean decrease in accuracy and the mean decrease in the Gini index. Finally, the top 48 and top 96 loci by importance are selected as classification features, and a classification model built on them is used for breed identification. The model is applied to Illumina OvineSNP50 SNP data from 6 sheep breeds. Experiments show that 49 or 96 highly informative loci can be screened from 46013 loci for breed identification, with an identification accuracy above 97%. The proposed method reduces the number of SNP loci needed for breed identification and lowers its cost.

Key words: Principal Component Analysis (PCA), random forests, high-informative SNP loci, breed identification
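
The abstract ranks loci partly by the decrease in the Gini index. As a rough illustration of that quantity only, the following Python sketch scores each locus by the impurity decrease of its best single split and keeps the top-ranked loci. It is not the authors' implementation: a random forest averages such decreases over many trees and nodes, whereas this sketch looks at one root split per locus. The 0/1/2 genotype coding, the toy data, and all function names are assumptions made for the example.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels: 1 - sum(p_k^2)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def gini_decrease(genotypes, labels, threshold):
    """Impurity decrease from splitting samples on genotype <= threshold."""
    left = [y for g, y in zip(genotypes, labels) if g <= threshold]
    right = [y for g, y in zip(genotypes, labels) if g > threshold]
    n = len(labels)
    weighted = len(left) / n * gini(left) + len(right) / n * gini(right)
    return gini(labels) - weighted


def best_gini_decrease(genotypes, labels):
    """Best decrease over the two thresholds possible for 0/1/2 codes."""
    return max(gini_decrease(genotypes, labels, t) for t in (0, 1))


def top_k_loci(snp_matrix, labels, k):
    """Rank loci (columns) by best single-split Gini decrease, return top k."""
    n_loci = len(snp_matrix[0])
    scores = []
    for j in range(n_loci):
        column = [row[j] for row in snp_matrix]
        scores.append((best_gini_decrease(column, labels), j))
    # Highest decrease first; ties are broken by the lower locus index.
    scores.sort(key=lambda s: (-s[0], s[1]))
    return [j for _, j in scores[:k]]


# Hypothetical toy data: 6 animals x 3 loci, genotypes coded 0/1/2.
snp_matrix = [
    [0, 1, 2],
    [0, 2, 0],
    [0, 0, 1],
    [2, 1, 2],
    [2, 2, 0],
    [2, 0, 1],
]
labels = ["A", "A", "A", "B", "B", "B"]  # two hypothetical breeds
ranked = top_k_loci(snp_matrix, labels, k=2)
```

In this toy set only the first locus separates the two breeds, so it ranks first. On real data the same ranking step would run on the loci retained after PCA, with the number of loci kept (48 or 96 in the abstract) passed as `k`.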