Combined use of PCA and random forests to identify high-informative SNP loci: application in sheep population identification

doi:10.3778/j.issn.1002-8331.1704-0408

Computer Engineering and Applications, 2018, Vol. 54, Issue (16): 235-240.


LIU Yueli1, QIN Xizhong1, HE Sangang2, LI Wenrong2, WANG Yue1, JIA Zhenhong1, LIU Mingjun2

1. College of Information Science and Engineering, Xinjiang University, Urumqi 830046, China
2. The Key Laboratory of Livestock Reproduction & Breed Biotechnology of MOA, The Key Laboratory of Animal Biotechnology of Xinjiang, Xinjiang Academy of Biotechnological Center, Urumqi 830046, China

Online: 2018-08-15 Published: 2018-08-09

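Combining PCA and random forests to screen highly informative SNP loci: application to sheep breed identification

LIU Yueli1, QIN Xizhong1, HE Sangang2, LI Wenrong2, WANG Yue1, JIA Zhenhong1, LIU Mingjun2

1. College of Information Science and Engineering, Xinjiang University, Urumqi 830046, China
2. Institute of Biotechnology, Xinjiang Academy of Animal Science; Key Open Laboratory of Herbivorous Livestock Reproduction and Breeding Biotechnology, Ministry of Agriculture; Key Laboratory of Animal Biotechnology of Xinjiang Uygur Autonomous Region, Urumqi 830046, China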

Abstract

Abstract: SNP (Single Nucleotide Polymorphism) data are high-dimensional with small sample sizes, which makes breed identification difficult. To reduce the number of SNP loci needed for breed identification while still meeting accuracy requirements, this paper proposes a new method for screening SNP loci. First, the principal SNP loci are extracted by PCA. A random forest then assesses the importance of these loci using mean decrease in accuracy and mean decrease in the Gini index, and a classification model is trained. Finally, the top 48 and top 96 loci by importance are selected as classification features, and the classification model is used for breed identification. The model is applied to Illumina OvineSNP50 SNP data from 6 sheep breeds. Experiments show that the approach reduces the number of loci from 46013 to 49 or 96 while achieving a classification accuracy above 97%. The proposed method reduces the number of SNP loci needed for breed identification and thereby lowers its cost.

Key words: Principal Component Analysis (PCA), random forests, high-informative SNP loci, breed identification
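The three steps in the abstract (PCA pre-screening, random-forest importance ranking, classification on the top-ranked loci) might be sketched as follows. This is only an illustration under stated assumptions, not the authors' implementation: the genotype data are synthetic, the PCA step is assumed to rank loci by their variance-weighted absolute loadings, and the final ranking uses the Gini decrease alone because the abstract does not say how the two importance measures are combined. All sizes (`n_keep`, `top_k`, and so on) are hypothetical and far smaller than the 46013 loci in the paper.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for SNP genotypes coded as 0/1/2 allele copies.
# The first n_informative loci have breed-specific allele frequencies;
# all other loci have frequency 0.5 in every breed (pure noise).
n_per_breed, n_breeds, n_loci, n_informative = 60, 3, 100, 10
y = np.repeat(np.arange(n_breeds), n_per_breed)
freqs = np.full((n_breeds, n_loci), 0.5)
for j in range(n_informative):
    freqs[:, j] = 0.05
    freqs[j % n_breeds, j] = 0.95
X = rng.binomial(2, freqs[y]).astype(float)

# Hold out a test set first so that locus selection never sees it.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)


def pca_prefilter(X, n_components, n_keep):
    """Assumed pre-screening rule: score each locus by its absolute
    loadings on the leading components, weighted by explained variance."""
    pca = PCA(n_components=n_components).fit(X)
    weights = pca.explained_variance_ratio_[:, None]
    scores = (np.abs(pca.components_) * weights).sum(axis=0)
    return np.argsort(scores)[::-1][:n_keep]


# Step 1: PCA keeps a reduced set of candidate loci.
candidates = pca_prefilter(X_tr, n_components=5, n_keep=30)

# Step 2: a random forest scores the candidates.
forest = RandomForestClassifier(n_estimators=200, random_state=0)
forest.fit(X_tr[:, candidates], y_tr)
gini_decrease = forest.feature_importances_  # mean decrease in Gini impurity
acc_decrease = permutation_importance(       # mean decrease in accuracy
    forest, X_tr[:, candidates], y_tr, n_repeats=5, random_state=0
).importances_mean

# Step 3: keep the top-k loci (ranked here by Gini decrease only)
# and train the final classifier on those loci alone.
top_k = 5
order = np.argsort(gini_decrease)[::-1][:top_k]
selected = candidates[order]

final_model = RandomForestClassifier(n_estimators=200, random_state=0)
final_model.fit(X_tr[:, selected], y_tr)
accuracy = final_model.score(X_te[:, selected], y_te)

print("selected loci:", sorted(selected.tolist()))
print("accuracy decrease of selected loci:", acc_decrease[order])
print("held-out accuracy:", accuracy)
```

Selecting loci on the training split only, as above, keeps the held-out accuracy from being inflated by the selection step. With real data, `X` would be the genotype matrix and `y` the breed labels, and `top_k` would be set to the panel sizes reported in the abstract.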

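Abstract: To address the difficulty that the SNP (Single Nucleotide Polymorphism) data used in breed identification are high-dimensional with small sample sizes, this paper studies how to identify breeds correctly using a small number of highly informative SNP loci and proposes a new SNP locus screening method. PCA is first used to extract the principal SNP loci. A random forest then evaluates the importance of these loci according to mean decrease in accuracy and decrease in the Gini index, and a classification model is trained. Finally, the loci ranked in the top 48 and top 96 by importance are selected as classification features, and a classification model is built for breed identification. The model is applied to Illumina OvineSNP50 SNP data from 6 sheep breeds. Experiments show that 49 and 96 highly informative loci, respectively, can be screened from 46013 loci for breed identification, with identification accuracy above 97%. The method reduces the number of SNP loci used for breed identification and lowers its cost.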

Key words: Principal Component Analysis (PCA), random forests, highly informative SNP loci, breed identification

LIU Yueli, QIN Xizhong, HE Sangang, LI Wenrong, WANG Yue, JIA Zhenhong, LIU Mingjun. Combined use of PCA and random forests to identify high-informative SNP loci: application in sheep population identification[J]. Computer Engineering and Applications, 2018, 54(16): 235-240.

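LIU Yueli, QIN Xizhong, HE Sangang, LI Wenrong, WANG Yue, JIA Zhenhong, LIU Mingjun. Combining PCA and random forests to screen highly informative SNP loci: application to sheep breed identification[J]. Computer Engineering and Applications, 2018, 54(16): 235-240.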

