Computer Engineering and Applications ›› 2020, Vol. 56 ›› Issue (17): 136-141. DOI: 10.3778/j.issn.1002-8331.1908-0337

• Pattern Recognition and Artificial Intelligence •

XLC-Stacking Method for Disease Diagnosis Based on XGBoost Feature Selection

YUE Peng, HOU Lingyan, YANG Dali, TONG Qiang   

  1. Open Computer System Laboratory, Beijing Information Science and Technology University, Beijing 100101, China
  • Online: 2020-09-01 Published: 2020-08-31

Abstract:

To address feature redundancy in medical disease data, the XGBoost feature selection method is used to measure feature importance, remove redundant features, and select the best features for classification. To address low recognition accuracy, the Stacking method is used to integrate XGBoost, LightGBM, and other heterogeneous classifiers, and the better-performing CatBoost classifier is added to the set of heterogeneous classifiers to improve the accuracy of the ensemble. To avoid overfitting, the class probabilities output by the base classifiers are used as the input to the high-level classifier. Experimental results show that the proposed XLC-Stacking method based on XGBoost feature selection improves substantially on current mainstream classification algorithms, as well as on the XGBoost algorithm alone and the Stacking method alone. Its recognition accuracy and F1-Score reach 97.73% and 98.21%, respectively, making it better suited to disease diagnosis.

Key words: disease diagnosis, feature selection, XGBoost, CatBoost, Stacking