Computer Engineering and Applications, 2021, Vol. 57, Issue (9): 247-254. DOI: 10.3778/j.issn.1002-8331.2002-0082

• Engineering and Applications •

WKAG: Fraud Detection Method for Imbalanced Medical Insurance Data

WU Wenlong, ZHOU Xi, WANG Yi, WANG Baoquan   

  1. Xinjiang Technical Institute of Physics & Chemistry, Chinese Academy of Sciences, Urumqi 830011, China
  2. University of Chinese Academy of Sciences, Beijing 100049, China
  3. Xinjiang Laboratory of Minority Speech and Language Information Processing, Urumqi 830011, China
  • Online: 2021-05-01 Published: 2021-04-29

Abstract:

Medical insurance fraud detection is of urgent practical importance. Current work relies mainly on machine learning methods and faces two major problems: (1) the data are severely imbalanced, with fraud samples making up an extremely small proportion of medical insurance records, which degrades detection performance; (2) feature selection and construction depend too heavily on domain business knowledge, which makes it difficult to guarantee that the features are effective. To address these problems, this paper proposes WKAG, a fraud detection method for imbalanced medical insurance data. The Wasserstein Generative Adversarial Network-Kernel Density Estimation (WGAN-KDE) method is used to mitigate the class imbalance, an Auto-Encoder is used to extract deep hidden features from the data, and a Gradient Boosted Decision Tree (GBDT) is used to detect medical insurance fraud. The method is validated on multiple public datasets and on a real medical insurance business dataset. The results show that WKAG can serve as an effective detection method for medical insurance fraud.

Key words: generative adversarial network, imbalanced dataset, auto-encoder feature representation, medical insurance fraud detection, ensemble learning
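
The abstract does not say how WGAN-KDE combines the generative network with kernel density estimation, so the following is only a minimal sketch of the rebalancing idea, not the authors' method. It uses plain Gaussian KDE sampling as a stand-in for the generator: a synthetic fraud row is drawn by picking a real fraud row at random and adding Gaussian noise whose scale is the kernel bandwidth. The function names `kde_oversample` and `rebalance`, the label convention (1 = fraud), and the normal-reference bandwidth rule are all assumptions made for illustration.

```python
import random
import statistics


def kde_oversample(minority, n_new, seed=0):
    """Draw n_new synthetic rows from a Gaussian KDE fitted to `minority`.

    Sampling from a Gaussian KDE amounts to choosing a stored row uniformly
    at random and perturbing it with Gaussian noise of the kernel bandwidth.
    Requires at least two minority rows so a standard deviation exists.
    """
    rng = random.Random(seed)
    n, d = len(minority), len(minority[0])
    # Per-feature bandwidth from the normal-reference rule of thumb
    # (an assumption; the paper's bandwidth choice is not stated).
    bandwidths = []
    for j in range(d):
        sigma = statistics.stdev([row[j] for row in minority])
        bandwidths.append(1.06 * sigma * n ** (-1 / 5))
    synthetic = []
    for _ in range(n_new):
        centre = rng.choice(minority)
        synthetic.append([rng.gauss(centre[j], bandwidths[j]) for j in range(d)])
    return synthetic


def rebalance(X, y, seed=0):
    """Append synthetic fraud rows (label 1) until both classes are equal."""
    minority = [x for x, label in zip(X, y) if label == 1]
    n_majority = len(y) - len(minority)
    extra = kde_oversample(minority, max(0, n_majority - len(minority)), seed)
    return list(X) + extra, list(y) + [1] * len(extra)


if __name__ == "__main__":
    # Toy data: 95 normal claims and 5 fraudulent ones, two features each.
    rng = random.Random(1)
    X = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(95)]
    X += [[rng.gauss(3, 0.5), rng.gauss(3, 0.5)] for _ in range(5)]
    y = [0] * 95 + [1] * 5
    X_bal, y_bal = rebalance(X, y)
    print(y_bal.count(0), y_bal.count(1))
```

In the pipeline the abstract describes, the rebalanced rows would next pass through an auto-encoder for feature extraction and then a GBDT classifier; those stages are omitted here because the abstract gives no architectural details for them.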