Computer Engineering and Applications ›› 2021, Vol. 57 ›› Issue (10): 233-240. DOI: 10.3778/j.issn.1002-8331.2002-0212

One-Class Classification Method for High-Dimensional Mixed and Unbalanced Credit Score Data

ZHANG Dongmei, Mairidan Wushouer, Gulanbaier Tuerhong   

  1. College of Information Science and Engineering, Xinjiang University, Urumqi 830046, China
  • Online: 2021-05-15 Published: 2021-05-10

面向高维混合不平衡信贷数据的单类分类方法

张东梅,买日旦·吾守尔,古兰拜尔·吐尔洪   

  1. 新疆大学 信息科学与工程学院,乌鲁木齐 830046

Abstract:

To accurately predict “bad” loan applicants in high-dimensional, mixed, and unbalanced credit score data, this paper proposes a one-class K-Nearest Neighbor (KNN) algorithm with preprocessing based on Principal Component Analysis of Mixed Data (PCAmix), optimizing both the dimension-reduction preprocessing and the classifier itself. Because traditional Principal Component Analysis (PCA) cannot handle qualitative variables directly, PCAmix is used to reduce the dimensionality of the data. To avoid the poor performance of binary classifiers on unbalanced data, the method combines one-class classification with an average nearest-neighbor distance calculation and trains the model on the majority class only. The Bootstrap method is then used to find the decision boundary that maximizes the separation of positive and negative samples, so that a customer’s default risk can be predicted accurately. Experiments on the German and Default credit score datasets from the UCI repository show that the proposed algorithm achieves good classification performance on high-dimensional, mixed, and unbalanced credit data.
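The one-class idea summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes the records have already been reduced to numeric component scores (the PCAmix step is omitted), it scores a sample by its mean Euclidean distance to its k nearest majority-class neighbors, and it sets the decision threshold as a bootstrap-averaged quantile of the training scores. The quantile rule, the function names, and all parameter values are assumptions made for illustration; the abstract does not specify how the Bootstrap step selects the boundary.

```python
import math
import random


def mean_knn_distance(x, train, k, skip_self=False):
    """Average Euclidean distance from x to its k nearest training points."""
    dists = sorted(math.dist(x, t) for t in train)
    if skip_self:
        dists = dists[1:]  # drop the zero distance from a training point to itself
    return sum(dists[:k]) / k


def bootstrap_threshold(train, k, quantile=0.95, n_boot=200, seed=0):
    """Assumed rule: average a quantile of the training scores over bootstrap resamples."""
    rng = random.Random(seed)
    scores = [mean_knn_distance(x, train, k, skip_self=True) for x in train]
    estimates = []
    for _ in range(n_boot):
        sample = sorted(rng.choices(scores, k=len(scores)))  # resample with replacement
        idx = min(len(sample) - 1, int(quantile * len(sample)))
        estimates.append(sample[idx])
    return sum(estimates) / n_boot


def predict(x, train, k, threshold):
    """Return True when x lies outside the majority class (flagged as "bad")."""
    return mean_knn_distance(x, train, k) > threshold


if __name__ == "__main__":
    # Hypothetical majority-class ("good") applicants in a 2-D component space.
    rng = random.Random(1)
    good = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(200)]
    thr = bootstrap_threshold(good, k=5)
    print(predict((0.1, -0.2), good, 5, thr))  # a point near the majority cluster
    print(predict((8.0, 8.0), good, 5, thr))   # a point far from the majority cluster
```

Only majority-class samples enter training, so the minority class never has to be modeled; this is what lets a one-class approach sidestep the class imbalance that degrades binary classifiers.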

Key words: credit score, one-class classification, imbalanced data, high-dimensional mixed data, Principal Component Analysis of Mixed Data (PCAmix)

摘要:

为实现对高维混合、不平衡信贷数据中的不良贷款者的准确预测,从降维预处理和分类算法两方面进行优化,提出一种基于混合数据主成分分析(Principal Component Analysis of Mixed Data,PCAmix)预处理的单类K近邻(K-Nearest Neighbor,KNN)计算均值算法。针对传统的主成分分析(Principal Component Analysis,PCA)不能直接处理定性变量的问题,使用PCAmix降维预处理数据,为规避不平衡数据在二分类模型中性能较差的缺点,采用单类分类和K近邻算法邻居计算的思想,仅采用多数类训练模型。利用Bootstrap方法找到最佳的决策边界,使得正负样本最大限度地分离,最终准确预测客户的违约风险。采用UCI数据库中的German和Default个人信用评分数据集进行验证,实验结果表明该算法在处理高维混合、不平衡的信贷数据上具有较好的分类效果。

关键词: 信用评分, 单类分类, 不平衡数据, 高维混合数据, 混合数据主成分分析