Computer Engineering and Applications (计算机工程与应用) ›› 2019, Vol. 55 ›› Issue (14): 1-7. DOI: 10.3778/j.issn.1002-8331.1905-0068

• Hot Topics and Reviews •

Identification Method of “Professional Whistleblower” Based on Class Imbalance Problem
(面向类不平衡问题的“职业举报人”识别方法)

YI Chengqi (易成岐)1, HUANG Qianqian (黄倩倩)1, WANG Congyu (王从余)2, ZHANG Hecan (张何灿)3, JIN Xiaokun (靳晓锟)4, WANG Jiandong (王建冬)1

  1. Department of Big Data Development, State Information Center, Beijing 100045, China
  2. Department of Psychology, Tsinghua University, Beijing 100084, China
  3. School of Software and Microelectronics, Peking University, Beijing 102600, China
  4. School of Mathematical Sciences, Peking University, Beijing 100871, China
  • Publication date: 2019-07-15 Release date: 2019-07-11

Abstract: “Professional whistleblowers” increasingly operate in organized groups, at scale, with growing professionalism, and at ever younger ages, a problem that has perplexed market regulators for many years. Government departments mostly identify them by manual review, which wastes considerable human resources. This paper applies bootstrapping, a data resampling technique, together with text, time, and whistleblower-attribute features to identify “professional whistleblowers” accurately while addressing the over-fitting caused by class-imbalanced data. Experimental results show that bootstrapping resampling yields higher identification accuracy than oversampling and undersampling, and that optimizing the feature set with correlation-based feature selection (CFS) combined with the BestFirst search strategy achieves higher computational efficiency while preserving accuracy. The method is validated on real-world data from the national 12358 price regulation platform, and the complaint and reporting behaviors of “professional whistleblowers” and normal consumers are compared and analyzed.

Key words: professional whistleblower, class imbalance, feature selection, data driven, 12358 price regulation platform
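
The abstract names bootstrapping resampling as the remedy for class imbalance but does not spell out the procedure. As a rough illustration only, the sketch below draws the same number of records from each class with replacement, so that a rare class is no longer swamped by the common one. The function name `bootstrap_balanced`, the `per_class` parameter, and the toy data are hypothetical and are not taken from the paper.

```python
import random
from collections import Counter, defaultdict


def bootstrap_balanced(records, labels, per_class, seed=0):
    """Draw `per_class` records with replacement from every class.

    Hypothetical helper for illustration; the paper's own resampling
    procedure may differ.
    """
    rng = random.Random(seed)

    # Group the records by their class label.
    by_class = defaultdict(list)
    for record, label in zip(records, labels):
        by_class[label].append(record)

    sample_records, sample_labels = [], []
    for label in sorted(by_class):
        pool = by_class[label]
        # Sampling with replacement is what makes this a bootstrap draw:
        # a small class can be drawn from more times than it has members.
        drawn = rng.choices(pool, k=per_class)
        sample_records.extend(drawn)
        sample_labels.extend([label] * per_class)
    return sample_records, sample_labels


# Toy data (assumed): 5 records of the rare class 1, 95 of class 0.
records = [f"complaint-{i}" for i in range(100)]
labels = [1 if i < 5 else 0 for i in range(100)]

balanced_records, balanced_labels = bootstrap_balanced(
    records, labels, per_class=50, seed=42
)
class_counts = Counter(balanced_labels)
```

After the call, `class_counts` holds 50 draws per class. Repeating the draw with different seeds would give multiple resampled training sets, which is one common way such samples are used, though whether the paper does so is not stated here.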