Computer Engineering and Applications ›› 2020, Vol. 56 ›› Issue (17): 121-128. DOI: 10.3778/j.issn.1002-8331.1906-0149

• Network, Communication and Security •

Semi-supervised Learning Optimization Algorithm for Communication Spam Text Recognition

QIU Ningjia, SHEN Zhuorui, WANG Hui, WANG Peng

  1. School of Computer Science and Technology, Changchun University of Science and Technology, Changchun 130022, China
  • Online: 2020-09-01 Published: 2020-08-31

Abstract:

Random down-sampling is often used to improve classifier performance on unbalanced communication text, but the randomly drawn samples can yield biased estimates. To solve this problem, a Down-Sampling algorithm based on Negative Selection Density Clustering (NSDC-DS) is proposed. Firstly, the self-set anomaly detection mechanism of the negative selection algorithm is used to improve traditional clustering: the sample center points serve as detectors, the samples to be clustered serve as the self set, and the two are matched for anomalies. Then the negative selection density clustering algorithm is used to evaluate the similarity of samples and improve the traditional down-sampling method, and the down-sampled communication samples are checked for spam with the NBSVM classifier. Finally, PCA is used to evaluate the information content of samples, and an improved PCA-SGD algorithm is proposed to tune the model parameters and complete the semi-supervised recognition of communication spam text. To verify the superiority of the improved algorithm, multiple data sets, including unbalanced communication text, are used to compare negative selection density clustering, NSDC-DS and PCA-SGD with traditional models. Experimental results show that the improved model not only recognizes communication spam text well, but also converges quickly and stably.
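The down-sampling idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' NSDC-DS algorithm, whose details are not given on this page. As a stand-in, it assumes a simple distance-threshold match between a sample and a "detector" center, and a per-cluster quota proportional to cluster size. The names `leader_clusters`, `cluster_down_sample`, `radius` and `target` are hypothetical.

```python
import math
import random


def leader_clusters(samples, radius):
    """Group samples around 'detector' centers by distance matching.

    Assumption: a sample matches a center when its Euclidean distance is
    at most `radius`; a sample that matches no center becomes a new center.
    """
    centers, clusters = [], []
    for x in samples:
        for i, c in enumerate(centers):
            if math.dist(x, c) <= radius:
                clusters[i].append(x)
                break
        else:
            centers.append(x)
            clusters.append([x])
    return clusters


def cluster_down_sample(majority, target, radius, seed=0):
    """Keep roughly `target` majority-class samples, drawing from every cluster."""
    rng = random.Random(seed)
    kept = []
    for members in leader_clusters(majority, radius):
        # Quota proportional to cluster size, with at least one sample per
        # cluster, so sparse regions are not dropped as they can be under
        # purely random down-sampling.
        quota = max(1, round(target * len(members) / len(majority)))
        kept.extend(rng.sample(members, min(quota, len(members))))
    return kept


# Toy majority class: one dense group, one small group, one isolated point.
majority = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1),
            (5.0, 5.0), (5.1, 5.0), (9.0, 0.0)]
sampled = cluster_down_sample(majority, target=3, radius=1.0)
```

Because every cluster contributes at least one sample, the result can be slightly larger than `target`. The isolated point is always retained here, whereas random down-sampling would often discard it.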

Key words: unbalanced data, spam text recognition, negative selection density clustering, Down-Sampling algorithm based on Negative Selection Density Clustering (NSDC-DS), Stochastic Gradient Descent based on Principal Component Analysis (PCA-SGD) algorithm