Computer Engineering and Applications, 2023, Vol. 59, Issue (13): 129-138. DOI: 10.3778/j.issn.1002-8331.2203-0301

• Pattern Recognition and Artificial Intelligence •

Research on Internet Sensitive Speeches Recognition Combining Features of Characters and Words

YAN Shangyi (闫尚义), WANG Jingya (王靖亚), ZHU Shaowu (朱少武), CUI Yumeng (崔雨萌), TAO Zhizhong (陶知众)

  1. School of Information Network Security, People’s Public Security University of China, Beijing 100045, China
  • Online: 2023-07-01 Published: 2023-07-01

Abstract: Sensitive speech on the Internet differs markedly from ordinary speech: to evade filtering rules, it tends to be semantically obscure, rich in polysemous words, and highly irregular in form. To identify sensitive speech efficiently and classify it accurately, this paper improves the text convolutional neural network in light of the characteristics of sensitive speech and the shortcomings of existing models. Combining the strengths of the ALBERT (A Lite BERT) dynamic character-level encoding model, the text convolutional neural network, the multi-head self-attention mechanism, and the gating mechanism, it proposes ALBERT-CCMHSAG, a dual-channel classification model that fuses character and word features. The model extracts and fuses the character-level and word-level semantic information, local key features, and contextual semantics of the text to improve the classification of sensitive speech. ALBERT-CCMHSAG achieves the best results among the compared models on the sensitive speech dataset, the noisy sensitive speech dataset, and the small-sample sensitive speech dataset, demonstrating that it recognizes and classifies sensitive speech more effectively, copes with noisy data, adapts to insufficient training data, and is more robust. The model also outperforms the comparison models on a hotel review dataset, indicating that it is likely to perform well on other corpora.

Key words: sensitive speech recognition, character features, word features, multi-head self-attention mechanism, gating mechanism
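
The abstract names a gating mechanism for fusing the character-level and word-level channels but does not give its formula. As a rough illustration only, the sketch below shows one common form of gated fusion, in which a sigmoid gate computed from both feature vectors decides, per dimension, how much of each channel to keep. The function name `gated_fusion`, the parameters `weights` and `bias`, and the fusion rule itself are assumptions made for this example, not the ALBERT-CCMHSAG definition.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def gated_fusion(char_vec, word_vec, weights, bias):
    """Fuse two equal-length feature vectors with an element-wise gate.

    Hypothetical rule (an assumption, not taken from the paper):
        gate[i]  = sigmoid(weights[i] . concat(char_vec, word_vec) + bias[i])
        fused[i] = gate[i] * char_vec[i] + (1 - gate[i]) * word_vec[i]
    """
    if len(char_vec) != len(word_vec):
        raise ValueError("both channels must have the same dimension")
    concat = list(char_vec) + list(word_vec)
    fused = []
    for i in range(len(char_vec)):
        # Gate pre-activation: one learned row of weights per output dimension.
        z = sum(w * x for w, x in zip(weights[i], concat)) + bias[i]
        g = sigmoid(z)
        fused.append(g * char_vec[i] + (1.0 - g) * word_vec[i])
    return fused


if __name__ == "__main__":
    char_features = [1.0, 3.0]   # stand-in for character-channel output
    word_features = [3.0, 1.0]   # stand-in for word-channel output
    zero_weights = [[0.0] * 4 for _ in range(2)]
    # With zero weights and zero bias every gate is 0.5 (a plain average).
    print(gated_fusion(char_features, word_features, zero_weights, [0.0, 0.0]))
```

With zero weights and zero bias the gate is 0.5 in every dimension, so the output is the average of the two channels. A large positive bias pushes the gate toward 1 and the output toward the character channel. In a trained model the weights would be learned so that the gate varies per input.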