Research on Internet Sensitive Speech Recognition Combining Features of Characters and Words

doi:10.3778/j.issn.1002-8331.2203-0301

Abstract

Abstract: Sensitive speech on the Internet differs markedly from ordinary speech. To evade filtering rules, it tends to be semantically obscure, highly irregular in form, and prone to polysemy. To identify sensitive speech on the Internet efficiently and classify it accurately, this work improves the text convolutional neural network in light of the characteristics of sensitive speech and the shortcomings of existing models. Combining the strengths of the ALBERT (A Lite BERT) dynamic character-level encoding model, the text convolutional neural network, the multi-head self-attention mechanism, and the gating mechanism, it proposes ALBERT-CCMHSAG, a dual-channel classification model that fuses character and word features. The model extracts and fuses the character-level and word-level semantic information, local key features, and contextual semantics of the text to improve the classification of sensitive speech. ALBERT-CCMHSAG achieves the best results among the compared models on the sensitive speech dataset, the noisy sensitive speech dataset, and the small-sample sensitive speech dataset, which indicates that it recognizes and classifies sensitive speech more effectively and is more robust to noisy data and to insufficient training data. The model also outperforms the comparison models on a hotel review dataset, suggesting that it is likely to perform well on other corpora.

Key words: sensitive speech recognition, character features, word features, multi-head self-attention mechanism, gating mechanism
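
The abstract names a gating mechanism for fusing the character-level and word-level channels but does not give its formula. As a rough illustration only, the sketch below shows one common form of element-wise gated fusion. The function name `gated_fusion`, the gate equation, the dimensions, and the random inputs are all assumptions made for this example and are not taken from the paper.

```python
import math
import random


def sigmoid(x):
    """Logistic function mapping a real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(char_feat, word_feat, weights, bias):
    """Blend two equal-length feature vectors with an element-wise gate.

    Assumed (hypothetical) formulation:
        gate  = sigmoid([char_feat; word_feat] . W + b)
        fused = gate * char_feat + (1 - gate) * word_feat
    """
    joint = char_feat + word_feat  # list concatenation, length 2 * dim
    fused = []
    for j in range(len(char_feat)):
        # Pre-activation of gate j: column j of W applied to the joint vector.
        z = sum(x * weights[i][j] for i, x in enumerate(joint)) + bias[j]
        g = sigmoid(z)
        fused.append(g * char_feat[j] + (1.0 - g) * word_feat[j])
    return fused


# Toy inputs standing in for the outputs of the two channels.
random.seed(0)
dim = 4
char_feat = [random.gauss(0, 1) for _ in range(dim)]
word_feat = [random.gauss(0, 1) for _ in range(dim)]
weights = [[random.gauss(0, 0.1) for _ in range(dim)] for _ in range(2 * dim)]
bias = [0.0] * dim

fused = gated_fusion(char_feat, word_feat, weights, bias)
```

Because each gate value lies strictly between 0 and 1, every fused component falls between the corresponding character-level and word-level values. In a trained model the gate parameters would be learned, so the network could decide per dimension which channel to trust.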

YAN Shangyi, WANG Jingya, ZHU Shaowu, CUI Yumeng, TAO Zhizhong. Research on Internet Sensitive Speeches Recognition Combining Features of Characters and Words[J]. Computer Engineering and Applications, 2023, 59(13): 129-138.
