Computer Engineering and Applications ›› 2023, Vol. 59 ›› Issue (15): 141-150. DOI: 10.3778/j.issn.1002-8331.2204-0147

• Pattern Recognition and Artificial Intelligence •

Detection Method of Illegal Advertising Words Based on Named Entity Recognition

YUAN Zibo, YAO Tao, YAN Lianshan   

  1. School of Information Science and Technology, Southwest Jiaotong University, Chengdu 611756, China
  2. Yantai Research Institute of New Generation Information Technology, Southwest Jiaotong University, Yantai, Shandong 264001, China
  • Online: 2023-08-01  Published: 2023-08-01

Abstract: As advertising becomes ubiquitous, detecting illegal advertisements is increasingly important. Existing detection methods can only judge whether an advertisement is illegal; they cannot extract the offending words or link them to the regulations they violate. To address this, an illegal advertising word detection method based on named entity recognition is proposed, in which keywords that violate advertising regulations are recognized as a special type of entity. First, the BERT (bidirectional encoder representations from transformers) pre-trained model extracts dynamic character vectors as the model input. Then, a BiLSTM (bidirectional long short-term memory) network captures the contextual information of the advertising text and outputs a score vector for each character. Finally, a CRF (conditional random field) constrains the label sequence to obtain the optimal labels. Experimental results show that the method detects illegal advertising effectively: it not only extracts the illegal words but also identifies the specific legal provisions they violate.

Key words: advertising, illegal keywords, deep learning, named entity recognition (NER), conditional random field (CRF)
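
The final step described in the abstract, constraining the label sequence and reading off the illegal words together with the provision they violate, can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes a BIO tagging scheme, a made-up entity type `ABS` (absolute terms such as "最佳", "best"), a placeholder provision description in `CLAUSES`, and hand-written per-character scores standing in for the BiLSTM output. The only transition rule modeled is the hard BIO constraint; a trained CRF would also learn soft transition scores.

```python
from typing import Dict, List, Tuple

NEG_INF = float("-inf")
LABELS = ["O", "B-ABS", "I-ABS"]  # hypothetical label set, not from the paper
CLAUSES = {"ABS": "placeholder: provision on absolute terms"}  # hypothetical mapping


def allowed(prev: str, curr: str) -> bool:
    """BIO constraint: I-X may only follow B-X or I-X."""
    if curr.startswith("I-"):
        return prev in ("B-" + curr[2:], "I-" + curr[2:])
    return True


def viterbi(emissions: List[Dict[str, float]], labels: List[str]) -> List[str]:
    """Return the highest-scoring label path that satisfies the BIO constraint."""
    # A sequence may not start with an I- label.
    score = {l: (NEG_INF if l.startswith("I-") else emissions[0][l]) for l in labels}
    back: List[Dict[str, str]] = []
    for emit in emissions[1:]:
        new_score: Dict[str, float] = {}
        pointers: Dict[str, str] = {}
        for curr in labels:
            best_score, best_prev = max(
                (score[p], p) for p in labels if allowed(p, curr)
            )
            new_score[curr] = best_score + emit[curr]
            pointers[curr] = best_prev
        score = new_score
        back.append(pointers)
    # Backtrack from the best final label.
    path = [max(labels, key=lambda l: score[l])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


def extract_spans(chars: str, path: List[str]) -> List[Tuple[str, str]]:
    """Collect (text, entity_type) pairs from a valid BIO label path."""
    spans, start, kind = [], None, None
    for i, label in enumerate(path + ["O"]):  # sentinel closes a trailing span
        if start is not None and not label.startswith("I-"):
            spans.append((chars[start:i], kind))
            start, kind = None, None
        if label.startswith("B-"):
            start, kind = i, label[2:]
    return spans


# Illustrative scores for the four characters of "最佳面霜" ("best face cream").
text = "最佳面霜"
emissions = [
    {"O": 1.0, "B-ABS": 0.9, "I-ABS": 0.1},
    {"O": 0.2, "B-ABS": 0.3, "I-ABS": 2.0},
    {"O": 1.5, "B-ABS": 0.1, "I-ABS": 0.2},
    {"O": 1.5, "B-ABS": 0.1, "I-ABS": 0.1},
]
path = viterbi(emissions, LABELS)
spans = extract_spans(text, path)
for word, kind in spans:
    print(word, "->", CLAUSES[kind])
```

Picking the best label for each character independently would give `O, I-ABS, O, O` here, which is not a valid BIO sequence. The constrained decoder instead returns `B-ABS, I-ABS, O, O`, so the span "最佳" is extracted and its entity type identifies the associated provision.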