Building foul words text corpus

Abstract

As a form of informal language, foul words are widespread in Web reviews and harm Web civility. This paper describes the characteristics and hazards of foul words and, focusing on Web foul words, designs a corpus collection method that combines automatic machine identification with a small amount of manual annotation. More than 6 000 sentences were collected from a large volume of Web reviews to build a Foul Words Corpus. An automatic foul-word identification experiment was then carried out with SVM and Maximum Entropy classifiers. The results show that recall and accuracy both exceed 97%.

Key words: foul words, corpus, text classification, automatic identification

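Abstract (Chinese version): As an informal language phenomenon, foul language is now ubiquitous in online reviews and has affected online civility. This paper describes the characteristics, definition, and hazards of foul-language text, studies and analyzes online foul-language text, and designs a corpus collection method that combines automatic machine discrimination with a small amount of manual annotation. Drawing on a massive volume of real review text, it constructs a fairly large, high-quality foul-language corpus, with an initial collection of more than 6 000 foul-language sentences. Automatic classification experiments were then run with SVM and maximum entropy classifiers using unigram, bigram, and trigram features. The results show that both classifiers reach accuracy and recall above 97%.

Key words (Chinese version): foul-language text, corpus, text classification, automatic identification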
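
The abstract mentions unigram, bigram, and trigram features and reports accuracy and recall, but gives no implementation details. The following is a minimal sketch of what such feature extraction and scoring could look like. It assumes character-level n-grams (the abstract does not say whether the units are characters or words), and the function names `char_ngrams`, `ngram_features`, and `accuracy_recall` are hypothetical, not taken from the paper.

```python
from collections import Counter


def char_ngrams(text, n):
    """Return all contiguous character n-grams of `text` (assumed unit: characters)."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def ngram_features(text, max_n=3):
    """Count unigram, bigram and trigram features for one sentence.

    Keys are prefixed with the n-gram order so that, for example,
    the unigram "a" and a bigram never collide.
    """
    features = Counter()
    for n in range(1, max_n + 1):
        for gram in char_ngrams(text, n):
            features[f"{n}:{gram}"] += 1
    return features


def accuracy_recall(gold, predicted, positive=1):
    """Compute accuracy over all items and recall for the positive class."""
    pairs = list(zip(gold, predicted))
    correct = sum(1 for g, p in pairs if g == p)
    true_pos = sum(1 for g, p in pairs if g == positive and p == positive)
    false_neg = sum(1 for g, p in pairs if g == positive and p != positive)
    accuracy = correct / len(pairs) if pairs else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    return accuracy, recall
```

Feature dictionaries of this kind could be fed to any linear classifier, such as an SVM or a maximum entropy model. The classifier itself is left out here because the abstract does not specify its configuration.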

ZHU Xiaoxu, QIAN Peide. Building foul words text corpus[J]. Computer Engineering and Applications, 2014, 50(11): 126-129.

朱晓旭，钱培德. 脏话文本语料库建设[J]. 计算机工程与应用, 2014, 50(11): 126-129.
