计算机工程与应用 ›› 2014, Vol. 50 ›› Issue (11): 126-129.

• 数据库、数据挖掘、机器学习 •

脏话文本语料库建设

朱晓旭,钱培德   

  1. 苏州大学 计算机科学与技术学院,江苏 苏州 215006
  • 出版日期:2014-06-01 发布日期:2015-04-08

Building foul words text corpus

ZHU Xiaoxu, QIAN Peide   

  1. School of Computer Science & Technology, Soochow University, Suzhou, Jiangsu 215006, China
  • Online:2014-06-01 Published:2015-04-08

摘要: 脏话作为一种非正规的语言现象,在网络评价中已经无处不在,对网络文明造成了影响。描述了脏话文本的特点、定义及其危害,并对网络脏话文本进行了研究与分析,设计了一个机器自动判别与少量人工标注相结合的脏话语料采集方法,借助海量的真实评价文本,构造了一个较大规模的高质量的脏话语料库,初步采集了6 000多句脏话语料。然后利用一元、二元和三元特征,通过SVM与最大熵分类器对脏话的自动分类进行了实验,结果表明,两种分类器的准确率和查全率都达到97%以上。

关键词: 脏话文本, 语料库, 文本分类, 自动识别

Abstract: As an informal language phenomenon, foul words are widespread in Web reviews and have a negative impact on online civility. This paper describes the characteristics, definition, and hazards of foul-word text and analyzes foul-word text on the Web. It designs a corpus collection method that combines automatic machine identification with a small amount of manual annotation. Drawing on a massive body of real review text, a fairly large, high-quality Foul Words Corpus is constructed, with over 6,000 foul-word sentences collected initially. Automatic classification experiments are then run with SVM and maximum entropy classifiers on unigram, bigram, and trigram features. The results show that both classifiers achieve accuracy and recall above 97%.

Key words: foul words, corpus, text classification, automatic identification
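
The abstract mentions unigram, bigram, and trigram features as input to the SVM and maximum entropy classifiers. A minimal sketch of such feature extraction is given below. It assumes the text has already been split into tokens, and the function name `ngram_features` and the sample tokens are hypothetical. The abstract does not say whether the n-grams are built over characters or words, so this illustrates the general idea rather than the authors' implementation.

```python
from collections import Counter


def ngram_features(tokens, max_n=3):
    """Count every n-gram of length 1..max_n in a token sequence.

    Hypothetical helper: the source does not specify the tokenization
    or the feature weighting, so plain counts are used here.
    """
    features = Counter()
    for n in range(1, max_n + 1):
        # Slide a window of width n across the token list.
        for i in range(len(tokens) - n + 1):
            features[tuple(tokens[i:i + n])] += 1
    return features


# Placeholder tokens standing in for one segmented review sentence.
tokens = ["this", "phone", "is", "this", "phone"]
feats = ngram_features(tokens)
```

The resulting counts could then be mapped to a sparse vector and passed to any SVM or maximum entropy (multinomial logistic regression) implementation. Which toolkit the authors used is not stated in the abstract.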