Computer Engineering and Applications ›› 2014, Vol. 50 ›› Issue (11): 126-129.
• Databases, Data Mining, and Machine Learning •
朱晓旭,钱培德
ZHU Xiaoxu, QIAN Peide
Abstract: Foul language, an informal linguistic phenomenon, is now pervasive in online reviews and degrades online civility. This paper defines foul-language text, describes its characteristics and harms, and analyzes foul-language text found on the Web. It proposes a corpus collection method that combines automatic machine classification with a small amount of manual annotation. Drawing on a large volume of real review text, the authors build a relatively large, high-quality foul-language corpus, with an initial collection of more than 6,000 foul-language sentences. Automatic classification experiments using unigram, bigram, and trigram features with SVM and maximum entropy classifiers show that both classifiers exceed 97% in accuracy and recall.
Key words: foul words, corpus, text classification, automatic identification
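
The abstract refers to unigram, bigram, and trigram features and reports accuracy and recall. As a rough illustration only, the Python sketch below shows one way such features could be counted and such scores computed. The function names, the space-joined feature keys, and the assumption that each sentence is already segmented into tokens are hypothetical choices made for this example, not details taken from the paper, and no classifier is trained here.

```python
from collections import Counter


def ngram_features(tokens, max_n=3):
    """Count every n-gram of length 1..max_n in a pre-segmented sentence.

    Assumption: `tokens` is a list of strings produced by some word
    segmenter; the paper does not say how its text was tokenized.
    """
    feats = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            # Join the n tokens into one feature key, e.g. "a b c".
            feats[" ".join(tokens[i:i + n])] += 1
    return feats


def accuracy_and_recall(gold, predicted):
    """Return (accuracy, recall) for boolean labels, True = foul sentence."""
    correct = sum(g == p for g, p in zip(gold, predicted))
    true_pos = sum(g and p for g, p in zip(gold, predicted))
    gold_pos = sum(gold)
    accuracy = correct / len(gold)
    recall = true_pos / gold_pos if gold_pos else 0.0
    return accuracy, recall


if __name__ == "__main__":
    # Toy input for illustration; not data from the corpus.
    print(ngram_features(["a", "b", "c"]))
    print(accuracy_and_recall([True, True, False, False],
                              [True, False, False, False]))
```

Feature counts of this kind would typically be converted to vectors before being passed to an SVM or maximum entropy classifier; that step is omitted because the abstract gives no details about it.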
朱晓旭,钱培德. 脏话文本语料库建设[J]. 计算机工程与应用, 2014, 50(11): 126-129.
ZHU Xiaoxu, QIAN Peide. Building foul words text corpus[J]. Computer Engineering and Applications, 2014, 50(11): 126-129.
Link to this article: http://cea.ceaj.org/CN/
http://cea.ceaj.org/CN/Y2014/V50/I11/126