Computer Engineering and Applications ›› 2014, Vol. 50 ›› Issue (10): 136-140.

• Databases, Data Mining, Machine Learning •

Text Classification Method Based on LASVM-NC and TF.RF

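LI Yujian, LI Yuxiong, LENG Qiangkui

  1. College of Computer Science, Beijing University of Technology, Beijing 100124, China
  • Publication date: 2014-05-15  Release date: 2014-05-14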

Text classification method based on non-convex online support vector machines and term frequency relevance frequency product

LI Yujian, LI Yuxiong, LENG Qiangkui   

  1. College of Computer Science, Beijing University of Technology, Beijing 100124, China
  • Online:2014-05-15 Published:2014-05-14

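Abstract: The non-convex online support vector machine (LASVM-NC) is strongly resistant to noise and fast to train, while the term frequency relevance frequency product (tf.rf) is a highly adaptive text feature with very good classification performance. Combining the two yields a new text classification method, LASVM-NC+tf.rf. Experimental results show that this method performs best among the combinations of LASVM-NC with a variety of other features. Compared with SVM+tf.rf, the resulting classifier generalizes better and has a sparser model representation, is more robust on noisy data, and trains much faster on large-scale data.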

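Keywords: non-convex online support vector machine, support vector machine, feature term, term frequency, relevance frequency, text classification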

Abstract: The non-convex online support vector machine (LASVM-NC) offers strong noise tolerance and fast training, while the term frequency relevance frequency product (tf.rf) is a highly adaptive text feature with very good classification performance. LASVM-NC+tf.rf is proposed as a new text classification method that combines the non-convex online support vector machine with the term frequency relevance frequency product. Experiments show that the method outperforms LASVM-NC combined with a variety of other features. Moreover, compared with SVM+tf.rf, it produces classifiers with better generalization and sparser model representation, is more robust on noisy data, and trains much faster on large-scale datasets.

Key words: non-convex online support vector machine, support vector machines, term weighting, term frequency, relevance frequency, text classification
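
The abstract names tf.rf but does not state its formula. As a minimal sketch, the code below assumes the commonly cited definition of relevance frequency, rf = log2(2 + a / max(1, c)), where a and c are the numbers of positive-category and negative-category training documents containing the term; that definition is an assumption here and is not taken from this page. The function names `relevance_frequency` and `tf_rf_vector` are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def relevance_frequency(term, pos_docs, neg_docs):
    """rf under the assumed definition log2(2 + a / max(1, c)).

    pos_docs and neg_docs are lists of token sets for the positive
    and negative category, respectively.
    """
    a = sum(1 for doc in pos_docs if term in doc)  # positive docs containing the term
    c = sum(1 for doc in neg_docs if term in doc)  # negative docs containing the term
    return math.log2(2 + a / max(1, c))


def tf_rf_vector(doc_tokens, pos_docs, neg_docs):
    """Weight each term of one document by raw term frequency times rf."""
    tf = Counter(doc_tokens)
    return {
        term: count * relevance_frequency(term, pos_docs, neg_docs)
        for term, count in tf.items()
    }


# Toy corpus (made up for illustration only).
pos_docs = [{"goal", "match"}, {"goal", "team"}]
neg_docs = [{"stock", "market"}, {"team", "market"}]
weights = tf_rf_vector(["goal", "goal", "market"], pos_docs, neg_docs)
```

In this toy corpus "goal" appears only in positive documents, so its rf is higher than that of "market", which appears only in negative ones. The resulting sparse weight dictionaries could then be fed to any SVM-style classifier; the LASVM-NC training step itself is not sketched here.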