Computer Engineering and Applications ›› 2012, Vol. 48 ›› Issue (10): 164-166.

• Database, Signal and Information Processing •

Method based on least document frequency for text feature extraction

SU Dan, ZHOU Mingquan, WANG Xuesong, REN Yuzhi

  1. College of Information Science and Technology, Beijing Normal University, Beijing 100875, China
  • Online: 2012-04-01 Published: 2012-04-11

Abstract: Conventional improved methods of text feature extraction quantify feature distribution information inadequately, which substantially degrades their classification performance. To address this problem, an improved feature extraction method based on Least Document Frequency (LDF), the TF-LDF algorithm, is proposed. The algorithm uses the least document frequency to quantify a feature's concentration among classes and its dispersion within a class, and so reflects the feature's distribution more accurately. Experimental comparison shows that the TF-LDF algorithm achieves better classification results.

Key words: feature extraction, feature distribution, concentration among classes, dispersion within class, Term Frequency-Least Document Frequency (TF-LDF)
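
The abstract does not give the TF-LDF formula, so the following is not the authors' algorithm. It is only a minimal sketch of the kind of per-class document statistics that notions such as concentration among classes and dispersion within a class are usually built on. The function name `class_document_frequencies` and the toy corpus are assumptions introduced here for illustration.

```python
from collections import defaultdict


def class_document_frequencies(docs, term):
    """Return {class_label: fraction of that class's documents containing term}.

    docs is a list of (class_label, tokens) pairs. This is a hypothetical
    helper, not part of the TF-LDF method described in the abstract.
    """
    totals = defaultdict(int)  # number of documents per class
    hits = defaultdict(int)    # documents per class that contain the term
    for label, tokens in docs:
        totals[label] += 1
        if term in tokens:
            hits[label] += 1
    return {label: hits[label] / totals[label] for label in totals}


# Toy labelled corpus (invented data, for illustration only).
docs = [
    ("sports", ["match", "goal", "team"]),
    ("sports", ["team", "coach", "goal"]),
    ("finance", ["market", "stock", "team"]),
    ("finance", ["stock", "bond", "rate"]),
]

goal_df = class_document_frequencies(docs, "goal")
team_df = class_document_frequencies(docs, "team")
```

In this toy corpus, "goal" occurs in every sports document and in no finance document, so it is concentrated in one class and spread evenly inside it. "team" also appears in half of the finance documents, which makes it a weaker discriminator. How TF-LDF turns such counts into a weight is defined in the paper itself.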