Computer Engineering and Applications ›› 2016, Vol. 52 ›› Issue (24): 95-101.

• Big Data and Cloud Computing •

Text feature evaluation and selection algorithm based on multi-indicator fusion

邱云飞,刘世兴,王  璐   

  1. School of Software, Liaoning Technical University, Huludao, Liaoning 125105, China
  • Online: 2016-12-15 Published: 2016-12-20

Evaluation and selection algorithm based on text features multiple indicator fusion

QIU Yunfei, LIU Shixing, WANG Lu   

  1. School of Software, Liaoning Technical University, Huludao, Liaoning 125105, China
  • Online:2016-12-15 Published:2016-12-20

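Abstract: In text classification, several indicators are used to judge the quality of a feature, chiefly the relevance between the feature and the class, the redundancy of the feature itself, and the sparsity of the feature in the corpus. Because feature quality directly affects classification performance, it is necessary to consider all of these factors together. Feature selection is usually carried out in three steps that measure relevance, redundancy, and sparsity separately, and the weighting and filtering in each step consume a large amount of time, so this stepwise evaluation is hard to apply when both real-time performance and accuracy are required. To address this, a coordinate model is first built in which relevance, redundancy, and sparsity are mapped into a coordinate system; text features are then weighted and filtered according to the angle between the vector from the origin to each point in the space and a coordinate plane or coordinate axis. This merges the multiple evaluation indicators into a single indicator, greatly reduces the time spent on repeated weighting and filtering, and improves the efficiency of feature selection. Experiments on the Fudan University Chinese text corpus and the NetEase text corpus show that, compared with the stepwise method, the text feature evaluation and selection algorithm based on multi-indicator fusion selects word and n-gram features faster and more accurately, and the effectiveness of the selected features for classification is verified with a support vector machine (SVM).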

Key words: relevance, redundancy, sparsity, coordinate system

Abstract: In text classification, several indicators are used to evaluate the quality of a feature, mainly the relevance between features and classes, the redundancy of features, and the sparsity of features in the corpus. Because text features directly affect the classification result, all of these factors need to be considered. Feature selection usually measures features in three separate steps (relevance, redundancy, and sparsity), and every step costs a lot of time. This step-by-step method is difficult to use when both real-time performance and accuracy are required. To solve these problems, this paper first builds a coordinate model and maps relevance, redundancy, and sparsity into the coordinate system. It then weights and selects text features according to the angle between the vector from the origin to each point and a coordinate plane or axis, thereby integrating multiple indicators into one, saving time and improving the efficiency of feature selection. Experimental results on the Fudan University Chinese corpus and the Wangyi (NetEase) text corpus show that, compared with the step-by-step method, the evaluation and selection algorithm based on multi-indicator fusion selects word features and n-gram features more quickly and more accurately, and the effectiveness of the selected features has been validated with an SVM.

Key words: relevance, redundancy, sparsity, coordinate system
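
The abstract describes the coordinate model only in outline, so the following Python sketch is one possible reading of it, not the authors' implementation. It assumes each feature already has relevance, redundancy, and sparsity values scaled to [0, 1], treats them as a point in three-dimensional space, and scores the feature by the cosine of the angle between the vector from the origin to that point and the relevance axis. The choice of axis, the scaling, the function names `fusion_score` and `select_features`, and the toy values are all assumptions made for illustration.

```python
import math


def fusion_score(relevance, redundancy, sparsity):
    """Cosine of the angle between (relevance, redundancy, sparsity) and the relevance axis.

    Assumption: a good feature has high relevance and low redundancy and
    sparsity, so a small angle to the relevance axis (cosine near 1) is better.
    """
    norm = math.sqrt(relevance ** 2 + redundancy ** 2 + sparsity ** 2)
    if norm == 0.0:
        return 0.0  # the origin has no direction; treat it as the worst score
    return relevance / norm


def select_features(indicators, k):
    """Return the k feature names with the highest fused score.

    indicators maps a feature name to a (relevance, redundancy, sparsity) tuple.
    """
    ranked = sorted(
        indicators,
        key=lambda name: fusion_score(*indicators[name]),
        reverse=True,
    )
    return ranked[:k]


# Made-up indicator values for three hypothetical features (not from the paper).
toy = {
    "goal": (0.9, 0.1, 0.2),
    "the": (0.1, 0.8, 0.1),
    "zygote": (0.5, 0.1, 0.9),
}
print(select_features(toy, 2))
```

Because all three indicators collapse into one score, a single sort replaces three separate weighting and filtering passes, which is the efficiency argument the abstract makes. How the paper computes each indicator and which plane or axis it measures the angle against are not stated in the abstract.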