Evaluation and selection algorithm based on text features multiple indicator fusion

Abstract

In text classification, several indicators are used to judge the quality of a feature, chiefly the relevance of the feature to the classes, the redundancy of the feature, and the sparsity of the feature in the corpus. Because feature quality directly affects classification performance, all of these factors need to be considered together. Feature selection is usually carried out in three separate steps that measure relevance, redundancy, and sparsity in turn, and the weighting and filtering in each step is time-consuming, so this step-by-step approach is hard to apply when both real-time performance and accuracy are required. To address this, the paper first builds a coordinate model and maps relevance, redundancy, and sparsity into a coordinate system. It then weights and selects text features according to the angle between the vector from the origin to each feature's point and a coordinate plane or coordinate axis. This merges the separate indicators into a single indicator, which saves the time spent on repeated weighting and filtering and improves the efficiency of feature selection. Experimental results on the Fudan University Chinese text corpus and the Wangyi (NetEase) text corpus show that, compared with the step-by-step method, the proposed multi-indicator fusion algorithm selects word features and n-gram features faster and more accurately; the effectiveness of the selected features for classification is verified with a Support Vector Machine (SVM).

Key words: relevance, redundancy, sparsity, coordinate system
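
The fusion idea in the abstract can be illustrated with a minimal sketch. The abstract does not state which angle is used or how the indicators are scaled, so everything below is an assumption made for illustration: each indicator is taken to be normalized to [0, 1], a feature is ranked by the angle between its vector (relevance, redundancy, sparsity) and the relevance axis, and a smaller angle is treated as better. The names `fusion_angle` and `select_features` and the sample values are hypothetical and are not taken from the paper.

```python
import math


def fusion_angle(relevance, redundancy, sparsity):
    """Angle in radians between (relevance, redundancy, sparsity) and the relevance axis.

    Assumption: all three indicators are non-negative and already normalized,
    so the angle lies in [0, pi/2].
    """
    norm = math.sqrt(relevance ** 2 + redundancy ** 2 + sparsity ** 2)
    if norm == 0.0:
        # A feature with no signal on any axis is treated as the worst case.
        return math.pi / 2
    cos_theta = relevance / norm
    # Clamp to guard against floating-point drift outside [-1, 1].
    return math.acos(max(-1.0, min(1.0, cos_theta)))


def select_features(indicators, k):
    """Return the k feature names with the smallest fused angle.

    indicators: dict mapping feature name -> (relevance, redundancy, sparsity).
    """
    ranked = sorted(indicators, key=lambda name: fusion_angle(*indicators[name]))
    return ranked[:k]


# Hypothetical indicator values, for illustration only.
candidates = {
    "market": (0.9, 0.1, 0.2),
    "the": (0.1, 0.8, 0.05),
    "goalkeeper": (0.7, 0.2, 0.6),
}
selected = select_features(candidates, 2)
print(selected)
```

A single sort on one fused score replaces three separate weighting and filtering passes, which is the time saving the abstract describes. The choice of reference axis here is only one possible reading of "coordinate plane or axis".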

QIU Yunfei, LIU Shixing, WANG Lu. Evaluation and selection algorithm based on text features multiple indicator fusion[J]. Computer Engineering and Applications, 2016, 52(24): 95-101.

邱云飞，刘世兴，王璐. 基于多指标融合的文本特征评价及选择算法[J]. 计算机工程与应用, 2016, 52(24): 95-101.

