Evaluation and selection algorithm based on text features multiple indicator fusion

Computer Engineering and Applications, 2016, Vol. 52, Issue (24): 95-101.

QIU Yunfei, LIU Shixing, WANG Lu

School of Software, Liaoning Technical University, Huludao, Liaoning 125105, China

Online: 2016-12-15  Published: 2016-12-20

Abstract

In text classification, several indicators are used to evaluate feature quality, chiefly the relevance between a feature and the classes, the redundancy of the feature itself, and the sparsity of the feature in the corpus. Because feature quality directly affects classification performance, all of these factors need to be considered together. Feature selection is usually carried out in three separate steps that measure relevance, redundancy, and sparsity in turn, and the weighting and filtering in each step are time-consuming, so this step-by-step approach is hard to apply when both real-time response and accuracy are required. To address this problem, this paper first builds a coordinate model that maps relevance, redundancy, and sparsity into a coordinate system. Text features are then weighted and filtered according to the angle between the vector from the origin to each feature's point and a coordinate plane or coordinate axis. This merges the multiple evaluation indicators into a single one, which greatly reduces the time spent on repeated weighting and filtering and improves the efficiency of feature selection. Experiments on the Fudan University Chinese text corpus and the NetEase (Wangyi) text corpus show that, compared with the step-by-step method, the proposed multi-indicator fusion algorithm selects word and n-gram features faster and more accurately, and the effectiveness of the selected features for classification is verified with a support vector machine (SVM).

Key words: relevance, redundancy, sparsity, coordinate system
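
The abstract does not give the exact mapping or angle formula, so the following is only a minimal sketch of the general idea, under stated assumptions: each feature is assumed to already have relevance, redundancy, and sparsity scores normalized to [0, 1], and features are ranked by the angle between their vector and the relevance axis (a smaller angle meaning high relevance with low redundancy and sparsity). The function names, the choice of axis, and the toy scores are all hypothetical and are not taken from the paper.

```python
import math


def angle_to_relevance_axis(relevance, redundancy, sparsity):
    """Angle in radians between (relevance, redundancy, sparsity) and the relevance axis.

    Assumption: all three scores are non-negative and on comparable scales.
    """
    norm = math.sqrt(relevance ** 2 + redundancy ** 2 + sparsity ** 2)
    if norm == 0.0:
        # Zero vector has no direction; treat it as the worst case (assumption).
        return math.pi / 2
    # Clamp to guard against tiny floating-point overshoot before acos.
    cosine = max(-1.0, min(1.0, relevance / norm))
    return math.acos(cosine)


def select_features(scores, k):
    """Return the k feature names whose vectors lie closest to the relevance axis.

    scores maps a feature name to a (relevance, redundancy, sparsity) tuple.
    """
    ranked = sorted(scores, key=lambda name: angle_to_relevance_axis(*scores[name]))
    return ranked[:k]


# Hypothetical toy scores, purely for illustration.
scores = {
    "football": (0.9, 0.1, 0.2),  # relevant, not redundant, not sparse
    "the": (0.1, 0.8, 0.1),       # barely relevant, highly redundant
    "zygote": (0.6, 0.2, 0.9),    # fairly relevant but very sparse
}
print(select_features(scores, 2))
```

A single sort on one fused score replaces three separate weighting and filtering passes, which is the efficiency argument the abstract makes; whether the paper measures the angle against an axis or a coordinate plane for a given indicator is not specified here.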

QIU Yunfei, LIU Shixing, WANG Lu. Evaluation and selection algorithm based on text features multiple indicator fusion[J]. Computer Engineering and Applications, 2016, 52(24): 95-101.
