Improvement and application to weighting terms based on text classification

Computer Engineering and Applications, 2008, Vol. 44, Issue (5): 187-189.

XIONG Zhong-yang (熊忠阳), LI Gang (黎刚), CHEN Xiao-li (陈小莉), CHEN Wei (陈伟)

College of Computer, Chongqing University, Chongqing 400030, China

Received: 2007-05-28 Revised: 2007-07-25 Online: 2008-02-11 Published: 2008-02-11
Contact: XIONG Zhong-yang

Abstract

Abstract: The formal representation of text has long been a fundamental problem in information retrieval. The tf.idf (term frequency, inverse document frequency) representation in the Vector Space Model is a text representation that is widely used in this field and gives good results. Differences in the proportion with which a term is distributed across the text collection are one of the important factors that determine how well the term expresses the content of a text. However, the calculation of IDF does not consider how a feature term is distributed among classes, nor does it consider that a feature term distributed relatively evenly within a class should be given a higher weight than one distributed unevenly. The improved TFIDF is used to select feature terms, and classifiers trained with the KNN classification algorithm and a genetic algorithm are used to verify its effectiveness. Experiments show that the improved strategy is feasible.

Key words: text representation, Vector Space Model, feature selection, TFIDF
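The limitation described in the abstract, that IDF ignores how a term is distributed among classes, can be illustrated with a minimal sketch of the standard tf.idf weighting. This is not the authors' improved formula, which this page does not give. The toy corpus, the class labels, and the function names `idf`, `tf_idf`, and `class_spread` are all hypothetical, and the variant assumed here is tf = count / document length with idf = log(N / df).

```python
import math
from collections import Counter


def idf(docs):
    """Standard IDF: log(N / df), where df counts documents containing the term."""
    n_docs = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return {term: math.log(n_docs / count) for term, count in df.items()}


def tf_idf(doc, idf_values):
    """Weight each term of one document by relative frequency times IDF."""
    tf = Counter(doc)
    return {term: (count / len(doc)) * idf_values[term] for term, count in tf.items()}


def class_spread(term, docs, labels):
    """Count, per class label, the documents that contain the term."""
    return Counter(label for doc, label in zip(docs, labels) if term in doc)


# Hypothetical toy corpus. Note that the labels are never passed to idf().
docs = [
    ["goal", "match", "team"],
    ["goal", "team", "coach"],
    ["stock", "market", "team"],
    ["stock", "bank", "match"],
]
labels = ["sports", "sports", "finance", "finance"]

idf_values = idf(docs)
weights = [tf_idf(doc, idf_values) for doc in docs]

# "goal" occurs only in sports documents; "match" is split across both classes.
goal_spread = class_spread("goal", docs, labels)
match_spread = class_spread("match", docs, labels)

print("idf(goal) =", idf_values["goal"], "classes:", dict(goal_spread))
print("idf(match) =", idf_values["match"], "classes:", dict(match_spread))
```

Both terms appear in two of the four documents, so they receive the same IDF even though "goal" is concentrated in one class and "match" is not. A class-aware weighting of the kind the abstract argues for would need the labels as an additional input.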

XIONG Zhong-yang, LI Gang, CHEN Xiao-li, CHEN Wei. Improvement and application to weighting terms based on text classification[J]. Computer Engineering and Applications, 2008, 44(5): 187-189.
