Computer Engineering and Applications, 2019, Vol. 55, Issue (2): 104-109. DOI: 10.3778/j.issn.1002-8331.1805-0071

• Pattern Recognition and Artificial Intelligence •

Improved Approach to TF-IDF Algorithm in Text Classification

YE Xuemei1,2, MAO Xuemin1,2, XIA Jinchun1,2, WANG Bo1,2

  1. School of Management, Hefei University of Technology, Hefei 230009, China
  2. Key Laboratory of Process Optimization and Intelligent Decision-Making (MoE), Hefei University of Technology, Hefei 230009, China
  • Online: 2019-01-15 Published: 2019-01-15

Abstract: As China's Internet environment has developed, many information-rich new words have come into common use. The traditional term-weighting algorithm TF-IDF (Term Frequency-Inverse Document Frequency) considers mainly two factors, TF and IDF, and does not take advantage of new words as an emerging word class. To address the influence of new words among the feature terms on classification results, this paper proposes an improved TF-IDF algorithm for text classification based on Internet new words. New words are identified during text preprocessing, and the feature-weight formula is modified in the vector space model representation. Experimental results show that adding new-word discovery to text preprocessing reduces the feature dimension, and that the improved feature-weighting algorithm improves text classification results.

Key words: new words, Term Frequency-Inverse Document Frequency (TF-IDF), vector space model, text classification
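
For orientation, the baseline weighting that the abstract refers to can be sketched as below. This is a minimal illustration of textbook TF-IDF (relative term frequency multiplied by the log of the inverse document frequency), not the formula proposed in the paper, which the abstract does not state. The `new_words` set and the `boost` multiplier are hypothetical placeholders that only mark where a new-word adjustment could enter the weight; the function name `tf_idf` and the sample tokens are likewise assumptions.

```python
import math
from collections import Counter


def tf_idf(docs, new_words=frozenset(), boost=1.0):
    """Return one {term: weight} dict per tokenized document.

    docs      -- list of token lists (text already segmented)
    new_words -- HYPOTHETICAL set of terms flagged as new words
                 during preprocessing
    boost     -- HYPOTHETICAL multiplier for new-word weights; this is an
                 assumption for illustration, not the paper's formula.
                 boost=1.0 yields plain TF-IDF.
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))

    weights = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        vec = {}
        for term, count in counts.items():
            tf = count / total                  # relative term frequency
            idf = math.log(n_docs / df[term])   # inverse document frequency
            weight = tf * idf
            if term in new_words:
                weight *= boost                 # placeholder adjustment
            vec[term] = weight
        weights.append(vec)
    return weights


if __name__ == "__main__":
    # Toy corpus; "geili" stands in for a term recognized as a new word.
    corpus = [
        ["geili", "phone", "phone"],
        ["phone", "price"],
        ["price", "service"],
    ]
    for vec in tf_idf(corpus, new_words={"geili"}, boost=2.0):
        print(vec)
```

With `boost=1.0` the function reduces to plain TF-IDF, so a term that occurs in every document receives weight zero. Treating a recognized new word as a single token, rather than as several fragments, is also what would shrink the vocabulary in a pipeline of this kind.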