Computer Engineering and Applications ›› 2016, Vol. 52 ›› Issue (13): 95-100.

• Big Data and Cloud Computing •

A short text classification method incorporating BTM topic features

郑  诚1,2,吴文岫1,2,代  宁1,2   

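  1. School of Computer Science and Technology, Anhui University, Hefei 230601
  2. Key Laboratory of Intelligent Computing and Signal Processing, Ministry of Education, Hefei 230601
  • Publication date: 2016-07-01 Release date: 2016-07-15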

Improved short text classification method based on BTM topic features

ZHENG Cheng1,2, WU Wenxiu1,2, DAI Ning1,2   

  1. School of Computer Science and Technology, Anhui University, Hefei 230601, China
  2. Key Laboratory of Intelligent Computing and Signal Processing, Ministry of Education, Hefei 230601, China
  • Online: 2016-07-01 Published: 2016-07-15

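Abstract (from the Chinese): Short texts contain few features, so traditional text classification algorithms perform poorly on them. To address this, a comprehensive feature extraction method for short text classification is proposed that fuses BTM topic features with an improved feature weight calculation. Building on TF-IWF, the method lowers the weight of term frequency and introduces word distribution entropy, yielding a new weighting algorithm. The topic words under each topic of the BTM topic model are used to supplement documents that contain few words, and each document's probability distribution over the topics is taken as a second set of document features. Several groups of classification experiments with the KNN algorithm show that, compared with features computed by traditional methods such as TF-IWF, the method improves F1 by about 10%, which verifies its effectiveness.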

Keywords: short text, weight calculation, TF-IWF method, topic model

Abstract: Short texts typically have little content, a loose format, varied sentence lengths and relatively complex sentence structure, so traditional classification algorithms perform poorly on them. This paper presents a comprehensive feature extraction method for short text classification that fuses BTM topic features with an improved weight calculation. The method has two parts. First, starting from TF-IWF, it reduces the weight given to term frequency and introduces a word distribution entropy term, which yields a new weighting algorithm. Second, it uses the topic words of the BTM topic model to supplement documents that contain few words, and takes each document's probability distribution over the topics as an additional set of document features. Experimental results with a KNN classifier indicate that the method improves F1 by around 10% compared with the original TF-IWF method.

Key words: short text, weight calculation, Term Frequency-Inverse Word Frequency (TF-IWF), topic model
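
The pipeline outlined in the abstract (supplement short documents with topic words, weight terms, append the document-topic distribution, classify with KNN) can be illustrated with a minimal Python sketch. The abstract does not give the paper's formulas, so several pieces here are assumptions rather than the authors' definitions: the damping log(1 + tf), the entropy factor 1 / (1 + H) computed over class labels, cosine similarity in KNN, and every function and parameter name. The document-topic distributions and topic words are hand-written placeholders standing in for the output of a trained BTM (Biterm Topic Model); no BTM is trained here.

```python
import math
from collections import Counter, defaultdict

def expand_doc(doc, topic_dist, topic_words, min_len=3, n_words=3):
    """Pad a very short document with top words of its most probable topic."""
    if len(doc) >= min_len:
        return list(doc)
    best = max(range(len(topic_dist)), key=lambda k: topic_dist[k])
    return list(doc) + list(topic_words[best][:n_words])

def iwf_scores(docs):
    """Inverse word frequency: log(total token count / count of the term)."""
    counts = Counter(tok for doc in docs for tok in doc)
    total = sum(counts.values())
    return {t: math.log(total / c) for t, c in counts.items()}

def class_entropy(docs, labels):
    """Entropy (in nats) of each term's occurrences across class labels."""
    per_term = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        for tok in doc:
            per_term[tok][label] += 1
    entropy = {}
    for term, by_class in per_term.items():
        n = sum(by_class.values())
        entropy[term] = -sum((c / n) * math.log(c / n) for c in by_class.values())
    return entropy

def fused_vector(doc, topic_dist, vocab, iwf, entropy):
    """Term weights (ASSUMED formula) followed by the topic distribution."""
    tf = Counter(doc)
    weights = []
    for term in vocab:
        damped_tf = math.log(1 + tf[term])             # assumed TF damping
        spread = 1.0 / (1.0 + entropy.get(term, 0.0))  # assumed entropy factor
        weights.append(damped_tf * iwf.get(term, 0.0) * spread)
    return weights + list(topic_dist)

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def knn_predict(x, train_X, train_y, k=3):
    """Majority vote among the k training vectors most similar to x."""
    ranked = sorted(zip(train_X, train_y),
                    key=lambda pair: cosine(x, pair[0]), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Toy data, made up for illustration only.
train_docs = [["apple", "banana", "fruit"], ["fruit", "juice", "apple"],
              ["goal", "match", "team"], ["team", "coach", "goal"]]
train_labels = ["food", "food", "sport", "sport"]
train_topics = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]  # placeholder, not BTM output
topic_words = [["fruit", "apple", "juice"], ["team", "goal", "match"]]  # placeholder

iwf = iwf_scores(train_docs)
entropy = class_entropy(train_docs, train_labels)
vocab = sorted(iwf)
train_X = [fused_vector(d, p, vocab, iwf, entropy)
           for d, p in zip(train_docs, train_topics)]

query_topics = [0.85, 0.15]  # placeholder topic distribution for the query
query = expand_doc(["apple"], query_topics, topic_words)
query_x = fused_vector(query, query_topics, vocab, iwf, entropy)
print(knn_predict(query_x, train_X, train_labels, k=3))
```

In this sketch the fused vector has one entry per vocabulary term plus one entry per topic, so a one-word query still carries topic-level evidence even when it shares few terms with the training documents. In practice the two parts would likely need rescaling relative to each other, which the abstract does not discuss.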