Computer Engineering and Applications ›› 2017, Vol. 53 ›› Issue (23): 18-23. DOI: 10.3778/j.issn.1002-8331.1709-0162

• Hot Topics and Reviews •

Supervised multi-label text classification based on hierarchical Dirichlet process

XIE Chenyang1, LU Yanxin2

  1. Computer School, Wuhan University, Wuhan 430000, China
  2. State Key Laboratory of Software Engineering, Wuhan University, Wuhan 430000, China
  • Online: 2017-12-01 Published: 2017-12-14

Abstract: With the development of the Internet and information technology, large volumes of multi-label text data are being generated rapidly. In text classification, how to determine an appropriate number of categories and how to identify a document's labels more accurately are pressing problems. The HL_LDA model proposed in this paper determines the number of categories automatically through the hierarchical Dirichlet process, and improves classification quality by exploiting the hierarchical information among the labels of multi-label documents. Experimental results show that, on different types of data sets, HL_LDA clearly outperforms classical LDA-based and SVM-based methods on evaluation metrics such as precision and F1-score.

Key words: multi-label, text classification, label dependence, hierarchical Dirichlet process
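
The abstract states that the number of categories is determined automatically through a hierarchical Dirichlet process. The HL_LDA model itself is not reproduced here. As a rough illustration of why a Dirichlet process prior does not need the number of clusters fixed in advance, the following minimal sketch simulates the Chinese restaurant process, a standard sequential view of a single Dirichlet process. The function name, the concentration value `alpha`, and the item count are assumptions chosen for illustration and do not come from the paper.

```python
import random


def chinese_restaurant_process(n_items, alpha, seed=0):
    """Sequentially assign n_items to clusters under a CRP prior.

    Hypothetical helper for illustration only; it is not the HL_LDA model.
    alpha > 0 is the concentration parameter: larger values tend to
    produce more clusters.
    """
    rng = random.Random(seed)
    counts = []       # counts[k] = number of items currently in cluster k
    assignments = []  # assignments[i] = cluster index of item i
    for _ in range(n_items):
        # An existing cluster k is chosen with weight counts[k];
        # a brand-new cluster is opened with weight alpha.
        weights = counts + [alpha]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(counts):
            counts.append(1)   # open a new cluster
        else:
            counts[k] += 1     # join an existing cluster
        assignments.append(k)
    return assignments, counts


# The number of clusters is an outcome of the process, not an input.
assignments, counts = chinese_restaurant_process(1000, alpha=2.0)
print("clusters used:", len(counts))
```

In this sketch the cluster count grows with the data and with `alpha`. A hierarchical Dirichlet process additionally shares clusters across groups such as documents, which this single-level simulation does not attempt to show.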