Computer Engineering and Applications, 2017, Vol. 53, Issue (9): 97-102. DOI: 10.3778/j.issn.1002-8331.1601-0405


Two-stage hierarchical text classification model based on neighbor-assistant strategy

GU Ping, WANG Chunyuan   

  1. College of Computer Science, Chongqing University, Chongqing 400044, China
  Online: 2017-05-01   Published: 2017-05-15


古  平,王春元   

  1. 重庆大学 计算机学院,重庆 400044

Abstract: The traditional two-stage hierarchical text classification model (THTC model) is an effective approach to large-scale hierarchical text classification, but its classification accuracy remains low. To alleviate this problem, a new two-stage hierarchical text classification model based on a neighbor-assistant strategy (THTC-NA model) is proposed. The THTC-NA model consists of two stages: search and classification. In the search stage, a flat strategy ranks all leaf categories by their relevance to a given document and takes the k most related ones as candidate categories, so that a large-scale hierarchy is pruned into a much smaller but focused one. In the classification stage, the classification result of each candidate is computed by combining it with the results of the candidate's ancestor categories and sibling categories. Finally, the results of the search stage and the classification stage are fused to determine the target category for the document. Experiments on the Newsgroups-18828 data set show that the THTC-NA model improves classification accuracy considerably over the THTC model.

Key words: two-stage, hierarchical text classification, neighbor-assistant strategy, class hierarchy
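
The two stages described in the abstract can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the abstract does not give the scoring functions or the fusion rule, so the toy hierarchy, the function names, the way ancestor and sibling scores are combined (ancestor support added, sibling competition subtracted), and the weights w_anc, w_sib and alpha are all hypothetical.

```python
from typing import Dict, List, Optional

# Hypothetical toy hierarchy: child -> parent (None marks the root).
PARENT: Dict[str, Optional[str]] = {
    "root": None,
    "sci": "root", "rec": "root",
    "sci.med": "sci", "sci.space": "sci",
    "rec.autos": "rec", "rec.sport": "rec",
}

# Hypothetical scores for one document.
# FLAT_SCORES: flat classifier over leaf categories (search stage).
# LOCAL_SCORES: per-node classifier scores (classification stage).
FLAT_SCORES = {"sci.med": 0.40, "sci.space": 0.35,
               "rec.autos": 0.45, "rec.sport": 0.10}
LOCAL_SCORES = {"sci": 0.8, "rec": 0.2, "sci.med": 0.6,
                "sci.space": 0.5, "rec.autos": 0.5, "rec.sport": 0.1}


def ancestors(node: str) -> List[str]:
    """Ancestors of a node, nearest first, excluding the root."""
    out = []
    p = PARENT[node]
    while p is not None and PARENT[p] is not None:
        out.append(p)
        p = PARENT[p]
    return out


def siblings(node: str) -> List[str]:
    """Other nodes that share the same parent."""
    return [n for n, p in PARENT.items() if p == PARENT[node] and n != node]


def mean(values: List[float]) -> float:
    return sum(values) / len(values) if values else 0.0


def search_stage(flat_scores: Dict[str, float], k: int) -> List[str]:
    """Rank leaf categories by flat score and keep the top k as candidates."""
    return sorted(flat_scores, key=flat_scores.get, reverse=True)[:k]


def classification_stage(cand: str, local: Dict[str, float],
                         w_anc: float = 0.5, w_sib: float = 0.25) -> float:
    """Assumed neighbor-assisted score: own score, plus ancestor support,
    minus sibling competition. The exact rule is not given in the abstract."""
    anc = mean([local[a] for a in ancestors(cand)])
    sib = mean([local[s] for s in siblings(cand)])
    return local[cand] + w_anc * anc - w_sib * sib


def classify(flat_scores: Dict[str, float], local: Dict[str, float],
             k: int = 2, alpha: float = 0.5) -> str:
    """Fuse both stages with an assumed linear weight alpha."""
    fused = {
        c: alpha * flat_scores[c]
        + (1 - alpha) * classification_stage(c, local)
        for c in search_stage(flat_scores, k)
    }
    return max(fused, key=fused.get)


if __name__ == "__main__":
    print(classify(FLAT_SCORES, LOCAL_SCORES))  # prints sci.med
```

In this toy example the flat ranking alone would pick rec.autos, but the strong score of the ancestor sci lifts sci.med above it after fusion, which is the kind of correction the neighbor-assistant strategy is meant to provide.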
