Computer Engineering and Applications, 2021, Vol. 57, Issue (15): 193-199. DOI: 10.3778/j.issn.1002-8331.2004-0352

• Pattern Recognition and Artificial Intelligence •

Multi-label Long Text Classification Algorithm Based on Multi-level Features

WANG Haobin, HU Ping

  1. School of Software, Xi’an Jiaotong University, Xi’an 710000, China
  2. School of Management, Xi’an Jiaotong University, Xi’an 710000, China
  • Publication date: 2021-08-01  Release date: 2021-07-26

Abstract:

Existing multi-label classification algorithms ignore the intrinsic relationships between labels. To address this, the paper recasts multi-label classification as a sequence generation problem, so that the co-occurrence relationships between labels are fully taken into account. Building on the Seq2Seq model, text features are extracted at two levels: the word level and the semantic level. By improving the feature extraction module, the encoder structure, the mixed attention mechanism, and the prediction part of the decoder, a multi-label classification algorithm based on multi-level features and a mixed attention mechanism is proposed. The algorithm is validated on three datasets, Zhihu, RCV1-V2, and AAPD, and compared with existing algorithms. The proposed algorithm outperforms the other algorithms on all three metrics: F1 score, recall, and Hamming loss.

Key words: multi-label classification, multi-level features, mixed attention
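
For readers unfamiliar with the three evaluation metrics named in the abstract, the following is a minimal sketch of how F1, recall, and Hamming loss can be computed for multi-label predictions. It is not the authors' evaluation code: the function name `multilabel_metrics`, the toy data, and the choice of micro-averaging are all assumptions made for illustration, since the abstract does not say which averaging scheme the paper uses.

```python
def multilabel_metrics(true_sets, pred_sets, num_labels):
    """Micro-averaged recall and F1, plus Hamming loss, for label-set predictions.

    Hypothetical helper for illustration; micro-averaging is an assumption.
    """
    tp = fp = fn = 0
    for true, pred in zip(true_sets, pred_sets):
        tp += len(true & pred)   # labels predicted and correct
        fp += len(pred - true)   # labels predicted but wrong
        fn += len(true - pred)   # labels missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    # Hamming loss: fraction of (document, label) slots predicted incorrectly.
    hamming = (fp + fn) / (len(true_sets) * num_labels)
    return {"recall": recall, "f1": f1, "hamming_loss": hamming}


# Hypothetical toy data: 3 documents, 4 possible labels (indices 0..3).
y_true = [{0, 1}, {2}, {1, 3}]
y_pred = [{0, 1}, {2, 3}, {1}]
scores = multilabel_metrics(y_true, y_pred, num_labels=4)
print(scores)
```

Higher F1 and recall are better, while a lower Hamming loss is better, which is the sense in which the abstract reports the proposed algorithm as superior on all three.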