Computer Engineering and Applications ›› 2023, Vol. 59 ›› Issue (13): 92-98. DOI: 10.3778/j.issn.1002-8331.2203-0286

• Pattern Recognition and Artificial Intelligence •

Academic Resource Text Hierarchical Multi-Label Classification

WANG Yue, LI Yawen, LI Ang   

  1. Beijing Key Laboratory of Intelligent Telecommunication Software and Multimedia, School of Computer Science, Beijing University of Posts and Telecommunications, Beijing 100876, China
  2. School of Economics and Management, Beijing University of Posts and Telecommunications, Beijing 100876, China
  • Online: 2023-07-01 Published: 2023-07-01

科技资源文本层次多标签分类方法

王岳,李雅文,李昂   

  1. 北京邮电大学 计算机学院(国家示范性软件学院) 智能通信软件与多媒体北京市重点实验室,北京 100876
  2. 北京邮电大学 经济管理学院,北京 100876

Abstract: Hierarchical multi-label text classification (HMTC) of scientific resources assigns scientific resource texts to a label system with a hierarchical structure. This paper proposes AHMCA (academic resource text hierarchical multi-label classification based on attention), an attention-based hierarchical multi-label classification algorithm for scientific resource texts. An attention layer is constructed by integrating text, keyword, and label-hierarchy features, and is used to improve the HMCN-F (hierarchical multi-label classification network, feed-forward) network so that scientific and technological resource documents are classified level by level into their most relevant categories. Specifically, word2vec and BiLSTM are used to obtain embedding vectors and hidden-state representations of the text, keywords, and hierarchical structure. A hierarchical attention mechanism then captures the correlations among keywords, the label hierarchy, and the text word vectors to increase the weights of important word vectors, producing a hierarchy-level-specific document embedding that replaces the original text embedding in HMCN-F. Experimental results verify the effectiveness of the AHMCA method.
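
The idea of a hierarchy-level-specific document embedding can be illustrated with a minimal sketch. It is not the paper's implementation: the scoring rule (a scaled dot product between each word state and the sum of a keyword vector and a level vector), the function name `level_document_embedding`, and the toy vectors are all assumptions made for illustration. In the method described above, the word states would come from word2vec and BiLSTM rather than being written by hand.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def level_document_embedding(word_states, keyword_vec, level_vec):
    """Attention-pool word states into one document vector for one hierarchy level.

    Assumption (not from the paper): the attention query is the sum of the
    keyword vector and the level vector, scored by scaled dot product.
    """
    d = len(keyword_vec)
    query = [k + l for k, l in zip(keyword_vec, level_vec)]
    scores = [dot(h, query) / math.sqrt(d) for h in word_states]
    weights = softmax(scores)
    # Weighted sum of word states: words matching the query contribute more.
    doc = [sum(w * h[i] for w, h in zip(weights, word_states)) for i in range(d)]
    return weights, doc


# Toy inputs standing in for BiLSTM hidden states (hypothetical values).
word_states = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
keyword_vec = [1.0, 0.0]
level_vecs = {"level_1": [1.0, 0.0], "level_2": [0.0, 2.0]}

# One document embedding per hierarchy level, each with its own word weights.
embeddings = {
    name: level_document_embedding(word_states, keyword_vec, vec)
    for name, vec in level_vecs.items()
}
```

Each level yields a different weighting of the same words, so the classifier for that level receives a document vector that emphasizes the words most related to it. In this toy setup the first word dominates at `level_1` and the second at `level_2`.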

Key words: hierarchical multi-label classification, attention mechanism, BiLSTM, word2vec

摘要: 科技资源文本层次多标签分类(hierarchical multi-label text classification,HMTC)用于将科技资源文本分配到一个具有层级结构的标签体系中。提出基于注意力机制的科技资源文本层次多标签分类算法(academic resource text hierarchical multi-label classification based on attention,AHMCA)。通过整合文本、关键词、层次结构等特征构造注意力机制层,对HMCN-F(hierarchical multi-label classification network-feed-forward)网络进行改进,将科技资源文档逐级分类到最相关的类别中。细节上,主要利用word2vec与BiLSTM来获得文本、关键词、层次结构的嵌入向量和隐向量表示;利用层次注意力机制捕获关键词、标签层次结构与文本词向量之间的关联关系来强化重点词向量的权重,从而生成特定于层级的文档嵌入向量,替代HMCN-F中原始的文本嵌入。实验结果验证了AHMCA方法的有效性。

关键词: 层次多标签分类, 注意力机制, BiLSTM, word2vec