Computer Engineering and Applications, 2023, Vol. 59, Issue (9): 104-111. DOI: 10.3778/j.issn.1002-8331.2207-0440

• Pattern Recognition and Artificial Intelligence •

Research on Text Classification by Fusing Multi-Granularity Information

XIN Miaomiao, MA Li, HU Bofa   

  1. School of Information Engineering, Hebei GEO University, Shijiazhuang 050031, China
  2. Laboratory of Artificial Intelligence and Machine Learning, Hebei GEO University, Shijiazhuang 050031, China
  • Online: 2023-05-01 Published: 2023-05-01

融合多粒度信息的文本分类研究

辛苗苗,马丽,胡博发   

  1. 河北地质大学 信息工程学院,石家庄 050031
  2. 河北地质大学 人工智能与机器学习研究室,石家庄 050031

Abstract: Current research on Chinese text classification mostly models text at a single granularity (character, word, sentence, or document), and therefore often misses the semantic features that other granularities carry. To extract the core content of a text more effectively, a text classification model that fuses multi-granularity information through an attention mechanism is proposed. The model constructs embedding vectors at character, word, and sentence granularity. For the character and word granularities, Word2Vec is used to convert the data into character and word vectors, and a bidirectional long short-term memory (BiLSTM) network extracts their contextual semantic features. For the sentence granularity, the FastText model extracts the features contained in the sentence vectors. Each of the resulting feature vectors is then fed into an attention layer to capture the most important semantic information in the text. Experimental results on three publicly available Chinese datasets show that the model achieves higher classification accuracy than models using any single granularity or any pairwise combination of granularities.

Key words: multi-granularity, information fusion, text classification, attention mechanism

摘要: 目前对中文文本分类的研究主要集中于对字符粒度、词语粒度、句子粒度、篇章粒度等数据信息的单一模式划分,这往往缺少不同粒度下语义所包含的信息特征。为了更加有效提取文本所要表达的核心内容,提出一种基于注意力机制融合多粒度信息的文本分类模型。该模型对字、词和句子粒度方面构造嵌入向量,其中对字和词粒度采用Word2Vec训练模型将数据转换为字向量和词向量,通过双向长短期记忆网络(bi-directional long short-term memory,BiLSTM)获取字和词粒度向量的上下文语义特征,利用FastText模型提取句子向量中包含的特征,将不同种特征向量分别送入到注意力机制层进一步获取文本重要的语义信息。实验结果表明,该模型在三种公开的中文数据集上的分类准确率比单一粒度和两两粒度结合的分类准确率都有所提高。

关键词: 多粒度, 信息融合, 文本分类, 注意力机制
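
The fusion step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the attention scoring function or how the three granularities are combined, so the dot-product scoring against a query vector, the concatenation of the pooled vectors, and all function names below are assumptions. The BiLSTM and FastText outputs are replaced by small hand-written toy vectors, and the query vectors, which would normally be learned, are fixed.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(states, query):
    """Pool a sequence of feature vectors into one vector.

    Each state is scored by its dot product with `query` (an assumed
    scoring function), the scores are normalised with softmax, and the
    states are summed with those weights.
    """
    scores = [sum(h * q for h, q in zip(state, query)) for state in states]
    weights = softmax(scores)
    dim = len(states[0])
    pooled = [sum(w * state[d] for w, state in zip(weights, states))
              for d in range(dim)]
    return pooled, weights


def fuse(char_states, word_states, sent_states, queries):
    """Apply attention to each granularity separately, then concatenate.

    Concatenation is an assumption; the abstract only says the
    granularities are fused.
    """
    fused = []
    for states, query in zip((char_states, word_states, sent_states), queries):
        pooled, _ = attention_pool(states, query)
        fused.extend(pooled)
    return fused


# Toy stand-ins for BiLSTM (character, word) and FastText (sentence) features.
char_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
word_states = [[0.5, 0.5], [2.0, 0.0]]
sent_states = [[0.2, 0.8]]
queries = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # fixed here, learned in practice

print(fuse(char_states, word_states, sent_states, queries))
```

In a full model the fused vector would be passed to a classification layer, and the feature extractors and attention parameters would be trained jointly.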