Research on Text Classification by Fusing Multi-Granularity Information

doi:10.3778/j.issn.1002-8331.2207-0440

Abstract

Abstract: Current research on Chinese text classification mostly represents text at a single granularity (character, word, sentence, or document), which tends to miss the semantic features carried at the other granularities. To extract the core content of a text more effectively, this paper proposes a text classification model that fuses multi-granularity information using an attention mechanism. The model builds embedding vectors at the character, word, and sentence granularities. For the character and word granularities, Word2Vec is used to convert the data into character and word vectors, and a bidirectional long short-term memory (BiLSTM) network captures their contextual semantic features. FastText is used to extract the features contained in the sentence vectors. The different feature vectors are then fed separately into an attention layer to further capture the important semantic information of the text. Experimental results on three public Chinese datasets show that the model achieves higher classification accuracy than models using a single granularity or a pairwise combination of granularities.

Key words: multi-granularity, information fusion, text classification, attention mechanism
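
The fusion step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the attention scoring function or how the three attended vectors are combined, so dot-product scoring against a single query vector and simple concatenation are assumptions made here. The names `attention_pool`, `fuse_granularities`, and `query` are hypothetical, and the toy feature lists stand in for the BiLSTM outputs (character and word granularity) and the FastText features (sentence granularity).

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(features: List[Vector], query: Vector) -> Vector:
    """Return the attention-weighted sum of a sequence of feature vectors.

    Assumption: each score is the dot product of a feature vector with a
    query vector (in a trained model the query would be a learned parameter).
    """
    scores = [sum(f * q for f, q in zip(feat, query)) for feat in features]
    weights = softmax(scores)
    dim = len(features[0])
    return [sum(w * feat[d] for w, feat in zip(weights, features))
            for d in range(dim)]


def fuse_granularities(char_feats: List[Vector],
                       word_feats: List[Vector],
                       sent_feats: List[Vector],
                       query: Vector) -> Vector:
    """Attend over each granularity separately, then concatenate.

    Assumption: concatenation is the fusion operator; the abstract only
    states that each feature set is fed separately into an attention layer.
    """
    pooled = [attention_pool(feats, query)
              for feats in (char_feats, word_feats, sent_feats)]
    return [value for vec in pooled for value in vec]


# Toy stand-ins for the per-granularity feature sequences.
char_feats = [[1.0, 0.0], [0.0, 1.0]]
word_feats = [[2.0, 2.0]]
sent_feats = [[0.0, 4.0], [4.0, 0.0]]
fused = fuse_granularities(char_feats, word_feats, sent_feats, [0.0, 0.0])
print(len(fused))
```

In a full model the fused vector would typically be passed to a classification layer; that step is omitted because the abstract does not describe it.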

XIN Miaomiao, MA Li, HU Bofa. Research on Text Classification by Fusing Multi-Granularity Information[J]. Computer Engineering and Applications, 2023, 59(9): 104-111.
