Computer Engineering and Applications ›› 2024, Vol. 60 ›› Issue (9): 196-202. DOI: 10.3778/j.issn.1002-8331.2212-0357

• Pattern Recognition and Artificial Intelligence •

Chinese Long Text Classification Model Based on BERT Fused with Chinese Input Methods and BLCG

YANG Wentao, LEI Yuqi, LI Xingyue, ZHENG Tiancheng   

  1. School of Integrated Circuits, Huazhong University of Science and Technology, Wuhan 430074, China
    2. Department of Humanities, Zhixing College of Hubei University, Wuhan 430011, China
  • Online: 2024-05-01    Published: 2024-04-29

Abstract: Existing Chinese long text classification models do not take into account Chinese-specific features such as pronunciation and glyph structure, and therefore cannot fully represent Chinese semantic information. Meanwhile, long texts often contain a large amount of information that is either unrelated to the target topic or related to other topics, which leads the classification model to misjudge. To solve these problems, a Chinese long text classification model based on CIMBERT (BERT fused with Chinese input methods) and BLCG (BiLSTM fused with CNN under a gating mechanism) is proposed. Firstly, texts are represented as vectors by a BERT model whose input incorporates Chinese input methods: Pinyin and Wubi, two widely used Chinese character input methods, are applied to enrich the semantic information of Chinese characters. Secondly, BLCG is constructed to extract the overall features of texts, using a bidirectional long short-term memory network (BiLSTM) to capture global features and a convolutional neural network (CNN) to capture local features. The gating mechanism of BLCG dynamically combines the global and local features, correcting the misjudgments caused by text segments that are unrelated to the target topic. Finally, the proposed method is evaluated on the THUCNews dataset and the Sogou corpus. The experimental results show classification accuracies of 97.63% and 95.43% and F1-scores of 97.68% and 95.49%, respectively, indicating that the proposed model outperforms other text classification models.
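To make the input-level fusion concrete, the following is a minimal PyTorch sketch of a BERT-style embedding layer augmented with Pinyin and Wubi embeddings. All identifiers (CIMInputEmbedding, pinyin_ids, wubi_ids), the vocabulary sizes, and the additive fusion are illustrative assumptions; the paper's exact CIMBERT construction may differ.

    import torch
    import torch.nn as nn

    class CIMInputEmbedding(nn.Module):
        # Hypothetical sketch: BERT token + position embeddings augmented
        # with Pinyin (pronunciation) and Wubi (glyph/stroke) embeddings.
        def __init__(self, vocab_size=21128, pinyin_vocab=1500,
                     wubi_vocab=3000, hidden=768, max_len=512):
            super().__init__()
            self.token = nn.Embedding(vocab_size, hidden)
            self.pinyin = nn.Embedding(pinyin_vocab, hidden)   # pronunciation
            self.wubi = nn.Embedding(wubi_vocab, hidden)       # glyph/stroke
            self.position = nn.Embedding(max_len, hidden)
            self.norm = nn.LayerNorm(hidden)
            self.drop = nn.Dropout(0.1)

        def forward(self, token_ids, pinyin_ids, wubi_ids):
            # All id tensors share shape (batch, seq_len): one Pinyin code
            # and one Wubi code per Chinese character.
            pos = torch.arange(token_ids.size(1), device=token_ids.device)
            x = (self.token(token_ids) + self.pinyin(pinyin_ids)
                 + self.wubi(wubi_ids) + self.position(pos))
            return self.drop(self.norm(x))

Because the extra embeddings are summed into the input matrix, in the same way BERT already sums token, position, and segment embeddings, pronunciation and glyph cues participate in every encoder layer rather than being appended after encoding.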

Key words: long text classification, bidirectional encoder representations from Transformers (BERT), convolutional neural network (CNN), long short-term memory (LSTM), gating mechanism
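The dynamic fusion at the heart of BLCG can be sketched in the same spirit: a BiLSTM supplies a global feature vector, multi-kernel convolutions supply a local one, and a sigmoid gate learns how much weight each deserves per example. The kernel sizes, pooling choices, and exact gate form below are assumptions, and num_classes=14 merely matches the commonly used THUCNews category count.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class BLCG(nn.Module):
        # Hypothetical sketch of BiLSTM (global) + CNN (local) features
        # fused by a learned sigmoid gate, as described in the abstract.
        def __init__(self, hidden=768, num_classes=14,
                     kernel_sizes=(2, 3, 4), n_filters=256):
            super().__init__()
            self.bilstm = nn.LSTM(hidden, hidden // 2, batch_first=True,
                                  bidirectional=True)        # global path
            self.convs = nn.ModuleList(
                [nn.Conv1d(hidden, n_filters, k) for k in kernel_sizes])
            self.local_proj = nn.Linear(n_filters * len(kernel_sizes), hidden)
            self.gate = nn.Linear(hidden * 2, hidden)        # fusion gate
            self.classifier = nn.Linear(hidden, num_classes)

        def forward(self, x):               # x: (batch, seq_len, hidden)
            g, _ = self.bilstm(x)           # (batch, seq_len, hidden)
            g = g.mean(dim=1)               # pooled global feature
            c = x.transpose(1, 2)           # (batch, hidden, seq_len)
            c = torch.cat([F.relu(conv(c)).max(dim=2).values
                           for conv in self.convs], dim=1)
            l = self.local_proj(c)          # pooled local feature
            z = torch.sigmoid(self.gate(torch.cat([g, l], dim=1)))
            fused = z * g + (1 - z) * l     # dynamic global/local fusion
            return self.classifier(fused)

Since the gate z is computed from both feature vectors, local patterns from off-topic passages can be down-weighted in favor of the global context, which is the failure mode the abstract describes for models without such gating.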
