Computer Engineering and Applications ›› 2024, Vol. 60 ›› Issue (9): 196-202. DOI: 10.3778/j.issn.1002-8331.2212-0357

• Pattern Recognition and Artificial Intelligence •

Chinese Long Text Classification Model Based on BERT Fused with Chinese Input Methods and BLCG

YANG Wentao, LEI Yuqi, LI Xingyue, ZHENG Tiancheng   

  1. School of Integrated Circuits, Huazhong University of Science and Technology, Wuhan 430074, China
    2. Department of Humanities, Zhixing College of Hubei University, Wuhan 430011, China
  • Online: 2024-05-01    Published: 2024-04-29

Abstract: Existing Chinese long text classification models do not take into account Chinese-specific features such as pronunciation and glyph structure, and therefore cannot fully represent Chinese semantic information. Meanwhile, long texts often contain a large amount of information that is either unrelated to the target topic or related to other topics, which leads the classification model to misjudge. To solve these problems, a Chinese long text classification model based on CIMBERT (BERT fused with Chinese input methods) and BLCG (BiLSTM fused with CNN under a gating mechanism) is proposed. Firstly, texts are represented as vectors by a BERT model whose input incorporates Chinese input methods: Pinyin and Wubi, two widely used Chinese character input methods, are applied to enrich the semantic information of Chinese characters. Secondly, BLCG is constructed to extract the overall features of texts, using a bidirectional long short-term memory network (BiLSTM) to capture global features and a convolutional neural network (CNN) to capture local features. The gating mechanism of BLCG dynamically combines the global and local features, correcting the misjudgments caused by text segments that are unrelated to the target topic. Finally, the proposed method is evaluated on the THUCNews dataset and the Sogou corpus. The experimental results show classification accuracies of 97.63% and 95.43% and F1-scores of 97.68% and 95.49%, respectively, indicating that the proposed model outperforms other text classification models.
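To make the input-level fusion concrete, the following is a minimal PyTorch sketch of a BERT-style embedding layer augmented with Pinyin and Wubi embeddings. All identifiers (CIMInputEmbedding, pinyin_ids, wubi_ids), the vocabulary sizes, and the additive fusion are illustrative assumptions; the paper's exact CIMBERT construction may differ.

    import torch
    import torch.nn as nn

    class CIMInputEmbedding(nn.Module):
        # Hypothetical sketch: BERT token + position embeddings augmented
        # with Pinyin (pronunciation) and Wubi (glyph/stroke) embeddings.
        def __init__(self, vocab_size=21128, pinyin_vocab=1500,
                     wubi_vocab=3000, hidden=768, max_len=512):
            super().__init__()
            self.token = nn.Embedding(vocab_size, hidden)
            self.pinyin = nn.Embedding(pinyin_vocab, hidden)   # pronunciation
            self.wubi = nn.Embedding(wubi_vocab, hidden)       # glyph/stroke
            self.position = nn.Embedding(max_len, hidden)
            self.norm = nn.LayerNorm(hidden)
            self.drop = nn.Dropout(0.1)

        def forward(self, token_ids, pinyin_ids, wubi_ids):
            # All id tensors share shape (batch, seq_len): one Pinyin code
            # and one Wubi code per Chinese character.
            pos = torch.arange(token_ids.size(1), device=token_ids.device)
            x = (self.token(token_ids) + self.pinyin(pinyin_ids)
                 + self.wubi(wubi_ids) + self.position(pos))
            return self.drop(self.norm(x))

Because the extra embeddings are summed into the input matrix, in the same way BERT already sums token, position, and segment embeddings, pronunciation and glyph cues participate in every encoder layer rather than being appended after encoding.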

Key words: long text classification, bidirectional encoder representations from Transformers (BERT), convolutional neural network (CNN), long short-term memory (LSTM), gating mechanism
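The dynamic fusion at the heart of BLCG can be sketched in the same spirit: a BiLSTM supplies a global feature vector, multi-kernel convolutions supply a local one, and a sigmoid gate learns how much weight each deserves per example. The kernel sizes, pooling choices, and exact gate form below are assumptions, and num_classes=14 merely matches the commonly used THUCNews category count.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class BLCG(nn.Module):
        # Hypothetical sketch of BiLSTM (global) + CNN (local) features
        # fused by a learned sigmoid gate, as described in the abstract.
        def __init__(self, hidden=768, num_classes=14,
                     kernel_sizes=(2, 3, 4), n_filters=256):
            super().__init__()
            self.bilstm = nn.LSTM(hidden, hidden // 2, batch_first=True,
                                  bidirectional=True)        # global path
            self.convs = nn.ModuleList(
                [nn.Conv1d(hidden, n_filters, k) for k in kernel_sizes])
            self.local_proj = nn.Linear(n_filters * len(kernel_sizes), hidden)
            self.gate = nn.Linear(hidden * 2, hidden)        # fusion gate
            self.classifier = nn.Linear(hidden, num_classes)

        def forward(self, x):               # x: (batch, seq_len, hidden)
            g, _ = self.bilstm(x)           # (batch, seq_len, hidden)
            g = g.mean(dim=1)               # pooled global feature
            c = x.transpose(1, 2)           # (batch, hidden, seq_len)
            c = torch.cat([F.relu(conv(c)).max(dim=2).values
                           for conv in self.convs], dim=1)
            l = self.local_proj(c)          # pooled local feature
            z = torch.sigmoid(self.gate(torch.cat([g, l], dim=1)))
            fused = z * g + (1 - z) * l     # dynamic global/local fusion
            return self.classifier(fused)

Since the gate z is computed from both feature vectors, local patterns from off-topic passages can be down-weighted in favor of the global context, which is the failure mode the abstract describes for models without such gating.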
