Text Classification Model Based on Statistical Causality and Optimal Transport

doi:10.3778/j.issn.1002-8331.2202-0140

Abstract

Abstract: In recent years, as data scale and computing power have grown, deep learning and related pre-trained models such as CNN and BERT have made rapid progress in text classification. However, these models remain weak at extracting distributional features and generalize poorly in small-sample scenarios. Common remedies are to modify the model structure or to enlarge the training set, but these approaches depend on large amounts of data and on compute-intensive pruning of the network structure. This paper therefore proposes an optimization method for deep learning pre-trained models based on the Granger causality test and optimal transport theory. Starting from the data distribution, the method generates feature pathway structures within the pre-trained model that can stably extract distributional information. On this basis, it uses the optimal transport distance to determine the optimal combination of these feature pathway structures, yielding a multi-view structured representation that is stable in its statistical distribution. Theoretical analysis and experimental results show that the method greatly reduces the data and computing power required for model optimization. Compared with pre-trained models based on convolutional structures such as CNN, it achieves improvements of 5, 7, and 2 percentage points on the 20ng news, Ohsumed, and R8 datasets, respectively; compared with pre-trained models based on the Transformer structure such as BERT, it achieves improvements of 2, 3, and 2 percentage points, respectively.

Key words: text classification, Granger causality test, optimal transport theory, pre-trained model
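
The two statistical tools named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `granger_f_stat` and `wasserstein_1d` are hypothetical, the Granger test is reduced to a plain F-statistic from two least-squares autoregressions with a fixed lag, and the optimal transport distance is restricted to the one-dimensional, equal-sample-size case, where it reduces to comparing sorted samples.

```python
import numpy as np


def granger_f_stat(x, y, lag=1):
    """F-statistic for the hypothesis 'x Granger-causes y' (hypothetical helper).

    Compares a restricted autoregression of y on its own lags with an
    unrestricted one that also includes lags of x.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(y)
    target = y[lag:]
    # Column k holds the series shifted back by k + 1 time steps.
    y_lags = np.column_stack([y[lag - k - 1:n - k - 1] for k in range(lag)])
    x_lags = np.column_stack([x[lag - k - 1:n - k - 1] for k in range(lag)])
    const = np.ones((n - lag, 1))
    restricted = np.hstack([const, y_lags])
    unrestricted = np.hstack([const, y_lags, x_lags])

    def rss(design):
        # Residual sum of squares of an ordinary least-squares fit.
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        resid = target - design @ coef
        return float(resid @ resid)

    rss_r, rss_u = rss(restricted), rss(unrestricted)
    dof = (n - lag) - unrestricted.shape[1]
    return ((rss_r - rss_u) / lag) / (rss_u / dof)


def wasserstein_1d(a, b):
    """1-Wasserstein distance between two equal-size 1-D samples (assumption)."""
    a = np.sort(np.asarray(a, dtype=float))
    b = np.sort(np.asarray(b, dtype=float))
    if a.shape != b.shape:
        raise ValueError("this sketch assumes equal sample sizes")
    # In one dimension the optimal plan matches sorted samples pairwise.
    return float(np.mean(np.abs(a - b)))
```

In a setting like the one the abstract describes, a statistic of the first kind could be used to screen candidate feature pathways, and a distance of the second kind to compare the distributions they produce. How the paper actually applies either tool is not specified in this section.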

NIE Ting, XING Kai, LI Jingjuan. Text Classification Model Based on Statistical Causality and Optimal Transport[J]. Computer Engineering and Applications, 2023, 59(11): 119-130.
