
Computer Engineering and Applications ›› 2025, Vol. 61 ›› Issue (4): 192-210. DOI: 10.3778/j.issn.1002-8331.2310-0009
Simple and Effective Weakly Supervised Chinese Text Classification Algorithm
CHEN Zhongtao, ZHOU Yatong (陈中涛, 周亚同)
Online: 2025-02-15
Published: 2025-02-14
Abstract: Most existing seed-word-based weakly supervised text classification algorithms need to search the dataset for all seed words in order to expand the category vocabulary, and seed words that appear infrequently also have weak category-discrimination ability. To address this, a simple and effective weakly supervised Chinese text classification algorithm (SEWClass) is designed. The method uses the initial weights of a pretrained language model to produce an abstract understanding of each text, and on this basis further derives abstract and concrete constraint conditions to construct pseudo-labeled data for the initial training stage. A dimensionality-reduction model and a classifier are jointly constructed according to the number of categories, accommodating two characteristics of weakly supervised text classification: the categories must be specified in advance, and training data must be added during self-training. Under the two constraints, the pseudo-labeled data achieve high precision, and only the dimensionality-reduction model is trained during self-training, which improves both recall and efficiency. SEWClass requires only one seed word per category, such as the category name, to complete the classification task, and its performance does not depend on whether the seed word appears in the dataset. On the two Chinese datasets THUCNews and toutiao, SEWClass substantially outperforms other weakly supervised algorithms.
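To make the pipeline described in the abstract concrete, the following is a minimal, hypothetical Python sketch rather than the authors' implementation. It assumes the documents and the category names (seed words) have already been encoded into sentence embeddings by a pretrained Chinese language model; it interprets the two constraint conditions as a confidence-margin filter plus a per-class top-fraction filter, uses PCA sized by the number of categories as the dimensionality-reduction model, and treats the classifier as nearest projected seed-word embedding. All function names, thresholds, and the choice of PCA are illustrative assumptions.

```python
# Hypothetical sketch of a SEWClass-style pipeline (not the paper's exact design).
# Inputs: doc_emb (n_docs, d) and class_emb (n_classes, d) sentence embeddings
# produced by a pretrained Chinese language model.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.metrics.pairwise import cosine_similarity


def build_pseudo_labels(doc_emb, class_emb, margin=0.05, top_ratio=0.1):
    """Construct high-precision pseudo-labels under two illustrative constraints:
    an 'abstract' constraint (best class must beat the runner-up by a margin)
    and a 'concrete' constraint (keep only the most similar fraction per class)."""
    sims = cosine_similarity(doc_emb, class_emb)            # (n_docs, n_classes)
    best = sims.argmax(axis=1)
    sorted_sims = np.sort(sims, axis=1)
    confident = (sorted_sims[:, -1] - sorted_sims[:, -2]) > margin
    keep = np.zeros(doc_emb.shape[0], dtype=bool)
    for c in range(class_emb.shape[0]):
        idx = np.where((best == c) & confident)[0]
        if idx.size == 0:
            continue
        k = max(1, int(top_ratio * idx.size))               # top fraction per class
        keep[idx[np.argsort(-sims[idx, c])[:k]]] = True
    return best, keep


def self_train(doc_emb, class_emb, n_rounds=3, tau=0.05):
    """Reducer and classifier are built jointly from the number of classes:
    the reducer is a PCA to n_classes dimensions, and the classifier assigns
    each document to the nearest projected seed-word embedding, so only the
    reducer is refit in each self-training round."""
    n_classes = class_emb.shape[0]
    _, pool = build_pseudo_labels(doc_emb, class_emb)       # initial pseudo-labeled pool
    pred = None
    for _ in range(n_rounds):
        reducer = PCA(n_components=n_classes).fit(doc_emb[pool])   # retrain reducer only
        low_docs = reducer.transform(doc_emb)
        low_cls = reducer.transform(class_emb)
        sims = cosine_similarity(low_docs, low_cls)
        pred = sims.argmax(axis=1)
        sorted_sims = np.sort(sims, axis=1)
        pool |= (sorted_sims[:, -1] - sorted_sims[:, -2]) > tau    # grow the training pool
    return pred
```

In this reading, the classifier has no trainable parameters of its own, which is one way to realize the abstract's claim that only the dimensionality-reduction model is trained during self-training; the paper's actual construction of the reducer and classifier may differ.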
CHEN Zhongtao, ZHOU Yatong. Simple and Effective Weakly Supervised Chinese Text Classification Algorithm[J]. Computer Engineering and Applications, 2025, 61(4): 192-210.