
Computer Engineering and Applications ›› 2025, Vol. 61 ›› Issue (4): 192-210. DOI: 10.3778/j.issn.1002-8331.2310-0009
Simple and Effective Weakly Supervised Chinese Text Classification Algorithm
CHEN Zhongtao, ZHOU Yatong (陈中涛, 周亚同)
Online: 2025-02-15
Published: 2025-02-14
Abstract: Most existing seed-word-based weakly supervised text classification algorithms need to search the dataset for all seed words in order to expand the category vocabulary, and seed words that appear infrequently also have weak category-discrimination ability. To address this, a simple and effective weakly supervised Chinese text classification algorithm (SEWClass) is designed. The method uses the initial weights of a pretrained language model to produce an abstract understanding of each text, and on this basis further derives abstract and concrete constraint conditions to construct pseudo-labeled data for the initial training stage. A dimensionality-reduction model and a classifier are jointly constructed according to the number of categories, accommodating two characteristics of weakly supervised text classification: the categories must be specified in advance, and training data must be added during self-training. Under the two constraints, the pseudo-labeled data achieve high precision, and only the dimensionality-reduction model is trained during self-training, which improves both recall and efficiency. SEWClass requires only one seed word per category, such as the category name, to complete the classification task, and its performance does not depend on whether the seed word appears in the dataset. On the two Chinese datasets THUCNews and toutiao, SEWClass substantially outperforms other weakly supervised algorithms.
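To make the pipeline described in the abstract concrete, the following is a minimal, hypothetical Python sketch rather than the authors' implementation. It assumes the documents and the category names (seed words) have already been encoded into sentence embeddings by a pretrained Chinese language model; it interprets the two constraint conditions as a confidence-margin filter plus a per-class top-fraction filter, uses PCA sized by the number of categories as the dimensionality-reduction model, and treats the classifier as nearest projected seed-word embedding. All function names, thresholds, and the choice of PCA are illustrative assumptions.

```python
# Hypothetical sketch of a SEWClass-style pipeline (not the paper's exact design).
# Inputs: doc_emb (n_docs, d) and class_emb (n_classes, d) sentence embeddings
# produced by a pretrained Chinese language model.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.metrics.pairwise import cosine_similarity


def build_pseudo_labels(doc_emb, class_emb, margin=0.05, top_ratio=0.1):
    """Construct high-precision pseudo-labels under two illustrative constraints:
    an 'abstract' constraint (best class must beat the runner-up by a margin)
    and a 'concrete' constraint (keep only the most similar fraction per class)."""
    sims = cosine_similarity(doc_emb, class_emb)            # (n_docs, n_classes)
    best = sims.argmax(axis=1)
    sorted_sims = np.sort(sims, axis=1)
    confident = (sorted_sims[:, -1] - sorted_sims[:, -2]) > margin
    keep = np.zeros(doc_emb.shape[0], dtype=bool)
    for c in range(class_emb.shape[0]):
        idx = np.where((best == c) & confident)[0]
        if idx.size == 0:
            continue
        k = max(1, int(top_ratio * idx.size))               # top fraction per class
        keep[idx[np.argsort(-sims[idx, c])[:k]]] = True
    return best, keep


def self_train(doc_emb, class_emb, n_rounds=3, tau=0.05):
    """Reducer and classifier are built jointly from the number of classes:
    the reducer is a PCA to n_classes dimensions, and the classifier assigns
    each document to the nearest projected seed-word embedding, so only the
    reducer is refit in each self-training round."""
    n_classes = class_emb.shape[0]
    _, pool = build_pseudo_labels(doc_emb, class_emb)       # initial pseudo-labeled pool
    pred = None
    for _ in range(n_rounds):
        reducer = PCA(n_components=n_classes).fit(doc_emb[pool])   # retrain reducer only
        low_docs = reducer.transform(doc_emb)
        low_cls = reducer.transform(class_emb)
        sims = cosine_similarity(low_docs, low_cls)
        pred = sims.argmax(axis=1)
        sorted_sims = np.sort(sims, axis=1)
        pool |= (sorted_sims[:, -1] - sorted_sims[:, -2]) > tau    # grow the training pool
    return pred
```

In this reading, the classifier has no trainable parameters of its own, which is one way to realize the abstract's claim that only the dimensionality-reduction model is trained during self-training; the paper's actual construction of the reducer and classifier may differ.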
CHEN Zhongtao, ZHOU Yatong. Simple and Effective Weakly Supervised Chinese Text Classification Algorithm[J]. Computer Engineering and Applications, 2025, 61(4): 192-210.