Review of Supervised Topic Models and Applications

doi:10.3778/j.issn.1002-8331.2309-0030

Abstract

Abstract: A topic model is a data mining method that automatically extracts latent patterns or topics from large collections of documents or data and assigns each data item to its corresponding pattern or topic. Topic models have been widely applied to text clustering and classification, topic extraction, topic evolution, sentiment analysis, and summarization. Supervised and unsupervised topic models differ in whether they rely on annotation information. In recent years, supervised topic models have gradually gained ground in data mining, and a growing number of tasks adopt supervised methods for optimization. This review first presents the background of supervised topic models and introduces commonly used datasets and evaluation metrics. It then compares and analyzes the various types of supervised topic models in depth from the perspectives of both models and applications. Finally, it describes the challenges facing current topic model research and outlines future research directions for supervised topic models.

Key words: data mining, supervised topic model, topic prediction, topic evolution

WANG Zhenbiao, XU Zhenshun, LIU Na, ZHANG Wenhao, TANG Zengjin, WANG Zheng'an. Review of Supervised Topic Models and Applications[J]. Computer Engineering and Applications, 2024, 60(8): 56-68.
