Computer Engineering and Applications ›› 2024, Vol. 60 ›› Issue (8): 56-68. DOI: 10.3778/j.issn.1002-8331.2309-0030
WANG Zhenbiao (王振彪), XU Zhenshun (徐贞顺), LIU Na (刘纳), ZHANG Wenhao (张文豪), TANG Zengjin (唐增金), WANG Zheng’an (王正安)
Online: 2024-04-15
Published: 2024-04-15
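Abstract: Topic modeling is a data mining method that automatically extracts latent patterns or topics from large collections of documents or data and assigns each data item to its corresponding pattern or topic. Topic models are widely used for text clustering and classification, topic extraction, topic evolution analysis, sentiment analysis, and summarization. Supervised and unsupervised topic models differ in whether they rely on label information. In recent years, supervised topic models have gained ground in data mining, and a growing number of tasks adopt supervised methods for optimization. This paper reviews supervised topic models, introduces commonly used datasets and evaluation metrics, and compares the various types of supervised topic models in depth from both the modeling and the application perspectives. Finally, it describes the challenges facing current topic model research and discusses future research directions for supervised topic models.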
Cite this article: WANG Zhenbiao, XU Zhenshun, LIU Na, ZHANG Wenhao, TANG Zengjin, WANG Zheng’an. Review of Supervised Topic Models and Applications[J]. Computer Engineering and Applications, 2024, 60(8): 56-68.