Research on text categorization based on LDA

Computer Engineering and Applications ›› 2011, Vol. 47 ›› Issue (13): 150-153.

• Database, Signal and Information Processing •

YAO Quanzhu (姚全珠), SONG Zhili (宋志理), PENG Cheng (彭程)

School of Computer Science & Engineering, Xi'an University of Technology, Xi'an 710048, China

Received: 1900-01-01 Revised: 1900-01-01 Online: 2011-05-01 Published: 2011-05-01

Abstract

Traditional dimensionality reduction algorithms show their limitations when text corpora are high-dimensional and large-scale. This paper presents a Chinese text categorization algorithm based on Latent Dirichlet Allocation (LDA). Within the discriminative framework of the Support Vector Machine (SVM), LDA provides a generative probabilistic model of the text corpus that reduces each document to a fixed set of real-valued features: its probability distribution over a set of latent topics. Parameters are estimated by Gibbs sampling. Modeling the corpus yields a latent topic-document matrix, on which the SVM is trained. The optimal number of topics is chosen with a standard method from Bayesian statistics. Experiments show that the proposed method is practicable and effective compared with SVM classifiers that represent text with the Vector Space Model (VSM) or with Latent Semantic Indexing (LSI).

Key words: text categorization, Latent Dirichlet Allocation (LDA), Gibbs sampling, Bayesian statistical theory
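The abstract describes representing each document by its distribution over latent topics, estimated with Gibbs sampling. A minimal sketch of that step is given here in pure Python, assuming a collapsed Gibbs sampler with symmetric Dirichlet priors. The function name `lda_gibbs`, the hyperparameter values `alpha` and `beta`, the iteration count, and the toy corpus of integer word ids are all illustrative assumptions, not details taken from the paper. The SVM training stage is omitted.

```python
import random


def lda_gibbs(docs, num_topics, vocab_size, alpha=0.5, beta=0.1, iters=200, seed=0):
    """Collapsed Gibbs sampling for LDA (illustrative sketch).

    docs is a list of documents, each a list of integer word ids.
    Returns theta, where theta[d][k] estimates P(topic k | document d).
    """
    rng = random.Random(seed)
    n_dt = [[0] * num_topics for _ in docs]               # topic counts per document
    n_tw = [[0] * vocab_size for _ in range(num_topics)]  # word counts per topic
    n_t = [0] * num_topics                                # total tokens per topic
    z = []                                                # topic assigned to each token

    # Random initial topic assignment for every token.
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            t = rng.randrange(num_topics)
            z_d.append(t)
            n_dt[d][t] += 1
            n_tw[t][w] += 1
            n_t[t] += 1
        z.append(z_d)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                t = z[d][i]
                n_dt[d][t] -= 1
                n_tw[t][w] -= 1
                n_t[t] -= 1
                # Unnormalized full conditional P(z_i = k | all other assignments, words).
                weights = [
                    (n_dt[d][k] + alpha) * (n_tw[k][w] + beta) / (n_t[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                t = rng.choices(range(num_topics), weights=weights)[0]
                z[d][i] = t
                n_dt[d][t] += 1
                n_tw[t][w] += 1
                n_t[t] += 1

    # Smoothed document-topic distributions: one fixed-length feature row per document.
    return [
        [(n_dt[d][k] + alpha) / (len(doc) + num_topics * alpha) for k in range(num_topics)]
        for d, doc in enumerate(docs)
    ]


# Toy corpus (hypothetical): four short documents over a six-word vocabulary.
docs = [[0, 1, 0, 2], [3, 4, 3, 5], [0, 2, 1, 1], [4, 5, 3, 3]]
theta = lda_gibbs(docs, num_topics=2, vocab_size=6)
for row in theta:
    print([round(p, 3) for p in row])
```

Each row of `theta` has length equal to the number of topics regardless of vocabulary size, which is what makes it usable as a low-dimensional input to a classifier such as an SVM.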

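Abstract (Chinese version): To address the limitations of traditional dimensionality reduction algorithms on high-dimensional, large-scale text classification, a text classification algorithm based on the LDA model is proposed. Within the framework of the discriminative SVM model, the LDA probabilistic generative model is applied to model the topics of the document collection, and an SVM is trained on the collection's latent topic-document matrix to build the text classifier. Parameter inference uses Gibbs sampling, and each document is represented as a probability distribution over a fixed set of latent topics. A standard method from Bayesian statistical theory is used to determine the optimal number of topics T. Classification experiments on a corpus show that the method achieves better classification performance than text representations using VSM combined with SVM or LSI combined with SVM.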

Key words (Chinese version): text categorization, Latent Dirichlet Allocation (LDA) model, Gibbs sampling, Bayesian statistical theory
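The abstract says only that a standard Bayesian method selects the number of topics T. One widely used approach of this kind scores each candidate T by an estimate of log P(w | T), computed as the harmonic mean of the likelihoods P(w | z, T) over Gibbs samples, and keeps the T with the highest score. Whether the paper uses exactly this estimator is an assumption. The sketch below shows the two quantities involved, with hypothetical function names and a symmetric Dirichlet prior `beta` on topic-word distributions.

```python
from math import exp, lgamma, log


def log_p_words_given_z(topic_word_counts, beta):
    """log P(w | z) with topic-word distributions integrated out.

    topic_word_counts[j][w] is the number of times word w is assigned to topic j
    in one Gibbs sample; beta is the symmetric Dirichlet hyperparameter.
    """
    num_topics = len(topic_word_counts)
    vocab_size = len(topic_word_counts[0])
    # Normalizing constant of the Dirichlet prior, once per topic.
    total = num_topics * (lgamma(vocab_size * beta) - vocab_size * lgamma(beta))
    for counts in topic_word_counts:
        total += sum(lgamma(n + beta) for n in counts)
        total -= lgamma(sum(counts) + vocab_size * beta)
    return total


def log_evidence_harmonic_mean(sample_log_likelihoods):
    """Harmonic-mean estimate of log P(w | T) from per-sample log P(w | z, T)."""
    negated = [-ll for ll in sample_log_likelihoods]
    shift = max(negated)  # log-sum-exp shift for numerical stability
    log_sum = shift + log(sum(exp(v - shift) for v in negated))
    return log(len(negated)) - log_sum


# Sanity check on a case with a known answer: one topic, two words, beta = 1,
# and a single observed token. The marginal probability of that token is 1/2.
print(log_p_words_given_z([[1, 0]], beta=1.0), log(0.5))
```

In practice the sampler would be run for each candidate T, `log_p_words_given_z` evaluated on several post-burn-in samples, and the resulting estimates compared across T. The harmonic-mean estimator is known to have high variance, so the comparison is best read as a rough guide.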

YAO Quanzhu, SONG Zhili, PENG Cheng. Research on text categorization based on LDA[J]. Computer Engineering and Applications, 2011, 47(13): 150-153.
