Computer Engineering and Applications ›› 2019, Vol. 55 ›› Issue (12): 155-161. DOI: 10.3778/j.issn.1002-8331.1803-0259

• Pattern Recognition and Artificial Intelligence •

Optimal Selection Method for LDA Topics Based on Degree of Overlap and Completeness

BAI Zhi’an (柏志安)1, ZENG Jianping (曾剑平)2

  1. Computer Centre, Rui Jin Hospital, Shanghai Jiao Tong University School of Medicine, Shanghai 200025, China
  2. School of Computer Science, Fudan University, Shanghai 200433, China
  • Online: 2019-06-15 Published: 2019-06-13

Abstract: Many LDA-based topic models can infer the number of topics and their descriptions from a collection of texts, but it remains difficult to determine the number of topics and to choose the feature words that describe each topic. To address this problem, the paper proposes an Overlap-Completeness score for selecting well-separated topics. The method combines LDA with TF-IDF weighting and takes into account both how well the topics cover the original text collection and how independent the topics are from one another. Starting from the LDA result, TF-IDF is used to pick the most representative words for each topic; the degree of overlap between the vocabularies of different topics and the degree of completeness of the topic description are then defined, and together they give an evaluation criterion for topic selection. The method yields both the best number of topics and the most suitable descriptive words for each topic. Experiments on a collection of information security news texts show that, compared with the basic LDA model, the method is better at selecting distinctive topics and representative words.

Key words: LDA model, TF-IDF, topic detection, degree of overlap, degree of completeness
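
The abstract does not give the formulas for the two quantities, so the following is only a minimal sketch of how an overlap/completeness trade-off could be scored, not the authors' definitions. It assumes each candidate topic is already reduced to a short list of representative words (for example, LDA output re-ranked by TF-IDF). The function names `overlap`, `completeness`, and `oc_score`, the use of mean pairwise Jaccard similarity, the token-coverage measure, and the "completeness minus overlap" combination are all hypothetical choices made for illustration.

```python
from itertools import combinations


def overlap(topics):
    """Mean pairwise Jaccard similarity between topic word sets.

    Assumed definition for illustration; the paper's formula may differ.
    """
    sets = [set(t) for t in topics]
    pairs = list(combinations(sets, 2))
    if not pairs:
        return 0.0
    total = 0.0
    for a, b in pairs:
        union = a | b
        if union:  # skip pairs of empty topics to avoid dividing by zero
            total += len(a & b) / len(union)
    return total / len(pairs)


def completeness(topics, docs):
    """Fraction of corpus tokens that appear in at least one topic's word list.

    Assumed definition for illustration; the paper's formula may differ.
    """
    covered = set().union(*(set(t) for t in topics))
    tokens = [w for doc in docs for w in doc]
    if not tokens:
        return 0.0
    return sum(w in covered for w in tokens) / len(tokens)


def oc_score(topics, docs):
    """Hypothetical combined score: reward coverage, penalise shared words."""
    return completeness(topics, docs) - overlap(topics)


# Toy tokenised corpus and two hand-made candidate topic sets (made-up data).
docs = [
    ["virus", "malware", "attack"],
    ["password", "leak", "attack"],
]
candidates = {
    2: [["virus", "malware"], ["password", "leak"]],
    3: [["virus", "attack"], ["attack", "leak"], ["attack", "password"]],
}

# Pick the candidate number of topics with the highest combined score.
best_k = max(candidates, key=lambda k: oc_score(candidates[k], docs))
print(best_k, {k: round(oc_score(t, docs), 3) for k, t in candidates.items()})
```

In this toy setting the three-topic candidate covers more of the corpus but its topics share the word "attack", so the overlap penalty outweighs the coverage gain. In practice the word lists would come from a fitted LDA model for each candidate topic number rather than being written by hand.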