Computer Engineering and Applications ›› 2017, Vol. 53 ›› Issue (23): 108-112. DOI: 10.3778/j.issn.1002-8331.1706-0214

• Pattern Recognition and Artificial Intelligence •

Distributed theme feature extraction based on restricted Boltzmann machine

JIANG Yuyan (江雨燕), GUI Wei (桂伟)

  1. School of Management Science and Engineering, Anhui University of Technology, Ma’anshan, Anhui 243002, China
  • Online: 2017-12-01 Published: 2017-12-14

Abstract: With the arrival of the big data era, effectively mining and analyzing topic features from massive text data has become a research focus. Latent Dirichlet allocation (LDA), a classical probabilistic topic model, is widely used because of its strong text analysis capability. However, such models mostly take the form of directed graphs containing latent topic variables, which limits their ability to represent documents. Distributed representation methods, by contrast, define the semantics of a document as distributed across multiple topics and obtained by multiplying multiple topic features. Moreover, traditional unsupervised feature extraction models cannot effectively handle document data that carries category labels. Building on the restricted Boltzmann machine (RBM) and the distributed nature of text topics, this paper therefore proposes NRBM, an RBM-based distributed topic feature extraction model. As a semi-supervised model, NRBM can effectively use the multi-label information in documents. Comparative experiments against the standard LDA topic model demonstrate the superiority of NRBM.

Key words: text data, probabilistic topic model, latent Dirichlet allocation, restricted Boltzmann machine
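
The abstract describes a distributed representation as one obtained by multiplying multiple topic features. The sketch below illustrates that idea for a standard binary RBM only; it is not the NRBM model of the paper, whose details are not given here. It checks numerically that summing exp(-E(v, h)) over all hidden configurations equals a product with one factor per hidden unit. All function names and the toy parameters (W, b, c, v) are assumptions made for illustration.

```python
import itertools
import math


def energy(v, h, W, b, c):
    """Standard binary RBM energy: E(v, h) = -b.v - c.h - sum_ij v_i W_ij h_j."""
    visible_term = sum(bi * vi for bi, vi in zip(b, v))
    hidden_term = sum(cj * hj for cj, hj in zip(c, h))
    interaction = sum(
        v[i] * W[i][j] * h[j] for i in range(len(v)) for j in range(len(h))
    )
    return -(visible_term + hidden_term + interaction)


def marginal_by_sum(v, W, b, c):
    """Unnormalized p(v): brute-force sum of exp(-E) over all hidden states."""
    return sum(
        math.exp(-energy(v, h, W, b, c))
        for h in itertools.product([0, 1], repeat=len(c))
    )


def marginal_by_product(v, W, b, c):
    """The same quantity written as a product of one factor per hidden unit."""
    result = math.exp(sum(bi * vi for bi, vi in zip(b, v)))
    for j, cj in enumerate(c):
        activation = cj + sum(v[i] * W[i][j] for i in range(len(v)))
        result *= 1.0 + math.exp(activation)
    return result


def hidden_probabilities(v, W, c):
    """p(h_j = 1 | v) = sigmoid(c_j + sum_i v_i W_ij) for each hidden unit j."""
    probabilities = []
    for j, cj in enumerate(c):
        activation = cj + sum(v[i] * W[i][j] for i in range(len(v)))
        probabilities.append(1.0 / (1.0 + math.exp(-activation)))
    return probabilities


# Hypothetical toy setup: 4 visible units (word presence) and
# 3 hidden units (topic features). The values are hand-picked, not learned.
W = [[0.5, -0.2, 0.1], [-0.3, 0.8, 0.0], [0.2, 0.1, -0.6], [0.0, -0.4, 0.7]]
b = [0.1, -0.1, 0.0, 0.2]
c = [-0.2, 0.3, 0.1]
v = [1, 0, 1, 1]

by_sum = marginal_by_sum(v, W, b, c)
by_product = marginal_by_product(v, W, b, c)
topic_features = hidden_probabilities(v, W, c)
```

Here `by_sum` and `by_product` agree up to floating-point error, which is the sense in which each hidden unit contributes one multiplicative factor. A directed topic model such as LDA instead mixes its topics additively.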