Research on Text Clustering Based on Sentence Vector and Convolutional Neural Network

doi:10.3778/j.issn.1002-8331.2104-0203

Abstract

To address the high dimensionality of text features in text clustering and the neglect of word order and semantics in documents, this paper proposes a text feature extraction method for text clustering based on Doc2vec and convolutional neural networks (CNN). First, the Doc2vec model converts the texts in the training dataset into sentence vectors, which takes the order and semantics of the document words into account. Then, a CNN extracts the deep semantic features of the text, which reduces the feature dimensionality and yields text feature vectors that can be used for clustering. Finally, the k-means algorithm clusters these vectors. Experimental results on crawled Sogou news data show that the proposed text clustering model reaches an accuracy of 0.776 and an F-score of 0.780, both higher than those of the other text clustering models compared.
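The final clustering stage can be illustrated with a minimal, standard-library-only sketch. It is not the authors' implementation. The feature vectors and class names below are made-up stand-ins for the CNN outputs, the farthest-first initialization is an assumption chosen to keep the run deterministic, and purity is used as one common accuracy-style measure, since the abstract does not define how accuracy was computed.

```python
from collections import Counter


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def farthest_first_centers(points, k):
    """Deterministic initialization (an assumption, not from the paper):
    start from the first point, then repeatedly add the point that is
    farthest from all centers chosen so far."""
    centers = [list(points[0])]
    while len(centers) < k:
        nxt = max(points, key=lambda p: min(sq_dist(p, c) for c in centers))
        centers.append(list(nxt))
    return centers


def kmeans(points, k, max_iters=100):
    """Plain Lloyd's algorithm; returns one cluster label per point."""
    centers = farthest_first_centers(points, k)
    labels = None
    for _ in range(max_iters):
        new_labels = [
            min(range(k), key=lambda c: sq_dist(p, centers[c])) for p in points
        ]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous center
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def purity(labels, truth):
    """Fraction of points that belong to the majority class of their cluster."""
    correct = 0
    for c in set(labels):
        classes = [t for lab, t in zip(labels, truth) if lab == c]
        correct += Counter(classes).most_common(1)[0][1]
    return correct / len(labels)


# Hypothetical low-dimensional feature vectors standing in for CNN outputs.
features = [
    [0.0, 0.0, 0.0], [0.1, 0.0, 0.0], [0.0, 0.1, 0.0],
    [5.0, 5.0, 5.0], [5.1, 5.0, 5.0], [5.0, 5.1, 5.0],
    [10.0, 0.0, 0.0], [10.1, 0.0, 0.0], [10.0, 0.1, 0.0],
]
# Hypothetical class names, used only for evaluation.
truth = ["sports"] * 3 + ["finance"] * 3 + ["tech"] * 3

labels = kmeans(features, k=3)
score = purity(labels, truth)
print(labels, score)
```

The cluster indices themselves are arbitrary. Only the grouping matters, which is why the evaluation maps each cluster to its majority class before counting matches.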

Keywords: convolutional neural networks (CNN), Doc2vec, text representation, text clustering

JIA Junxia, WANG Huizhen, REN Kai, KANG Wen. Research on Text Clustering Based on Sentence Vector and Convolutional Neural Network[J]. Computer Engineering and Applications, 2022, 58(16): 123-128.

