Document clustering based on association relations between terms

Computer Engineering and Applications, 2016, Vol. 52, Issue (7): 86-90.

REN Jianhua, SHEN Yanbin, MENG Xiangfu, WANG Wei

School of Electronic and Information Engineering, Liaoning Technical University, Huludao, Liaoning 125105, China

Online: 2016-04-01 Published: 2016-04-19

Abstract

Abstract: The existing vector space model ignores the semantic relationships between terms when representing documents. To address this shortcoming, this paper proposes a new document vector representation method based on association rules. Within the generalized vector space model, the frequent co-occurrence relations between terms are analyzed to obtain term co-occurrence semantics. Association rules are then used to analyze the correlation between associated terms, mining the latent association semantics between terms in the documents. Finally, documents are represented by a linear weighting of the co-occurrence semantics and the association semantics. Experimental results show that, compared with the BOW model and the GVSM model, document clustering with the association-rule-based document vector representation produces more accurate results.

Key words: document clustering, association relations, term co-occurrence, document similarity, latent semantics
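
The abstract gives no formulas, so the following Python code is only a minimal sketch of the pipeline it outlines, not the authors' implementation. Everything specific in it is an assumption: the co-occurrence semantics are taken to be the cosine between term columns of the term-frequency matrix (a common GVSM choice), the association semantics are taken to be the confidence of single-term rules that pass hypothetical `min_support` and `min_conf` thresholds, and `alpha` is a hypothetical name for the linear weighting factor. The toy documents are made up for illustration.

```python
from math import sqrt

def bag_of_words(docs, vocab):
    """BOW baseline: one term-frequency row per tokenized document."""
    return [[doc.count(term) for term in vocab] for doc in docs]

def cooccurrence_matrix(bow):
    """Assumed GVSM-style term-term matrix: cosine between term columns."""
    n_terms = len(bow[0])
    cols = [[row[j] for row in bow] for j in range(n_terms)]
    norms = [sqrt(sum(x * x for x in col)) for col in cols]
    C = [[0.0] * n_terms for _ in range(n_terms)]
    for i in range(n_terms):
        for j in range(n_terms):
            if norms[i] > 0 and norms[j] > 0:
                dot = sum(a * b for a, b in zip(cols[i], cols[j]))
                C[i][j] = dot / (norms[i] * norms[j])
    return C

def association_matrix(bow, min_support=0.3, min_conf=0.6):
    """Assumed association semantics: confidence of rule t_i -> t_j,
    kept only when the rule passes both (hypothetical) thresholds."""
    n_docs, n_terms = len(bow), len(bow[0])
    present = [[1 if x > 0 else 0 for x in row] for row in bow]
    A = [[0.0] * n_terms for _ in range(n_terms)]
    for i in range(n_terms):
        A[i][i] = 1.0  # every term is fully associated with itself
        df_i = sum(row[i] for row in present)
        for j in range(n_terms):
            if i == j or df_i == 0:
                continue
            both = sum(row[i] * row[j] for row in present)
            support, conf = both / n_docs, both / df_i
            if support >= min_support and conf >= min_conf:
                A[i][j] = conf
    return A

def combine(C, A, alpha=0.5):
    """Linear weighting of co-occurrence and association semantics."""
    return [[alpha * c + (1 - alpha) * a for c, a in zip(row_c, row_a)]
            for row_c, row_a in zip(C, A)]

def represent(bow, S):
    """Project each BOW row through the term-term matrix S (d' = d . S)."""
    n_terms = len(S)
    return [[sum(row[i] * S[i][j] for i in range(n_terms))
             for j in range(n_terms)] for row in bow]

def cosine(u, v):
    """Cosine similarity between two document vectors."""
    nu = sqrt(sum(x * x for x in u))
    nv = sqrt(sum(x * x for x in v))
    if nu == 0 or nv == 0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (nu * nv)

# Toy corpus (made up): each document is a list of tokens.
docs = [
    ["data", "mining", "cluster"],
    ["data", "mining", "rule"],
    ["cluster", "document"],
    ["document", "text"],
]
vocab = sorted({term for doc in docs for term in doc})
bow = bag_of_words(docs, vocab)
C = cooccurrence_matrix(bow)
A = association_matrix(bow)
S = combine(C, A, alpha=0.5)
reps = represent(bow, S)
sims = [[cosine(u, v) for v in reps] for u in reps]
```

The pairwise similarities in `sims` could then be passed to any standard clustering routine, such as k-means or hierarchical clustering. Which clustering algorithm and which threshold values the paper actually uses cannot be determined from the abstract.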

REN Jianhua, SHEN Yanbin, MENG Xiangfu, WANG Wei. Document clustering based on association relations between terms[J]. Computer Engineering and Applications, 2016, 52(7): 86-90.
