Text feature selection method based on information entropy and dynamic clustering

Abstract

Abstract: Based on the structural characteristics of scientific literature, a four-layer mining model is constructed and a text feature selection method for scientific literature classification is proposed. The method first divides a scientific document into four layers according to its structure, then extracts feature words from the first three layers, layer by layer, using K-means clustering, and finally uses the Apriori algorithm to find the maximal frequent itemsets of the fourth layer, which serve as that layer's feature word set. Because K-means is strongly affected by the choice of initial centers, the method first weights the clustering objects by information entropy to correct the distance function between objects, and then uses the weighting function values from the initial clustering to select more suitable initial cluster centers. In addition, a standard value is set for the termination condition of K-means to reduce the number of iterations and therefore the learning time, and redundant information produced by dynamic changes in the data is removed to reduce interference during dynamic clustering, so that the clustering becomes more accurate and more efficient. With these measures, the method finds feature words in a literature corpus more accurately than previous methods and is particularly well suited to scientific literature. Experimental results show that, when the data volume is large, the proposed method combined with the improved K-means algorithm achieves higher performance in scientific literature classification.

Key words: K-means algorithm, dynamic clustering, feature selection, information entropy
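The abstract does not give the paper's formulas, so the following is only a minimal sketch of the general idea, assuming the standard entropy-weight method for the feature weights. The function names (`entropy_weights`, `initial_centers`, `weighted_kmeans`), the score-based way of picking initial centers, and the `tol` threshold are illustrative assumptions, not the author's implementation. The removal of redundant information during dynamic clustering is not modeled here.

```python
import numpy as np

def entropy_weights(X):
    """Entropy-weight method (assumed here): a feature whose values are
    spread unevenly over the objects has low entropy and gets a larger weight.
    X must be non-negative (e.g. term frequencies) with at least 2 rows."""
    n, m = X.shape
    col_sums = X.sum(axis=0)
    P = X / np.where(col_sums == 0, 1.0, col_sums)   # column-wise proportions
    logP = np.log(np.where(P > 0, P, 1.0))           # treat 0 * log(0) as 0
    H = -(P * logP).sum(axis=0) / np.log(n)          # normalized entropy in [0, 1]
    d = np.clip(1.0 - H, 0.0, None)                  # degree of divergence
    if d.sum() == 0:
        return np.full(m, 1.0 / m)                   # fall back to equal weights
    return d / d.sum()

def initial_centers(X, w, k):
    """Hypothetical selection rule: sort objects by their weighted score,
    split them into k groups, and use each group's mean as a starting center."""
    scores = X @ w
    groups = np.array_split(np.argsort(scores), k)   # requires n >= k
    return np.array([X[g].mean(axis=0) for g in groups])

def weighted_kmeans(X, k, tol=1e-4, max_iter=100):
    """K-means with an entropy-weighted Euclidean distance. Iteration stops
    once no center moves by more than tol (the 'standard value' is assumed
    to be such a threshold)."""
    X = np.asarray(X, dtype=float)
    w = entropy_weights(X)
    centers = initial_centers(X, w, k)
    for iteration in range(1, max_iter + 1):
        diff = X[:, None, :] - centers[None, :, :]
        dist2 = (w * diff ** 2).sum(axis=2)          # weighted squared distances
        labels = dist2.argmin(axis=1)
        new_centers = centers.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members):                         # keep old center if cluster is empty
                new_centers[j] = members.mean(axis=0)
        shift = np.sqrt((w * (new_centers - centers) ** 2).sum(axis=1)).max()
        centers = new_centers
        if shift < tol:
            break
    return labels, centers, w, iteration

# Toy term-frequency matrix: rows are documents, columns are candidate terms.
# The third term occurs equally everywhere and should receive almost no weight.
X_demo = np.array([[5, 0, 1], [6, 0, 1], [5, 1, 1],
                   [0, 5, 1], [0, 6, 1], [1, 5, 1]], dtype=float)
labels, centers, weights, n_iter = weighted_kmeans(X_demo, k=2)
print("weights:", weights)
print("labels:", labels, "iterations:", n_iter)
```

In this sketch a term that is distributed uniformly across documents contributes almost nothing to the distance, which is one plausible reading of how entropy weighting could make the clustering less sensitive to uninformative terms.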

TANG Lili (唐立力). Text feature selection method based on information entropy and dynamic clustering[J]. Computer Engineering and Applications, 2015, 51(19): 152-157.

