Imbalanced Text Categorization Method with SLDA Topic Model

doi:10.3778/j.issn.1002-8331.2003-0240

Computer Engineering and Applications, 2021, Vol. 57, Issue 12: 144-154. DOI: 10.3778/j.issn.1002-8331.2003-0240

TANG Huanling, LIU Yanhong, ZHENG Han, DOU Quansheng, LU Mingyu

1. School of Computer Science and Technology, Shandong Technology and Business University, Yantai, Shandong 264005, China
2. Co-innovation Center of Shandong Colleges and Universities: Future Intelligent Computing, Yantai, Shandong 264005, China
3. Key Laboratory of Intelligent Information Processing in Universities of Shandong (Shandong Technology and Business University), Yantai, Shandong 264005, China
4. Information Science and Technology College, Dalian Maritime University, Dalian, Liaoning 116026, China

Online: 2021-06-15 Published: 2021-06-10

Abstract:

Supervised classification algorithms usually perform well on datasets that have a balanced label distribution and enough labeled samples. In real applications, however, the label distribution is often imbalanced, and classification performance degrades accordingly. To address this, the paper builds on the supervised topic model SLDA (Supervised LDA) and proposes a new imbalanced text categorization algorithm, ITC-SLDA (Imbalanced Text Categorization based on Supervised LDA). Using the SLDA topic model, it establishes an accurate mapping between topics and rare classes in order to improve classification accuracy on minority classes. The SLDA model is then used to label unlabeled samples. A new confidence measure for unlabeled samples and a class-constrained sampling strategy are proposed so that unlabeled samples are sampled effectively, which reduces the skewness of the imbalanced text data and improves categorization performance. Experimental results show that the proposed method clearly improves Macro-F1 and G-mean values in imbalanced text categorization tasks.

Key words: supervised topic model, semi-supervised learning, imbalanced text, categorization
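The abstract names two evaluation measures (Macro-F1 and G-mean) and a class-constrained, confidence-based sampling step, but gives no formulas. The sketch below is only an illustration under stated assumptions. Macro-F1 is taken as the unweighted mean of per-class F1, and G-mean as the geometric mean of per-class recall, which is a common multi-class definition and may differ from the one used in the paper. The function `sample_pseudo_labels`, its `threshold` parameter, and its per-class quota rule are hypothetical stand-ins, not the ITC-SLDA confidence measure or sampling strategy. The confidence values are assumed to come from some classifier, such as an SLDA model.

```python
from math import prod


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over the classes present in y_true."""
    classes = sorted(set(y_true))
    scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def g_mean(y_true, y_pred):
    """Geometric mean of per-class recall (assumed multi-class form)."""
    classes = sorted(set(y_true))
    recalls = []
    for c in classes:
        support = sum(t == c for t in y_true)
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        recalls.append(tp / support)
    return prod(recalls) ** (1.0 / len(recalls))


def sample_pseudo_labels(labeled_counts, predictions, threshold=0.8):
    """Pick confident pseudo-labeled documents, capped per class.

    labeled_counts: dict mapping class -> number of labeled documents.
    predictions: list of (doc_id, predicted_class, confidence) tuples.
    Hypothetical quota rule: a class may receive at most
    (largest class size - its own size) new documents, so the largest
    class receives none. Neither the quota nor the threshold is taken
    from the paper.
    """
    target = max(labeled_counts.values())
    quota = {c: target - n for c, n in labeled_counts.items()}
    chosen = []
    # Visit the most confident predictions first.
    for doc_id, cls, conf in sorted(predictions, key=lambda r: -r[2]):
        if conf >= threshold and quota.get(cls, 0) > 0:
            chosen.append((doc_id, cls))
            quota[cls] -= 1
    return chosen
```

The quota ties the number of accepted pseudo-labels to each class's shortfall, so only under-represented classes grow and the class skew shrinks. A macro-averaged F1 and a recall-based G-mean both weight every class equally, which is why they are sensitive to minority-class errors that plain accuracy would hide.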

TANG Huanling, LIU Yanhong, ZHENG Han, DOU Quansheng, LU Mingyu. Imbalanced Text Categorization Method with SLDA Topic Model[J]. Computer Engineering and Applications, 2021, 57(12): 144-154.

