Computer Engineering and Applications ›› 2022, Vol. 58 ›› Issue (17): 148-157. DOI: 10.3778/j.issn.1002-8331.2112-0192

• Pattern Recognition and Artificial Intelligence •

含缺失标签的大规模多标签分类算法

刘依璐,曹付元   

  1. 山西大学 计算机与信息技术学院,太原 030006
  2. 山西大学 计算智能与中文信息处理教育部重点实验室,太原 030006
  • 出版日期:2022-09-01 发布日期:2022-09-01

Large-Scale Multi-Label Classification Algorithm with Missing Labels

LIU Yilu, CAO Fuyuan   

  1. School of Computer and Information Technology, Shanxi University, Taiyuan 030006, China
  2. Key Laboratory of Computational Intelligence and Chinese Information Processing of Ministry of Education, Shanxi University, Taiyuan 030006, China
  • Online:2022-09-01 Published:2022-09-01

摘要: 在对大规模多标签数据进行人工标注时极易产生标签的缺失。现有算法大多利用被所有实例共享的全局标签相关性来解决该问题,即对不同实例而言,标签之间的相关性是相同的。然而在实际应用中,不同实例的标签相关性并非完全相同,此时采用局部方式获取的标签相关性将更加准确。因此,本文提出一种基于局部标签相关性的解决方法。该方法利用局部标签相关性来恢复缺失标签,利用低秩矩阵分解技术来构造适用于大规模数据的分类器。此外,为了加快模型的训练,该方法将这两个过程融合到一个统一的框架中,并采用迭代优化的方式进行求解。大量实验结果表明,该方法在预测准确度上至少比现有算法高2个百分点,在训练速度上至少提升5个百分点。

关键词: 多标签分类, 缺失标签, 大规模标签, 局部标签相关性, 低秩矩阵分解

Abstract: Labels are easily missed when large-scale multi-label data are annotated manually. Most existing algorithms address this problem with global label correlations shared by all instances, that is, they assume the correlations between labels are the same for every instance. In practical applications, however, label correlations are not identical across instances, and correlations obtained locally are more accurate. This paper therefore proposes a method based on local label correlations. The method exploits local label correlations to recover missing labels and uses low-rank matrix factorization to construct a classifier suitable for large-scale data. Furthermore, to speed up model training, the two processes are integrated into a unified framework that is solved by iterative optimization. Extensive experimental results show that the method improves prediction accuracy by at least 2 percentage points over existing algorithms and improves training speed by at least 5 percentage points.

Key words: multi-label classification, missing labels, large-scale labels, local label correlations, low-rank matrix factorization
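
The abstract names two ingredients: recovering missing labels from local label correlations, and a low-rank factorized classifier trained by iterative optimization. The following is a minimal NumPy sketch of that general idea, not the authors' algorithm. Everything in it is an assumption made for illustration: the function names `recover_labels` and `fit_low_rank`, the use of a precomputed `groups` array to define "local" neighborhoods, the conditional co-occurrence score, the ridge penalty `lam`, and the alternating least-squares updates. For brevity the sketch runs label recovery once before fitting, whereas the paper describes both steps inside one unified framework.

```python
import numpy as np

def recover_labels(Y, groups):
    """Fill unobserved entries of a 0/1 label matrix with soft scores.

    Assumption: `groups` assigns each instance to a local group, and label
    correlations are estimated separately inside each group.
    """
    Y = Y.astype(float)
    Y_hat = Y.copy()
    for g in np.unique(groups):
        idx = np.where(groups == g)[0]
        Yg = Y[idx]
        counts = Yg.sum(axis=0)                    # label frequencies in this group
        co = Yg.T @ Yg                             # label co-occurrence counts
        C = co / np.maximum(counts, 1.0)[:, None]  # C[i, j] ~ P(label j | label i)
        n_obs = np.maximum(Yg.sum(axis=1, keepdims=True), 1.0)
        scores = (Yg @ C) / n_obs                  # average over observed labels
        Y_hat[idx] = np.maximum(Yg, scores)        # observed positives stay at 1
    return Y_hat

def fit_low_rank(X, Y, rank=2, lam=0.1, n_iter=10, seed=0):
    """Fit W = U @ V.T by alternating ridge least squares (illustrative only)."""
    rng = np.random.default_rng(seed)
    d, n_labels = X.shape[1], Y.shape[1]
    U = rng.normal(scale=0.1, size=(d, rank))
    V = rng.normal(scale=0.1, size=(n_labels, rank))
    XtX, XtY = X.T @ X, X.T @ Y
    losses = []
    for _ in range(n_iter):
        # V-step: ridge regression of Y on the projected features Z = X U.
        Z = X @ U
        V = np.linalg.solve(Z.T @ Z + lam * np.eye(rank), Z.T @ Y).T
        # U-step: solve XtX U (V^T V) + lam U = XtY V using column-major vec.
        A = np.kron(V.T @ V, XtX) + lam * np.eye(d * rank)
        b = (XtY @ V).reshape(-1, order="F")
        U = np.linalg.solve(A, b).reshape(d, rank, order="F")
        resid = X @ U @ V.T - Y
        losses.append(np.sum(resid ** 2) + lam * (np.sum(U ** 2) + np.sum(V ** 2)))
    return U, V, losses

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X = rng.normal(size=(40, 5))
    Y_full = (X @ rng.normal(size=(5, 6)) > 0).astype(float)
    Y_obs = Y_full * (rng.random(Y_full.shape) > 0.3)  # hide about 30% of entries
    groups = (X[:, 0] > 0).astype(int)                 # hypothetical local groups
    U, V, losses = fit_low_rank(X, recover_labels(Y_obs, groups))
    print("objective per iteration:", np.round(losses, 3))
```

Storing the classifier as two factors `U` (features by rank) and `V` (labels by rank) keeps the parameter count proportional to the rank rather than to the product of feature and label dimensions, which is the usual motivation for low-rank models on large label sets. The Kronecker-based U-step is only practical for small feature dimensions; a gradient-based update would be a natural substitute at scale.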