Computer Engineering and Applications ›› 2022, Vol. 58 ›› Issue (5): 163-171. DOI: 10.3778/j.issn.1002-8331.2008-0349

• Pattern Recognition and Artificial Intelligence •

Semi-supervised Generalized Zero-Shot Learning Using Modal Fusion

LIN Shuang, WANG Xiaojun   

  1. School of Computer Science, Nanjing University of Posts and Telecommunications, Nanjing 210023, China
  2. School of Internet of Things, Nanjing University of Posts and Telecommunications, Nanjing 210003, China
  • Online: 2022-03-01  Published: 2022-03-09

Abstract: Projection domain shift and biased prediction prevent existing approaches from handling generalized zero-shot learning well. Building on the CADA-VAE model, this article proposes a semi-supervised learning scheme based on modal fusion, offering one way to use unlabeled samples and semantic information to help the model carry out intra-modal self-learning. The scheme uses the latent vector space as a bridge for fusing the visual and semantic modalities, and introduces two concepts, the visual centroid and heterogeneous (other-class) semantic latent vectors, to guide mutual learning between the modalities. In the cross-reconstruction stage, a semantic latent vector is cross-reconstructed into visual features of its class, with the visual centroid as the axis; in the feature-encoding stage, visual features are encoded into latent vectors along the negative direction of the heterogeneous semantic latent vectors. This keeps the generated samples diverse without losing discrimination between classes. Comparative experiments on three benchmark datasets show that the model outperforms current mainstream approaches in recognition accuracy and copes well when labeled samples are scarce.

Key words: generalized zero-shot learning, modal fusion, semi-supervised learning, visual centroid
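
The abstract names the visual centroid and the other-class semantic latent vectors but gives no formulas. The sketch below is one possible reading, not the authors' implementation. It assumes a visual centroid is the per-class mean of labeled visual features, and it uses a hypothetical margin score to show what "staying away from other-class semantic latent vectors" could mean. All function names and the toy data are illustrative assumptions.

```python
from collections import defaultdict
from math import sqrt


def visual_centroids(features, labels):
    """Return {class label: mean feature vector} over the labeled samples.

    Assumption: the visual centroid is the per-class mean of visual features.
    """
    sums = {}
    counts = defaultdict(int)
    for vec, label in zip(features, labels):
        if label not in sums:
            sums[label] = [0.0] * len(vec)
        sums[label] = [s + v for s, v in zip(sums[label], vec)]
        counts[label] += 1
    return {c: [s / counts[c] for s in total] for c, total in sums.items()}


def distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def margin_to_other_classes(latent, label, semantic_latents):
    """Hypothetical score: distance to the own-class semantic latent vector
    minus distance to the nearest other-class one. Negative values mean the
    latent vector sits closer to its own class than to any other class."""
    own = distance(latent, semantic_latents[label])
    nearest_other = min(
        distance(latent, z) for c, z in semantic_latents.items() if c != label
    )
    return own - nearest_other


# Toy data: two classes with two 2-D visual features each.
features = [[1.0, 0.0], [3.0, 0.0], [0.0, 2.0], [0.0, 4.0]]
labels = ["cat", "cat", "dog", "dog"]
centroids = visual_centroids(features, labels)

# Toy semantic latent vectors, one per class (illustrative values only).
semantic_latents = {"cat": [2.0, 0.0], "dog": [0.0, 3.0]}
margin = margin_to_other_classes([2.0, 0.0], "cat", semantic_latents)
```

In a trained model these quantities would live in the learned latent space and enter the loss; here they are computed on raw toy vectors only to make the two concepts concrete.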