Computer Engineering and Applications, 2025, Vol. 61, Issue (23): 149-160. DOI: 10.3778/j.issn.1002-8331.2409-0007

• Pattern Recognition and Artificial Intelligence •

Negative Pseudo-Label Analysis for Semi-Supervised Action Recognition in Video Transformer

LUO Deyan (罗德艳), XU Yang (徐杨), ZUO Fengyun (左锋云), WANG Minggang (王明刚)

  1. College of Big Data and Information Engineering, Guizhou University, Guiyang 550025, China
  2. Zunyi Aluminum Stock Corporation Co., Ltd., Zunyi, Guizhou 563100, China
  • Online: 2025-12-01 Published: 2025-12-01

Abstract: Action recognition is a pattern recognition task that identifies and classifies human actions or behaviors by analyzing videos or image sequences. With the rapid growth in video data, semi-supervised learning has been introduced into action recognition models, but classification accuracy still leaves considerable room for improvement. Because vision Transformers outperform CNNs in image processing, this work improves the training paradigm of video Transformers under semi-supervised learning. First, pre-trained weights are used to initialize the network, which reduces the high training cost of the Transformer architecture. Second, logit standardization preprocessing is introduced to lift the constraint that forces student logits to match teacher logits. Finally, negative learning is incorporated: model performance is evaluated dynamically and negative pseudo-labels are assigned, so that examples with ambiguous predictions are used more fully. Experimental results on two widely used video action recognition datasets, UCF-101 and HMDB-51, show that the improved semi-supervised video Transformer achieves better recognition performance than traditional convolutional networks. Compared with the baseline model, the improved model gains 6.4 and 1.5 percentage points on UCF-101 at 1% and 10% label rates, respectively, and 5.2, 3.6, and 3.1 percentage points on HMDB-51 at 40%, 50%, and 60% label rates, respectively.

Key words: action recognition, semi-supervised learning (SSL), vision Transformer, logit standardization preprocessing, negative learning
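
Illustrative sketch: the abstract names two techniques, logit standardization preprocessing and negative pseudo-labels, without giving formulas. The Python below is a minimal sketch of how such components are commonly written, not the authors' implementation. All function names are hypothetical. The z-score form of standardization and the fixed number `k` of negative classes are assumptions; the paper assigns negative pseudo-labels dynamically according to model performance, which is not reproduced here.

```python
import math


def standardize_logits(logits, eps=1e-7):
    """Z-score a logit vector (assumed form of 'logit standardization')."""
    mean = sum(logits) / len(logits)
    var = sum((z - mean) ** 2 for z in logits) / len(logits)
    std = math.sqrt(var) + eps
    return [(z - mean) / std for z in logits]


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with an optional temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def negative_pseudo_labels(teacher_probs, k):
    """Return the k classes the teacher rates least likely.

    Assumption: a fixed k stands in for the paper's dynamic assignment.
    """
    order = sorted(range(len(teacher_probs)), key=lambda i: teacher_probs[i])
    return order[:k]


def negative_learning_loss(student_probs, negative_classes, eps=1e-7):
    """Average of -log(1 - p_c) over the negative classes c."""
    terms = [math.log(max(1.0 - student_probs[c], eps)) for c in negative_classes]
    return -sum(terms) / len(terms)


if __name__ == "__main__":
    # Teacher and student logits for one unlabeled clip (made-up numbers).
    teacher = softmax(standardize_logits([4.0, 2.5, 0.5, -1.0]))
    student = softmax(standardize_logits([1.2, 1.0, 0.4, 0.1]))
    negatives = negative_pseudo_labels(teacher, k=2)
    print(negatives, negative_learning_loss(student, negatives))
```

Under this assumed z-score form, logit vectors that differ only by a shift and a positive scale (for example `[1, 2, 3]` and `[10, 20, 30]`) standardize to the same values up to `eps`, so the student is compared with the teacher on the relative ordering and spacing of its logits rather than on their raw magnitude. The negative learning term only pushes probability away from classes judged unlikely, which lets an ambiguous prediction contribute to training even when no confident positive pseudo-label is available.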