
Computer Engineering and Applications ›› 2025, Vol. 61 ›› Issue (23): 149-160. DOI: 10.3778/j.issn.1002-8331.2409-0007
• Pattern Recognition and Artificial Intelligence •
Negative Pseudo-Label Analysis for Semi-Supervised Action Recognition in Video Transformer
LUO Deyan, XU Yang, ZUO Fengyun, WANG Minggang
Online: 2025-12-01
Published: 2025-12-01
LUO Deyan, XU Yang, ZUO Fengyun, WANG Minggang. Negative Pseudo-Label Analysis for Semi-Supervised Action Recognition in Video Transformer[J]. Computer Engineering and Applications, 2025, 61(23): 149-160.
URL: http://cea.ceaj.org/EN/10.3778/j.issn.1002-8331.2409-0007
References
[1] HAMAD A R, WOO W L, WEI B, et al. Overview of human activity recognition using sensor data[C]//Advances in Computational Intelligence Systems. Cham: Springer Nature Switzerland, 2024: 380-391.
[2] MENG Z Z, ZHANG M X, GUO C X, et al. Recent progress in sensing and computing techniques for human activity recognition and motion analysis[J]. Electronics, 2020, 9(9): 1357.
[3] CHENG G, WAN Y, SAUDAGAR A N, et al. Advances in human action recognition: a survey[J]. arXiv:1501.05964, 2015.
[4] JING L L, PARAG T, WU Z, et al. VideoSSL: semi-supervised learning for video classification[C]//Proceedings of the IEEE Winter Conference on Applications of Computer Vision. Piscataway: IEEE, 2021: 1109-1118.
[5] SHI B F, DAI Q, HOFFMAN J, et al. Temporal action detection with multi-level supervision[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision. Piscataway: IEEE, 2021: 8002-8012.
[6] SHI B F, DAI Q, MU Y D, et al. Weakly-supervised action localization by generative attention modeling[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2020: 1006-1016.
[7] SINGH A, CHAKRABORTY O, VARSHNEY A, et al. Semi-supervised action recognition with temporal contrastive learning[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2021: 10384-10394.
[8] BEAUCHEMIN S S, BARRON J L. The computation of optical flow[J]. ACM Computing Surveys, 1995, 27(3): 433-466.
[9] XIAO J F, JING L L, ZHANG L, et al. Learning from temporal gradient for semi-supervised action recognition[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2022: 3242-3252.
[10] GAO G Y, LIU Z M, ZHANG G J, et al. DANet: semi-supervised differentiated auxiliaries guided network for video action recognition[J]. Neural Networks, 2023, 158: 121-131.
[11] XU Y H, WEI F Y, SUN X, et al. Cross-model pseudo-labeling for semi-supervised action recognition[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2022: 2949-2958.
[12] DASS S D S, BARUA H B, KRISHNASAMY G, et al. ActNetFormer: Transformer-ResNet hybrid method for semi-supervised action recognition in videos[J]. arXiv:2404.06243, 2024.
[13] LIU Z, NING J, CAO Y, et al. Video Swin Transformer[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2022: 3192-3201.
[14] BERTASIUS G, WANG H, TORRESANI L. Is space-time attention all you need for video understanding?[J]. arXiv:2102.05095, 2021.
[15] ARNAB A, DEHGHANI M, HEIGOLD G, et al. ViViT: a video vision Transformer[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision. Piscataway: IEEE, 2021: 6816-6826.
[16] FEICHTENHOFER C. X3D: expanding architectures for efficient video recognition[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2020: 200-210.
[17] FEICHTENHOFER C, FAN H Q, MALIK J, et al. SlowFast networks for video recognition[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision. Piscataway: IEEE, 2019: 6201-6210.
[18] HARA K, KATAOKA H, SATOH Y. Learning spatio-temporal features with 3D residual networks for action recognition[C]//Proceedings of the IEEE International Conference on Computer Vision Workshops. Piscataway: IEEE, 2017: 3154-3160.
[19] XU Y, ZHANG Q, ZHANG J, et al. ViTAE: vision Transformer advanced by exploring intrinsic inductive bias[C]//Advances in Neural Information Processing Systems, 2021: 28522-28535.
[20] WENG Z, YANG X, LI A, et al. Semi-supervised vision Transformers[C]//Proceedings of the European Conference on Computer Vision. Cham: Springer Nature Switzerland, 2022: 605-620.
[21] SOHN K, BERTHELOT D, LI C L, et al. FixMatch: simplifying semi-supervised learning with consistency and confidence[J]. arXiv:2001.07685, 2020.
[22] DOSOVITSKIY A, BEYER L, KOLESNIKOV A, et al. An image is worth 16×16 words: Transformers for image recognition at scale[J]. arXiv:2010.11929, 2020.
[23] XING Z, DAI Q, HU H, et al. SVFormer: semi-supervised video Transformer for action recognition[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2023: 18816-18826.
[24] DENG J, DONG W, SOCHER R, et al. ImageNet: a large-scale hierarchical image database[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2009: 248-255.
[25] TARVAINEN A, VALPOLA H. Mean teachers are better role models: weight-averaged consistency targets improve semi-supervised deep learning results[C]//Proceedings of the 31st Conference on Neural Information Processing Systems, 2017.
[26] KIM T, OH J, KIM N Y, et al. Comparing Kullback-Leibler divergence and mean squared error loss in knowledge distillation[C]//Proceedings of the 30th International Joint Conference on Artificial Intelligence, 2021: 2628-2635.
[27] HE C Y, ANNAVARAM M, AVESTIMEHR S. Group knowledge transfer: federated learning of large CNNs at the edge[J]. arXiv:2007.14513, 2020.
[28] SUN S, REN W, LI J, et al. Logit standardization in knowledge distillation[J]. arXiv:2403.01427, 2024.
[29] KIM Y, YIM J, YUN J, et al. NLNL: negative learning for noisy labels[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision. Piscataway: IEEE, 2019: 101-110.
[30] CHEN Y H, TAN X, ZHAO B R, et al. Boosting semi-supervised learning by exploiting all unlabeled data[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2023: 7548-7557.
[31] CUBUK E D, ZOPH B, MANÉ D, et al. AutoAugment: learning augmentation strategies from data[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2019: 113-123.
[32] BERTHELOT D, CARLINI N, GOODFELLOW I, et al. MixMatch: a holistic approach to semi-supervised learning[C]//Proceedings of the 33rd International Conference on Neural Information Processing Systems, 2019: 5049-5059.
[33] VERMA V, KAWAGUCHI K, LAMB A, et al. Interpolation consistency training for semi-supervised learning[J]. Neural Networks, 2022, 145: 90-106.
[34] FRENCH G, OLIVER A, SALIMANS T. Milking CowMask for semi-supervised image classification[J]. arXiv:2003.12022, 2020.
[35] XIONG B, FAN H Q, GRAUMAN K, et al. Multiview pseudo-labeling for semi-supervised learning from video[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision. Piscataway: IEEE, 2021: 7189-7199.
[36] ZOU Y L, CHOI J, WANG Q T, et al. Learning representational invariances for data-efficient action recognition[J]. Computer Vision and Image Understanding, 2023, 227: 103597.
[37] ASSEFA M, JIANG W, ALEMU K G, et al. Actor-aware self-supervised learning for semi-supervised video representation learning[J]. IEEE Transactions on Circuits and Systems for Video Technology, 2023, 33(11): 6679-6692.
[38] DAVE I R, RIZVE M N, CHEN C, et al. TimeBalance: temporally-invariant and temporally-distinctive video representations for semi-supervised action recognition[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2023: 2341-2352.
[39] LI Y H, MAO H Z, GIRSHICK R, et al. Exploring plain vision Transformer backbones for object detection[C]//Proceedings of the European Conference on Computer Vision. Cham: Springer Nature Switzerland, 2022: 280-296.
[40] XIA C, WANG X, LV F, et al. ViT-CoMer: vision Transformer with convolutional multi-scale feature interaction for dense predictions[J]. arXiv:2403.07392, 2024.
[41] ZHANG C D, XU Y, MO H, et al. Semantic adjustment style transfer network with multi-attention mechanism[J]. Computer Engineering and Applications, 2025, 61(8): 204-214.
[42] ZHANG Y Y, LI X Y, LIU C H, et al. VidTr: video Transformer without convolutions[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision. Piscataway: IEEE, 2021: 13557-13567.
[43] NEIMARK D, BAR O, ZOHAR M, et al. Video Transformer network[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops. Piscataway: IEEE, 2021: 3156-3165.
[44] FAN H Q, XIONG B, MANGALAM K, et al. Multiscale vision Transformers[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision. Piscataway: IEEE, 2021: 6804-6815.
[45] LI K, WANG Y, PENG G, et al. UniFormer: unified Transformer for efficient spatial-temporal representation learning[C]//Proceedings of the International Conference on Learning Representations, 2021.
[46] WANG R, WU Z, CHEN D, et al. Video Mobile-Former: video recognition with efficient global spatial-temporal modeling[J]. arXiv:2208.12257, 2022.
[47] HINTON G, VINYALS O, DEAN J. Distilling the knowledge in a neural network[J]. arXiv:1503.02531, 2015.
[48] GOU J P, YU B S, MAYBANK S J, et al. Knowledge distillation: a survey[J]. International Journal of Computer Vision, 2021, 129(6): 1789-1819.
[49] CHANDRASEGARAN K, TRAN N T, ZHAO Y, et al. Revisiting label smoothing and knowledge distillation compatibility: what was missing?[C]//Proceedings of the International Conference on Machine Learning, 2022: 2890-2916.
[50] LIU J, LIU B, LI H, et al. Meta knowledge distillation[J]. arXiv:2202.07940, 2022.
[51] GUO J, CHEN M H, HU Y, et al. Reducing the teacher-student gap via spherical knowledge distillation[J]. arXiv:2010.07485, 2020.
[52] RIZVE M N, DUARTE K, RAWAT Y S, et al. In defense of pseudo-labeling: an uncertainty-aware pseudo-label selection framework for semi-supervised learning[J]. arXiv:2101.06329, 2021.
[53] CHEN J, SHAH V, KYRILLIDIS A. Negative sampling in semi-supervised learning[C]//Proceedings of the International Conference on Machine Learning, 2020: 1704-1714.
[54] SRIVASTAVA N, HINTON G, KRIZHEVSKY A, et al. Dropout: a simple way to prevent neural networks from overfitting[J]. The Journal of Machine Learning Research, 2014, 15(1): 1929-1958.
[55] GRILL J B, STRUB F, ALTCHÉ F, et al. Bootstrap your own latent: a new approach to self-supervised learning[C]//Advances in Neural Information Processing Systems, 2020: 21271-21284.
[56] ZHANG H, CISSE M, DAUPHIN Y N, et al. Mixup: beyond empirical risk minimization[J]. arXiv:1710.09412, 2017.
[57] YUN S, HAN D, CHUN S, et al. CutMix: regularization strategy to train strong classifiers with localizable features[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision. Piscataway: IEEE, 2019: 6022-6031.
[58] SOOMRO K, ZAMIR A R, SHAH M. UCF101: a dataset of 101 human actions classes from videos in the wild[J]. arXiv:1212.0402, 2012.
[59] KUEHNE H, JHUANG H, STIEFELHAGEN R, et al. HMDB51: a large video database for human motion recognition[C]//Proceedings of the International Conference on Computer Vision. Piscataway: IEEE, 2013: 571-582.
[60] TOUVRON H, CORD M, DOUZE M, et al. Training data-efficient image Transformers & distillation through attention[J]. arXiv:2012.12877, 2020.
[61] GOWDA S N, ROHRBACH M, KELLER F, et al. Learn2Augment: learning to composite videos for data augmentation in action recognition[C]//Proceedings of the European Conference on Computer Vision. Cham: Springer Nature Switzerland, 2022: 242-259.
[62] TONG A Y, TANG C, WANG W J. Semi-supervised action recognition from temporal augmentation using curriculum learning[J]. IEEE Transactions on Circuits and Systems for Video Technology, 2023, 33(3): 1305-1319.
[63] IQBAL O, CHAKRABORTY O, HUSSAIN A, et al. SITAR: semi-supervised image Transformer for action recognition[J]. arXiv:2409.02910, 2024.
Related Articles
[1] LU Shaotong, WANG Chuanxu. Hybrid Multi-Channel Associated Learning and Two-Branch Attention Fusion for Action Recognition[J]. Computer Engineering and Applications, 2025, 61(8): 145-154.
[2] LIANG Chengwu, HU Wei, YANG Jie, JIANG Songqi, HOU Ning. Fusion of Spatio-Temporal Domain Knowledge and Data-Driven for Skeleton-Based Action Recognition[J]. Computer Engineering and Applications, 2025, 61(5): 165-176.
[3] WANG Qi, HE Ning. Skeleton Action Recognition by Integrating Intrinsic Topology and Multi-Scale Time Features[J]. Computer Engineering and Applications, 2025, 61(4): 150-157.
[4] JIANG Yuehan, CHEN Junjie, LI Hongjun. Review of Human Action Recognition Based on Skeletal Graph Neural Networks[J]. Computer Engineering and Applications, 2025, 61(3): 34-47.
[5] WEN Shixiong, ZHI Min. Survey of Vision Transformers for Fine-Grained Image Classification[J]. Computer Engineering and Applications, 2025, 61(23): 24-37.
[6] LI Zhijun, CHEN Qiulian. School of Computer and Electronic Information, Guangxi University, Nanning 530004, China[J]. Computer Engineering and Applications, 2025, 61(23): 351-359.
[7] CHEN Xingqi, SONG Tao, ZOU Yangyang. Feature Refinement Skeletal Action Recognition Method Based on GCN and CNN Fusion[J]. Computer Engineering and Applications, 2025, 61(22): 226-234.
[8] BAI Tian, GAO Yuehong, XIE Zhengguang, LI Hongjun. Multimodal Cross-View Contrastive Memory-Augmented Network for Self-Supervised Skeleton-Based Action Recognition[J]. Computer Engineering and Applications, 2025, 61(21): 225-233.
[9] KANG Yu, HAO Xiaoli. Fine-Grained Visual Classification Method for Combined Discriminative Region Features[J]. Computer Engineering and Applications, 2025, 61(2): 227-233.
[10] WANG Cailing, YAN Jingjing, ZHANG Zhidong. Review on Human Action Recognition Methods Based on Multimodal Data[J]. Computer Engineering and Applications, 2024, 60(9): 1-18.
[11] BIAN Cunling, LYU Weigang, FENG Wei. Skeleton-Based Human Action Recognition: History, Status and Prospects[J]. Computer Engineering and Applications, 2024, 60(20): 1-29.
[12] ZHANG Hengwei, XU Linsen, CHEN Gen, WANG Zhihuan, SUI Xiang. Upper Limb Action Recognition Based on Transfer Learning and sEMG[J]. Computer Engineering and Applications, 2024, 60(20): 124-132.
[13] NAN Yahui, HUA Qingyi. Local and Global View Occlusion Facial Expression Recognition Method[J]. Computer Engineering and Applications, 2024, 60(13): 180-189.
[14] SUN Lulu, LIU Jianping, WANG Jian, XING Jialu, ZHANG Yue, WANG Chenyang. Survey of Vision Transformer in Fine-Grained Image Classification[J]. Computer Engineering and Applications, 2024, 60(10): 30-46.
[15] LUO Huilan, CHEN Han. Spatial-Temporal Convolutional Attention Network for Action Recognition[J]. Computer Engineering and Applications, 2023, 59(9): 150-158.