Multi-Temporal Scales Consensus for Weakly Supervised Temporal Action Localization

doi:10.3778/j.issn.1002-8331.2201-0233

Abstract

Because weakly supervised temporal action localization models are trained with only video-level labels as the supervision signal, they tend to identify the most discriminative segments of an action instance while also mistaking background segments related to the video-level label for actions, which makes it difficult to produce complete action proposals. To detect action segments more completely, this paper proposes a weakly supervised temporal action localization method based on multi-temporal-scale consensus, which analyzes the consistency of action segment labels across multiple temporal scales. First, RGB and optical-flow features are extracted from the input video frames, and a multi-temporal-scale module models the temporal relationships in the video using convolution kernels of different sizes. Second, a temporal class activation map is estimated from the features at each temporal scale, and the maps from the different branches are fused to obtain predicted action labels that reflect multi-temporal-scale consensus. Finally, to further refine the predicted action labels, an iterative optimization strategy updates them in each iteration so that they provide effective frame-level supervision signals for model training. Experiments on the THUMOS14 and ActivityNet1.3 datasets show that the proposed method outperforms existing weakly supervised temporal action localization methods.

Key words: weakly supervised, temporal action localization, multi-temporal scales, consensus
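
The multi-scale fusion step described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes a single action class, replaces the learned temporal convolutions with fixed moving-average kernels, fuses branches by simple averaging, and uses a hypothetical threshold of 0.5 to turn the fused activation into frame-level pseudo labels. The function names `temporal_conv` and `consensus_labels` and the kernel sizes are illustrative choices, not taken from the paper.

```python
from typing import List, Sequence, Tuple


def temporal_conv(scores: Sequence[float], kernel_size: int) -> List[float]:
    """Moving-average 1-D convolution that keeps the input length.

    Assumption: a fixed averaging kernel stands in for a learned temporal
    convolution. At the borders, only the valid part of the window is averaged.
    """
    half = kernel_size // 2
    out = []
    for t in range(len(scores)):
        window = scores[max(0, t - half): t + half + 1]
        out.append(sum(window) / len(window))
    return out


def consensus_labels(
    scores: Sequence[float],
    kernel_sizes: Tuple[int, ...] = (1, 3, 5),  # hypothetical temporal scales
    threshold: float = 0.5,                     # hypothetical cut-off
) -> Tuple[List[float], List[int]]:
    """Fuse per-scale activations and threshold them into pseudo labels."""
    # One activation sequence per temporal scale (one branch per kernel size).
    branches = [temporal_conv(scores, k) for k in kernel_sizes]
    # Fuse the branches by averaging their activations at each time step.
    fused = [sum(vals) / len(vals) for vals in zip(*branches)]
    # A snippet is labeled as action when the fused activation is high enough.
    labels = [1 if v >= threshold else 0 for v in fused]
    return fused, labels


# Toy per-snippet activations for one class, with a dip inside the action.
scores = [0.0, 0.1, 0.9, 0.4, 0.9, 0.8, 0.1, 0.0]
fused, labels = consensus_labels(scores)
print(labels)
```

In an iterative scheme like the one the abstract outlines, labels of this kind would be fed back as frame-level supervision in the next training round and recomputed after each round. The sketch covers only the fusion step, not the training loop.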

GUO Wenbin, YANG Xingming, JIANG Zheyuan, WU Kewei, XIE Zhao. Multi-Temporal Scales Consensus for Weakly Supervised Temporal Action Localization[J]. Computer Engineering and Applications, 2023, 59(10): 151-161.

郭文斌, 杨兴明, 蒋哲远, 吴克伟, 谢昭. 多时间尺度一致性的弱监督时序动作定位[J]. 计算机工程与应用, 2023, 59(10): 151-161.
