[1] JIANG M, HUANG S S, DUAN J Y, et al. SALICON: saliency in context[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2015: 1072-1080.
[2] LU T R, YU F Q, YANG H Z, et al. Human action recognition based on dense trajectories with saliency detection[J]. Computer Engineering and Applications, 2018, 54(14): 163-167.
[3] PAN P X, PAN Z L. Active contour image segmentation combined with saliency[J]. Computer Engineering and Applications, 2021, 57(8): 225-230.
[4] GAO D S, MAHADEVAN V, VASCONCELOS N. The discriminant center-surround hypothesis for bottom-up saliency[C]//Advances in Neural Information Processing Systems, 2008: 497-504.
[5] GUO C L, MA Q, ZHANG L M. Spatio-temporal saliency detection using phase spectrum of quaternion Fourier transform[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2008: 1-8.
[6] RAHTU E, KANNALA J, SALO M, et al. Segmenting salient objects from images and videos[C]//Proceedings of the European Conference on Computer Vision. Berlin, Heidelberg: Springer, 2010: 366-379.
[7] WANG W, SHEN J, XIE J, et al. Revisiting video saliency prediction in the deep learning era[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2021, 43(1): 220-237.
[8] JIANG L, XU M, LIU T, et al. DeepVS: a deep learning based video saliency prediction approach[C]//Proceedings of the European Conference on Computer Vision. Cham: Springer International Publishing, 2018: 625-642.
[9] BAK C, KOCAK A, ERDEM E, et al. Spatio-temporal saliency networks for dynamic saliency prediction[J]. IEEE Transactions on Multimedia, 2018, 20(7): 1688-1698.
[10] JIANG L, XU M, WANG Z L. Predicting video saliency with object-to-motion CNN and two-layer convolutional LSTM[J]. arXiv:1709.06316, 2017.
[11] WANG W G, SHEN J B, GUO F, et al. Revisiting video saliency: a large-scale benchmark and a new model[J]. arXiv:1801.07424, 2018.
[12] MIN K, CORSO J. TASED-Net: temporally-aggregating spatial encoder-decoder network for video saliency detection[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision. Piscataway: IEEE, 2019: 2394-2403.
[13] BELLITTO G, SALANITRI P F, PALAZZO S, et al. Hierarchical domain-adapted feature learning for video saliency prediction[J]. International Journal of Computer Vision, 2021, 129(12): 3216-3232.
[14] HE K M, ZHANG X Y, REN S Q, et al. Deep residual learning for image recognition[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2016: 770-778.
[15] HOWARD A G, ZHU M L, CHEN B, et al. MobileNets: efficient convolutional neural networks for mobile vision applications[J]. arXiv:1704.04861, 2017.
[16] MA N N, ZHANG X Y, ZHENG H T, et al. ShuffleNet V2: practical guidelines for efficient CNN architecture design[C]//Proceedings of the European Conference on Computer Vision. Cham: Springer International Publishing, 2018: 122-138.
[17] TAN M X, LE Q V. EfficientNet: rethinking model scaling for convolutional neural networks[J]. arXiv:1905.11946, 2019.
[18] HE K M, GKIOXARI G, DOLLAR P, et al. Mask R-CNN[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020, 42(2): 386-397.
[19] CHANG Z, DUAN X H, LU W C, et al. Multi-scale saliency detection based on Bayesian framework[J]. Computer Engineering and Applications, 2020, 56(11): 207-213.
[20] CHANG Q Y, ZHU S P. Human vision attention mechanism-inspired temporal-spatial feature pyramid for video saliency detection[J]. Cognitive Computation, 2023, 15(3): 856-868.
[21] CHEN Y P, FAN H Q, XU B, et al. Drop an octave: reducing spatial redundancy in convolutional neural networks with octave convolution[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision. Piscataway: IEEE, 2019: 3434-3443.
[22] MATHE S, SMINCHISESCU C. Actions in the eye: dynamic gaze datasets and learnt saliency models for visual recognition[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2015, 37(7): 1408-1424.
[23] RODRIGUEZ M D, AHMED J, SHAH M. Action MACH: a spatio-temporal maximum average correlation height filter for action recognition[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2008: 1-8.
[24] MARSZALEK M, LAPTEV I, SCHMID C. Actions in context[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2009: 2929-2936.
[25] WANG W G, SHEN J B. Deep visual attention prediction[J]. IEEE Transactions on Image Processing, 2018, 27(5): 2368-2378.
[26] LINARDOS P, et al. Simple vs complex temporal recurrences for video saliency prediction[J]. arXiv:1907.01869, 2019.