Review of Video Salient Object Detection Based on Deep Neural Networks

doi:10.3778/j.issn.1002-8331.2405-0035

Abstract

Abstract: Video salient object detection is a widely studied research direction in computer vision that aims to locate and segment the most salient objects or regions in a video. Existing methods mainly construct deep neural networks to extract spatiotemporal features from dynamic video sequences for saliency prediction. This paper presents a comprehensive review of video salient object detection methods based on deep learning. First, the basic concepts and application scenarios of video salient object detection are described. Second, the deep-learning-based methods are classified, then analyzed and discussed in depth by category. Third, the authoritative benchmark datasets and evaluation metrics of the field are introduced, and state-of-the-art models are compared quantitatively and qualitatively on these benchmark datasets. Finally, the challenges facing video salient object detection are summarized and its future development directions are discussed.

Key words: video salient object detection, spatiotemporal features, deep learning

YANG Chengbang, WANG Anzhi, REN Chunhong, TANG Jieliang. Review of Video Salient Object Detection Based on Deep Neural Networks[J]. Computer Engineering and Applications, 2024, 60(19): 68-79.
