[1] JAIN S D, XIONG B, GRAUMAN K. FusionSeg: learning to combine motion and appearance for fully automatic segmentation of generic objects in videos[C]//Proceedings of the 30th IEEE Conference on Computer Vision and Pattern Recognition, 2017: 3664-3673.
[2] XU Y S, FU T J, YANG H K, et al. Dynamic video segmentation network[C]//Proceedings of the 31st IEEE Conference on Computer Vision and Pattern Recognition, 2018: 6556-6565.
[3] ZHOU T, PORIKLI F, CRANDALL D J, et al. A survey on deep learning technique for video segmentation[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023, 45: 7099-7122.
[4] LIU S, QI L, QIN H F, et al. Path aggregation network for instance segmentation[C]//Proceedings of the 31st IEEE Conference on Computer Vision and Pattern Recognition, 2018: 8759-8768.
[5] YANG L J, FAN Y C, XU N. Video instance segmentation[C]//Proceedings of the 17th IEEE International Conference on Computer Vision, 2019: 5188-5197.
[6] OH S W, LEE J Y, XU N, et al. Video object segmentation using space-time memory networks[C]//Proceedings of the 17th IEEE International Conference on Computer Vision, 2019: 9226-9235.
[7] PERAZZI F, PONT-TUSET J, MCWILLIAMS B, et al. A benchmark dataset and evaluation methodology for video object segmentation[C]//Proceedings of the 29th IEEE Conference on Computer Vision and Pattern Recognition, 2016: 724-732.
[8] SIBECHI R, BOOIJ O, BAKA N, et al. Exploiting temporality for semi-supervised video segmentation[C]//Proceedings of the 17th IEEE International Conference on Computer Vision, 2019: 933-941.
[9] GAVRILYUK K, GHODRATI A, LI Z Y, et al. Actor and action video segmentation from a sentence[C]//Proceedings of the 31st IEEE Conference on Computer Vision and Pattern Recognition, 2018: 5958-5966.
[10] LI D Z, LI R Q, WANG L J, et al. You only infer once: cross-modal meta-transfer for referring video object segmentation[C]//Proceedings of the 36th AAAI Conference on Artificial Intelligence, 2022: 1297-1305.
[11] GUO M H, XU T X, LIU J J, et al. Attention mechanisms in computer vision: a survey[J]. Computational Visual Media, 2022, 8: 331-368.
[12] ZHU X Z, CHENG D Z, ZHANG Z, et al. An empirical study of spatial attention mechanisms in deep networks[C]//Proceedings of the 17th IEEE International Conference on Computer Vision, 2019: 6688-6697.
[13] BRAUWERS G, FRASINCAR F. A general survey on attention mechanisms in deep learning[J]. IEEE Transactions on Knowledge and Data Engineering, 2023, 35: 3279-3298.
[14] WANG H, DENG C, YAN J C, et al. Asymmetric cross-guided attention network for actor and action video segmentation from natural language query[C]//Proceedings of the 17th IEEE International Conference on Computer Vision, 2019: 3938-3947.
[15] 蔡腾, 陈慈发, 董方敏. 结合Transformer和动态特征融合的低照度目标检测[J]. 计算机工程与应用, 2024, 60(9): 135-141.
CAI T, CHEN C F, DONG F G. Low-light object detection combining transformer and dynamic feature fusion[J]. Computer Engineering and Applications, 2024, 60(9): 135-141.
[16] BOTACH A, ZHELTONOZHSKII E, BASKIN C. End-to-end referring video object segmentation with multimodal transformers[C]//Proceedings of the 35th IEEE Conference on Computer Vision and Pattern Recognition, 2022: 4985-4995.
[17] YANG Z, WANG J Q, TANG Y S, et al. LAVT: language-aware vision transformer for referring image segmentation[C]//Proceedings of the 35th IEEE Conference on Computer Vision and Pattern Recognition, 2022: 18155-18165.
[18] WANG Z Q, LU Y, LI Q, et al. CRIS: clip-driven referring image segmentation[C]//Proceedings of the 35th IEEE Conference on Computer Vision and Pattern Recognition, 2022: 11686-11695.
[19] 邱爽, 赵耀, 韦世奎. 图像指代分割研究综述[J]. 信号处理, 2022, 38(6): 1144-1154.
QIU S, ZHAO Y, WEI S K. A survey of referring image segmentation[J]. Journal of Signal Processing, 2022, 38(6): 1144-1154.
[20] WANG H, DENG C, YANG Y. Context modulated dynamic networks for actor and action video segmentation with language queries[C]//Proceedings of the 34th AAAI Conference on Artificial Intelligence, 2020: 12152-12159.
[21] SEO S, LEE J Y, HAN B. URVOS: unified referring video object segmentation network with a large-scale benchmark[C]//Proceedings of the 16th European Conference on Computer Vision, 2020: 208-223.
[22] HUI T R, HUANG S F, LIU S, et al. Collaborative spatial-temporal modeling for language-queried video actor segmentation[C]//Proceedings of the 34th IEEE Conference on Computer Vision and Pattern Recognition, 2021: 4187-4196.
[23] DING Z H, HUI T R, HUANG J C, et al. Language-bridged spatial-temporal interaction for referring video object segmentation[C]//Proceedings of the 35th IEEE Conference on Computer Vision and Pattern Recognition, 2022: 4964-4973.
[24] LIANG C, WANG W G, ZHOU T F, et al. Local-global context aware transformer for language-guided video segmentation[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023, 45(8): 10055-10069.
[25] YANG J, HUANG Y, NIU K, et al. Actor and action modular network for text-based video segmentation[J]. IEEE Transactions on Image Processing, 2022, 31: 4474-4489.
[26] ZHAO W B, WANG K, CHU X X, et al. Modeling motion with multimodal features for text-based video segmentation[C]//Proceedings of the 35th IEEE Conference on Computer Vision and Pattern Recognition, 2022: 11737-11746.
[27] WU D M, DONG X P, SHAO L, et al. Multi-level representation learning with semantic alignment for referring video object segmentation[C]//Proceedings of the 35th IEEE Conference on Computer Vision and Pattern Recognition, 2022: 4996-5005.
[28] YANG X, WANG H, XIE D, et al. Object-agnostic transformers for video referring segmentation[J]. IEEE Transactions on Image Processing, 2022, 31: 2839-2849.
[29] WU J N, JIANG Y, SUN P Z, et al. Language as queries for referring video object segmentation[C]//Proceedings of the 35th IEEE Conference on Computer Vision and Pattern Recognition, 2022: 4974-4984.
[30] MIAO B, BENNAMOUN M, GAO Y, et al. Spectrum-guided multi-granularity referring video object segmentation[C]//Proceedings of the 19th IEEE International Conference on Computer Vision, 2023: 920-930.
[31] WANG L M, XIONG Y J, WANG Z, et al. Temporal segment networks: towards good practices for deep action recognition[C]//Proceedings of the 14th European Conference on Computer Vision, 2016: 20-36.
[32] BRATTOLI B, TIGHE J, ZHDANOV F, et al. Rethinking zero-shot video classification: end-to-end training for realistic applications[C]//Proceedings of the 33rd IEEE Conference on Computer Vision and Pattern Recognition, 2020: 4613-4623.
[33] GAO J, ZHANG T, XU C. Learning to model relationships for zero-shot video classification[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2021, 43(10): 3476-3491.
[34] ZHAO Y, XIONG Y J, WANG L M, et al. Temporal action detection with structured segment networks[C]//Proceedings of the 16th IEEE International Conference on Computer Vision, 2017: 2914-2923.
[35] ZHAO C, DU D W, HOOGS A, et al. Open set action recognition via multi-label evidential learning[C]//Proceedings of the 36th IEEE Conference on Computer Vision and Pattern Recognition, 2023: 22982-22991.
[36] LIU N, NAN K, ZHAO W, et al. Learning complementary spatial-temporal transformer for video salient object detection[J]. IEEE Transactions on Neural Networks and Learning Systems, 2024, 35(8): 10663-10673.
[37] HU M, JIANG K, WANG Z, et al. CycMuNet+: cycle-projected mutual learning for spatial-temporal video super-resolution[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023, 45(11): 13376-13392.
[38] 王彩红, 沈燕飞, 王毅, 等. 基于时空上下文的手势跟踪与识别[J]. 计算机工程与应用, 2016, 52(9): 202-207.
WANG C H, SHEN Y F, WANG Y, et al. Gesture tracking and recognition based on spatio-temporal context[J]. Computer Engineering and Applications, 2016, 52(9): 202-207.
[39] TRAN D, BOURDEV L, FERGUS R, et al. Learning spatiotemporal features with 3D convolutional networks[C]//Proceedings of the 15th IEEE International Conference on Computer Vision, 2015: 4489-4497.
[40] CARREIRA J, ZISSERMAN A. Quo vadis, action recognition? a new model and the kinetics dataset[C]//Proceedings of the 31st IEEE Conference on Computer Vision and Pattern Recognition, 2017: 6299-6308.
[41] LIU W, ANGUELOV D, ERHAN D, et al. SSD: single shot multibox detector[C]//Proceedings of the 14th European Conference on Computer Vision, 2016: 21-37.
[42] REN S, HE K, GIRSHICK R, et al. Faster R-CNN: towards real-time object detection with region proposal networks[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017, 39(6): 1137-1149.
[43] HE K M, GKIOXARI G, DOLLAR P, et al. Mask R-CNN[C]//Proceedings of the 16th IEEE International Conference on Computer Vision, 2017: 2961-2969.
[44] DOSOVITSKIY A, FISCHER P, ILG E, et al. FlowNet: learning optical flow with convolutional networks[C]//Proceedings of the 15th IEEE International Conference on Computer Vision, 2015: 2758-2766.
[45] ILG E, MAYER N, SAIKIA T, et al. FlowNet 2.0: evolution of optical flow estimation with deep networks[C]//Proceedings of the 31st IEEE Conference on Computer Vision and Pattern Recognition, 2017: 2462-2470.
[46] VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need[C]//Proceedings of the 31st International Conference on Neural Information Processing Systems, 2017: 5998-6008.
[47] 杨文涛, 雷雨琦, 李星月, 等. 融合汉字输入法的BERT与BLCG的长文本分类研究[J]. 计算机工程与应用, 2024, 60(9): 196-202.
YANG W T, LEI Y Q, LI X Y, et al. Chinese long text classification model based on BERT fused Chinese input methods and BLCG[J]. Computer Engineering and Applications, 2024, 60(9): 196-202.
[48] LIU Z, NING J, CAO Y, et al. Video swin transformer[C]//Proceedings of the 35th IEEE Conference on Computer Vision and Pattern Recognition, 2022: 3202-3211.
[49] DAI Z H, YANG Z L, YANG Y M, et al. Transformer-XL: attentive language models beyond a fixed-length context[C]//Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019: 2978-2988.
[50] LIU Y, YU R, WANG J H, et al. Global spectral filter memory network for video object segmentation[C]//Proceedings of the 17th European Conference on Computer Vision, 2022: 648-665.
[51] KHOREVA A, ROHRBACH A, SCHIELE B. Video object segmentation with language referring expressions[C]//Proceedings of the 14th Asian Conference on Computer Vision, 2018: 123-141.
[52] XU C L, HSIEH S H, XIONG C M, et al. Can humans fly? action understanding with multiple classes of actors[C]//Proceedings of the 28th IEEE Conference on Computer Vision and Pattern Recognition, 2015: 2264-2273.
[53] JHUANG H, GALL J, ZUFFI S, et al. Towards understanding action recognition[C]//Proceedings of the 14th IEEE International Conference on Computer Vision, 2013: 3192-3199.
[54] LIN T Y, MAIRE M, BELONGIE S, et al. Microsoft COCO: common objects in context[C]//Proceedings of the 13th European Conference on Computer Vision, 2014: 740-755.