Computer Engineering and Applications, 2025, Vol. 61, Issue (2): 73-83. DOI: 10.3778/j.issn.1002-8331.2405-0343

• Hot Topics and Reviews •

A Survey of Referring Video Object Segmentation Methods

魏彩颖,贾磊   

  1. 硅湖职业技术学院 计算机科学与技术学院,江苏 苏州 215332
  • Publication date: 2025-01-15 Release date: 2025-01-15

Methods for Referring Video Object Segmentation

WEI Caiying, JIA Lei   

  1. School of Computer Science and Technology, Silicon Lake College, Suzhou, Jiangsu 215332, China
  • Online: 2025-01-15 Published: 2025-01-15

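Abstract: Referring video object segmentation is an active research task at the intersection of computer vision and natural language processing. Its goal is to segment the entities in a given video that a text description refers to by understanding the semantics of that text. Unlike conventional visual segmentation tasks, which require the categories of the objects to be segmented to be defined in advance, this task does not depend on predefined object categories; instead, it locates and segments the target by understanding the given descriptive sentence. Because the content of the text description is arbitrary and no pre-segmented video frame is available as a reference, the task is highly challenging. Although it is an emerging cross-media understanding task, it has strong application prospects in areas such as security surveillance, vehicle tracking, and person re-identification, and a considerable number of well-performing methods have already been proposed. Because a survey of referring video object segmentation methods has been lacking, this paper systematically reviews and analyzes the existing methods. Specifically, according to their research approaches, the methods are roughly divided into four categories: referring video object segmentation based on dynamic convolution, on attention mechanisms, on multi-level information learning, and on end-to-end sequence prediction. The performance of each category and of the specific methods within it is analyzed quantitatively and qualitatively. Finally, the shortcomings of existing work and possible directions for future improvement are summarized.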

Keywords: cross-modal retrieval, referring video object segmentation, cross-modal understanding

Abstract: Referring video object segmentation (RVOS) is an active research topic among cross-media tasks spanning video and language. It aims to segment the entities in a given video that a textual description refers to. Unlike conventional visual segmentation tasks, which depend on pre-defined classes, RVOS must understand the given expression to locate and segment the referred entities without the help of pre-defined classes. Because the textual expressions are arbitrary and no pixel-wise masks are available as a reference, RVOS is more challenging than conventional video segmentation. Although RVOS is a new task in cross-modal understanding, it has important application prospects in many areas (e.g., security monitoring, vehicle tracking, and person re-identification), and an increasing number of well-performing methods are being proposed. Because a survey of RVOS methods has been lacking, this paper systematically reviews and analyzes the existing methods. Specifically, the solutions are roughly divided into four categories according to their research approaches: dynamic convolution based, attention based, multi-level information learning based, and end-to-end sequence prediction based methods. Qualitative and quantitative performance comparisons are then presented for analysis. Lastly, the paper summarizes several issues in current methods and proposes suggestions for further improving RVOS performance in future work.

Key words: cross-modal search, referring video object segmentation, cross-modal understanding